
ai_prep_search function

Applies to: Databricks SQL, Databricks Runtime

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

The ai_prep_search() function transforms the structured output of ai_parse_document into a format optimized for RAG vector search and information retrieval systems. For each input document, the function splits the content into semantic chunks, enriches each chunk with document-level context such as the document title, section headers, and page references, and produces an embedding-ready representation.

Requirements

  • Databricks Runtime 18.2 or above.
  • If you are using Serverless compute, the following is also required:
    • The serverless environment version must be set to 3 or above, as this enables features like VARIANT.
    • You must use either Python or SQL. For additional serverless features and limitations, see Serverless compute limitations.
  • The ai_prep_search function is available using Databricks notebooks, SQL editor, Databricks workflows, jobs, or Lakeflow Spark Declarative Pipelines.

Syntax

ai_prep_search(parsed [, options])

Arguments

  • parsed: A VARIANT expression representing the structured output of ai_parse_document.
  • options: An optional MAP<STRING, STRING>. Supported keys:
    • 'version': The version of the output schema to use.

Returns

A VARIANT containing document chunks formatted for vector search indexing. Each row in the output represents one input document.

The output schema is:

JSON
{
  "document": {
    "contents": [
      {
        "chunk_id": STRING, // Unique identifier composed of the document ID and chunk position
        "chunk_position": INT, // 0-based position of the chunk within the document
        "chunk_to_retrieve": STRING, // Raw text content of the chunk
        "chunk_to_embed": STRING, // Context-enriched text prepared for embedding; see chunk_to_embed format
        "pages": [
          {
            "page_id": INT, // Page index that this chunk appears on
            "image_uri": STRING // Path to the page image for multi-modal retrieval
          }
        ]
      }
    ],
    "pages": [
      {
        "id": INT, // 0-based page index
        "image_uri": STRING // Path to the rendered page image, populated when
                            // imageOutputPath is set in ai_parse_document
      }
    ],
    "source_uri": STRING // Source document URI
  },
  "error_status": {...}
}
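
As a rough illustration of how this schema can be traversed outside SQL, the following Python sketch walks a result that has already been converted to a plain dict and produces one row per chunk. The sample values, the chunk_id format, and the helper name flatten_chunks are all hypothetical; the error_status field is left out of the sample because its shape is not specified here.

```python
from typing import Any

# Hypothetical sample shaped like the documented schema. All values,
# including the chunk_id format, are made up for illustration.
sample_result: dict[str, Any] = {
    "document": {
        "contents": [
            {
                "chunk_id": "doc-1_0",
                "chunk_position": 0,
                "chunk_to_retrieve": "First chunk text.",
                "chunk_to_embed": "Document Title: Example\n\nContent:\nFirst chunk text.",
                "pages": [{"page_id": 0, "image_uri": None}],
            },
            {
                "chunk_id": "doc-1_1",
                "chunk_position": 1,
                "chunk_to_retrieve": "Second chunk text.",
                "chunk_to_embed": "Document Title: Example\n\nContent:\nSecond chunk text.",
                "pages": [
                    {"page_id": 0, "image_uri": None},
                    {"page_id": 1, "image_uri": None},
                ],
            },
        ],
        "pages": [{"id": 0, "image_uri": None}, {"id": 1, "image_uri": None}],
        "source_uri": "/Volumes/mydata/documents/example.pdf",
    }
}


def flatten_chunks(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flat row per entry in document.contents."""
    document = result.get("document") or {}
    rows = []
    for chunk in document.get("contents") or []:
        rows.append(
            {
                "chunk_id": chunk["chunk_id"],
                "chunk_position": chunk["chunk_position"],
                "chunk_to_retrieve": chunk["chunk_to_retrieve"],
                "chunk_to_embed": chunk["chunk_to_embed"],
                # Page indexes the chunk appears on, in listed order.
                "page_ids": [p["page_id"] for p in chunk.get("pages") or []],
                "source_uri": document.get("source_uri"),
            }
        )
    return rows


rows = flatten_chunks(sample_result)
```

A result with no document or no contents yields an empty list, so the helper can be applied to every row without special-casing failed documents.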
Important

The function output schema is versioned using a major.minor format. Databricks might upgrade the supported or default version to reflect improved representations based on ongoing research.

  • Minor version upgrades are backward-compatible and might only introduce new fields.
  • Major version upgrades might include breaking changes such as field additions, removals, or renamings.
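
Downstream code that depends on a particular schema version could guard against incompatible upgrades with a check along these lines. This is a minimal sketch that assumes only the major.minor format described above; the version strings in the usage lines and the helper names are hypothetical, and how you obtain the version of a given output is outside the sketch.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string into integer parts."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(produced: str, expected: str) -> bool:
    """Return True when output at version `produced` should be readable
    by code written against version `expected`.

    Same major version and an equal or newer minor version is treated as
    compatible, because minor upgrades only introduce new fields.
    """
    produced_major, produced_minor = parse_version(produced)
    expected_major, expected_minor = parse_version(expected)
    return produced_major == expected_major and produced_minor >= expected_minor


# Hypothetical version strings, for illustration only.
newer_minor_ok = is_compatible("1.2", "1.0")
new_major_ok = is_compatible("2.0", "1.0")
older_minor_ok = is_compatible("1.0", "1.2")
```

In this sketch a newer minor version passes, while a different major version or an older minor version is rejected so the caller can fail early instead of reading missing or renamed fields.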

chunk_to_embed format

The chunk_to_embed field is a single string built per chunk by combining the raw chunk text with document-level context to improve retrieval quality during semantic search.

The string is composed of the following parts:

  • Document metadata: Document Title, Page Header, Page Footer, Section Header, Caption, Footnote, and Page Number, extracted directly from the parsed document structure.
  • LLM-discovered document fields: additional Key: Value lines for document-level fields auto-discovered by an LLM, such as "Company", "Document Type", "Fiscal Year", "Patient ID", or "Contract Number". The field names are chosen per document by the model and vary across documents.
  • Document context sentence: a single sentence summarizing what the document is about, generated by an LLM.
  • Table summary: a short LLM-generated paraphrase of the table contents, included only for chunks that contain a table.
  • Content: the raw chunk text, identical to the chunk's chunk_to_retrieve field.
  • Related questions: natural-language questions that the table can answer, included only for chunks that contain a table to improve retrieval recall for table content.

The string follows this template:

Document Title: {doc_title}
Page Header: {page_header}
Page Footer: {page_footer}
Section Header: {section_header}
Caption: {caption}
Footnote: {footnote}
Page Number: {page_number}
{additional_llm_discovered_fields}

{document_context_sentence}

Table summary: {table_summary}

Content:
{chunk_to_retrieve}

Related questions:
{qa_text}

Example rendered chunk_to_embed

Document Title: Acme Corp 2024 Annual Report
Page Header:
Page Footer:
Section Header: Risk Factors
Caption:
Footnote:
Page Number: 14
Company: Acme Corp
Document Type: 10-K
Fiscal Year: 2024

Acme Corp's 2024 annual report covering financial performance and risk disclosures across global operating segments.

Content:
Our business faces a number of risks, including competition from established providers, evolving regulatory requirements, and concentration in a small number of large customers.
Note

The fixed metadata fields are always rendered with their label, with the value left empty when not available. The LLM-discovered fields, document context sentence, table summary, and related questions are omitted entirely when not available or not applicable. The exact composition might be updated in future versions to improve retrieval quality.
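
To make the template and the omission rules concrete, the following Python sketch rebuilds the rendered example above from its parts. It is an illustrative reconstruction, not the function's implementation: the helper name render_chunk_to_embed, the one-question-per-line layout for related questions, and the exact whitespace handling (no trailing space after an empty label, one blank line between sections) are assumptions.

```python
from typing import Optional

# Fixed metadata labels, in template order. These are always rendered.
FIXED_LABELS = [
    "Document Title",
    "Page Header",
    "Page Footer",
    "Section Header",
    "Caption",
    "Footnote",
    "Page Number",
]


def render_chunk_to_embed(
    chunk_text: str,
    metadata: dict,
    discovered_fields: Optional[dict] = None,
    context_sentence: Optional[str] = None,
    table_summary: Optional[str] = None,
    related_questions: Optional[list] = None,
) -> str:
    header_lines = []
    for label in FIXED_LABELS:
        value = metadata.get(label)
        text = "" if value is None else str(value)
        # Keep the label even when the value is missing (assumed: no trailing space).
        header_lines.append(f"{label}: {text}".rstrip())
    # LLM-discovered fields are extra Key: Value lines, omitted when absent.
    for key, value in (discovered_fields or {}).items():
        header_lines.append(f"{key}: {value}")

    sections = ["\n".join(header_lines)]
    if context_sentence:
        sections.append(context_sentence)
    if table_summary:
        sections.append(f"Table summary: {table_summary}")
    sections.append(f"Content:\n{chunk_text}")
    if related_questions:
        # Assumed layout: one question per line.
        sections.append("Related questions:\n" + "\n".join(related_questions))
    return "\n\n".join(sections)


rendered = render_chunk_to_embed(
    chunk_text=(
        "Our business faces a number of risks, including competition from "
        "established providers, evolving regulatory requirements, and "
        "concentration in a small number of large customers."
    ),
    metadata={
        "Document Title": "Acme Corp 2024 Annual Report",
        "Section Header": "Risk Factors",
        "Page Number": 14,
    },
    discovered_fields={
        "Company": "Acme Corp",
        "Document Type": "10-K",
        "Fiscal Year": "2024",
    },
    context_sentence=(
        "Acme Corp's 2024 annual report covering financial performance and "
        "risk disclosures across global operating segments."
    ),
)
```

Because the example chunk contains no table, the table summary and related questions are not passed and do not appear in the rendered string.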

Examples

Chain with ai_parse_document

The following example chains ai_prep_search with ai_parse_document to produce search-ready chunks from raw documents stored in a Unity Catalog volume:

SQL
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
)
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents;

Build a vector search source table

The following example flattens the output into individual chunk rows and writes them to a Delta table. The table can then be used as a source for a Databricks AI Search index, using chunk_to_embed as the embedding column and chunk_id as the primary key.

SQL
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT
    path,
    ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_position::INT AS chunk_position,
  chunk.value:chunk_to_retrieve::STRING AS chunk_to_retrieve,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  prepped_documents.path AS source_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;

The resulting rows have the following schema:

Column name          Type
chunk_id             STRING
chunk_position       INT
chunk_to_retrieve    STRING
chunk_to_embed       STRING
source_uri           STRING
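
Because chunk_id serves as the primary key and chunk_to_embed as the embedding column, it can be worth sanity-checking the flattened rows before building an index. The following Python sketch shows one possible check on rows that have been collected into plain dicts; the helper name find_indexing_problems and the sample rows are hypothetical.

```python
from collections import Counter


def find_indexing_problems(rows: list) -> list:
    """Report duplicate primary keys and empty embedding text."""
    problems = []
    # A primary key must be unique across all rows.
    counts = Counter(row["chunk_id"] for row in rows)
    for chunk_id, count in counts.items():
        if count > 1:
            problems.append(f"duplicate chunk_id: {chunk_id} ({count} rows)")
    # Rows without embedding text would add nothing useful to the index.
    for row in rows:
        if not (row.get("chunk_to_embed") or "").strip():
            problems.append(f"empty chunk_to_embed: {row['chunk_id']}")
    return problems


# Hypothetical rows with the columns listed above (values are made up).
sample_rows = [
    {"chunk_id": "a_0", "chunk_to_embed": "Content:\nfirst"},
    {"chunk_id": "a_1", "chunk_to_embed": "   "},
    {"chunk_id": "a_0", "chunk_to_embed": "Content:\nrepeat"},
]
problems = find_indexing_problems(sample_rows)
```

An empty list means the rows passed both checks.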

Enable multi-modal retrieval

When ai_parse_document is called with the imageOutputPath option, rendered page images are saved to a Unity Catalog volume and the image_uri field in each chunk's pages array is populated. These image references can be passed to a vision-capable model at query time to answer questions that require visual context, such as block diagrams, charts, or tables that are not fully represented in text.

SQL
WITH parsed_documents AS (
  SELECT ai_parse_document(
    content,
    map(
      'imageOutputPath', '/Volumes/catalog/schema/volume/page_images/',
      'descriptionElementTypes', '*'
    )
  ) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  chunk.value:pages AS pages
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
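
At query time, the pages value of each retrieved chunk can be reduced to a list of image references to hand to a vision-capable model. The following Python sketch assumes the retrieved chunks are available as plain dicts with the pages structure from the output schema; the helper name image_uris_for_chunks and the image file names are hypothetical, and the model call itself is not shown.

```python
def image_uris_for_chunks(chunks: list) -> list:
    """Collect distinct page image URIs for the given chunks, in first-seen order."""
    seen = set()
    uris = []
    for chunk in chunks:
        for page in chunk.get("pages") or []:
            uri = page.get("image_uri")
            # Skip pages without an image and pages already collected.
            if uri and uri not in seen:
                seen.add(uri)
                uris.append(uri)
    return uris


# Hypothetical retrieved chunks; the file names under page_images/ are made up.
retrieved = [
    {
        "chunk_id": "doc-1_3",
        "pages": [
            {"page_id": 2, "image_uri": "/Volumes/catalog/schema/volume/page_images/doc-1_p2.png"},
            {"page_id": 3, "image_uri": "/Volumes/catalog/schema/volume/page_images/doc-1_p3.png"},
        ],
    },
    {
        "chunk_id": "doc-1_4",
        "pages": [
            {"page_id": 3, "image_uri": "/Volumes/catalog/schema/volume/page_images/doc-1_p3.png"},
            {"page_id": 4, "image_uri": None},
        ],
    },
]
uris = image_uris_for_chunks(retrieved)
```

Deduplicating keeps a page that is shared by adjacent chunks from being sent to the model twice.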

Limitations

  • The ai_prep_search function requires valid ai_parse_document output as input. Passing other VARIANT data or an unsupported schema version might produce unexpected results or errors.
  • The maximum input size is consistent with the maximum output size of ai_parse_document.