ai_prep_search function
Applies to: Databricks SQL, Databricks Runtime
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The ai_prep_search() function transforms the structured output of ai_parse_document into a format optimized for RAG vector search and information retrieval systems. For each input document, the function splits the content into semantic chunks, enriches each chunk with document-level context (such as the document title, section headers, and page references), and produces an embedding-ready representation.
Requirements
- Databricks Runtime 18.2 or above.
- If you are using serverless compute, the following is also required:
  - The serverless environment version must be set to 3 or above, as this enables features like VARIANT.
  - Must use either Python or SQL. For additional serverless features and limitations, see Serverless compute limitations.
- The ai_prep_search function is available using Databricks notebooks, SQL editor, Databricks workflows, jobs, or Lakeflow Spark Declarative Pipelines.
Syntax
ai_prep_search(parsed [, options])
Arguments
- parsed: A VARIANT expression representing the structured output of ai_parse_document.
- options: An optional MAP<STRING, STRING>. Supported keys:
  - 'version': The version of the output schema to use.
Returns
A VARIANT containing document chunks formatted for vector search indexing. Each row in the output represents one input document.
The output schema is:
{
  "document": {
    "contents": [
      {
        "chunk_id": STRING,           // Unique identifier composed of the document ID and chunk position
        "chunk_position": INT,        // 0-based position of the chunk within the document
        "chunk_to_retrieve": STRING,  // Raw text content of the chunk
        "chunk_to_embed": STRING,     // Context-enriched text prepared for embedding; see chunk_to_embed format
        "pages": [
          {
            "page_id": INT,           // Page index that this chunk appears on
            "image_uri": STRING       // Path to the page image for multi-modal retrieval
          }
        ]
      }
    ],
    "pages": [
      {
        "id": INT,                    // 0-based page index
        "image_uri": STRING           // Path to the rendered page image, populated when
                                      // imageOutputPath is set in ai_parse_document
      }
    ],
    "source_uri": STRING              // Source document URI
  },
  "error_status": {...}
}
The function output schema is versioned using a major.minor format. Databricks might upgrade the supported or default version to reflect improved representations based on ongoing research.
- Minor version upgrades are backward-compatible and might only introduce new fields.
- Major version upgrades might include breaking changes such as field additions, removals, or renamings.
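The compatibility rule above can be sketched as a small check that downstream code might run before consuming output. This is only an illustration: the helper names `parse_version` and `is_compatible` are hypothetical, the version strings are made-up values, and this page does not say how a consumer learns which version produced a given result.

```python
from typing import NamedTuple


class SchemaVersion(NamedTuple):
    major: int
    minor: int


def parse_version(text: str) -> SchemaVersion:
    """Parse a 'major.minor' string such as '1.0' (hypothetical value)."""
    major, minor = text.strip().split(".")
    return SchemaVersion(int(major), int(minor))


def is_compatible(written_for: str, actual: str) -> bool:
    """Return True if a reader written for `written_for` can consume `actual`.

    Assumption: the same major version with an equal or newer minor version
    is safe, because minor upgrades are described as backward-compatible.
    """
    expected = parse_version(written_for)
    found = parse_version(actual)
    return found.major == expected.major and found.minor >= expected.minor


# Usage with made-up version strings:
print(is_compatible("1.0", "1.3"))  # newer minor, same major
print(is_compatible("1.0", "2.0"))  # different major
```

Pinning the 'version' option is the more direct way to avoid surprises; a check like this is only a guard for code that reads results produced elsewhere.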
chunk_to_embed format
The chunk_to_embed field is a single string built per chunk by combining the raw chunk text with document-level context to improve retrieval quality during semantic search.
The string is composed of the following parts:
- Document metadata: Document Title, Page Header, Page Footer, Section Header, Caption, Footnote, Page Number. Extracted directly from the parsed document structure.
- LLM-discovered document fields: additional Key: Value lines for document-level fields auto-discovered by an LLM, such as "Company", "Document Type", "Fiscal Year", "Patient ID", or "Contract Number". The field names are chosen per document by the model and vary across documents.
- Document context sentence: a single sentence summarizing what the document is about, generated by an LLM.
- Content: the raw chunk text. Same value as the chunk_to_retrieve field for the chunk.
- Table summary: a short LLM-generated paraphrase of the table contents. For chunks that contain a table, the function appends this table summary and a set of related natural-language questions that the table can answer.
- Related questions: natural-language questions the table is capable of answering, used to improve retrieval recall for table content.
The string follows this template:
Document Title: {doc_title}
Page Header: {page_header}
Page Footer: {page_footer}
Section Header: {section_header}
Caption: {caption}
Footnote: {footnote}
Page Number: {page_number}
{additional_llm_discovered_fields}
{document_context_sentence}
Table summary: {table_summary}
Content:
{chunk_to_retrieve}
Related questions:
{qa_text}
Example rendered chunk_to_embed
Document Title: Acme Corp 2024 Annual Report
Page Header:
Page Footer:
Section Header: Risk Factors
Caption:
Footnote:
Page Number: 14
Company: Acme Corp
Document Type: 10-K
Fiscal Year: 2024
Acme Corp's 2024 annual report covering financial performance and risk disclosures across global operating segments.
Content:
Our business faces a number of risks, including competition from established providers, evolving regulatory requirements, and concentration in a small number of large customers.
The fixed metadata fields are always rendered with their label, with the value left empty when not available. The LLM-discovered fields, document context sentence, table summary, and related questions are omitted entirely when not available or not applicable. The exact composition might be updated in future versions to improve retrieval quality.
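The composition rules can be illustrated with a short Python sketch. It is not the function's implementation: `render_chunk_to_embed` and its parameters are hypothetical names, and the exact whitespace between parts (one part per line, no trailing space after an empty label) is an assumption based on the rendered example.

```python
FIXED_LABELS = [
    "Document Title", "Page Header", "Page Footer", "Section Header",
    "Caption", "Footnote", "Page Number",
]


def render_chunk_to_embed(
    chunk_to_retrieve,
    metadata=None,
    discovered_fields=None,
    context_sentence=None,
    table_summary=None,
    related_questions=None,
):
    """Assemble a chunk_to_embed-style string following the documented template."""
    metadata = metadata or {}
    lines = []
    # Fixed metadata: the label is always rendered, the value may be empty.
    for label in FIXED_LABELS:
        value = metadata.get(label)
        value = "" if value is None else value
        lines.append(f"{label}: {value}".rstrip())
    # Optional parts: omitted entirely when not available.
    for key, value in (discovered_fields or {}).items():
        lines.append(f"{key}: {value}")
    if context_sentence:
        lines.append(context_sentence)
    if table_summary:
        lines.append(f"Table summary: {table_summary}")
    # The raw chunk text always follows the "Content:" label.
    lines.append("Content:")
    lines.append(chunk_to_retrieve)
    if related_questions:
        lines.append("Related questions:")
        lines.extend(related_questions)
    return "\n".join(lines)


# Usage with the values from the rendered example above:
rendered = render_chunk_to_embed(
    "Our business faces a number of risks, including competition from "
    "established providers, evolving regulatory requirements, and "
    "concentration in a small number of large customers.",
    metadata={
        "Document Title": "Acme Corp 2024 Annual Report",
        "Section Header": "Risk Factors",
        "Page Number": 14,
    },
    discovered_fields={
        "Company": "Acme Corp",
        "Document Type": "10-K",
        "Fiscal Year": "2024",
    },
    context_sentence=(
        "Acme Corp's 2024 annual report covering financial performance and "
        "risk disclosures across global operating segments."
    ),
)
print(rendered)
```

Because the chunk in this example contains no table, the sketch emits neither a table summary nor related questions, which matches the rendered example.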
Examples
Chain with ai_parse_document
The following example chains ai_prep_search with ai_parse_document to produce search-ready chunks from raw documents stored in a Unity Catalog volume:
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
)
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents;
Build a vector search source table
The following example flattens the output into individual chunk rows. After the rows are written to a Delta table, the table can be used as a source for a Databricks AI Search index, using chunk_to_embed as the embedding column and chunk_id as the primary key.
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT
    path,
    ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_position::INT AS chunk_position,
  chunk.value:chunk_to_retrieve::STRING AS chunk_to_retrieve,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  prepped_documents.path AS source_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
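The resulting rows have the following schema:

| Column name | Type |
|---|---|
| chunk_id | STRING |
| chunk_position | INT |
| chunk_to_retrieve | STRING |
| chunk_to_embed | STRING |
| source_uri | STRING |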
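The flattening performed by the query above can also be mirrored in Python, for example when a result has already been converted to a plain dictionary. The following is a minimal sketch under that assumption; `flatten_chunks` is a hypothetical helper, and the sample identifiers and text are made up for illustration.

```python
from typing import Any


def flatten_chunks(result: dict[str, Any], source_uri: str) -> list[dict[str, Any]]:
    """Turn one ai_prep_search result (as a dict) into one row per chunk."""
    document = result.get("document") or {}
    rows = []
    for chunk in document.get("contents") or []:
        rows.append(
            {
                "chunk_id": chunk.get("chunk_id"),
                "chunk_position": chunk.get("chunk_position"),
                "chunk_to_retrieve": chunk.get("chunk_to_retrieve"),
                "chunk_to_embed": chunk.get("chunk_to_embed"),
                # As in the SQL example, the file path is used as source_uri.
                "source_uri": source_uri,
            }
        )
    return rows


# Made-up sample that follows the documented output schema.
sample_result = {
    "document": {
        "contents": [
            {
                "chunk_id": "report-0",  # hypothetical ID format
                "chunk_position": 0,
                "chunk_to_retrieve": "First chunk text.",
                "chunk_to_embed": "Document Title: Report\nContent:\nFirst chunk text.",
                "pages": [{"page_id": 0, "image_uri": None}],
            },
            {
                "chunk_id": "report-1",
                "chunk_position": 1,
                "chunk_to_retrieve": "Second chunk text.",
                "chunk_to_embed": "Document Title: Report\nContent:\nSecond chunk text.",
                "pages": [{"page_id": 1, "image_uri": None}],
            },
        ],
        "pages": [{"id": 0, "image_uri": None}, {"id": 1, "image_uri": None}],
        "source_uri": "/Volumes/mydata/documents/report.pdf",
    }
}

rows = flatten_chunks(sample_result, "/Volumes/mydata/documents/report.pdf")
for row in rows:
    print(row["chunk_id"], row["chunk_position"])
```

Each returned dictionary has the same five keys as the columns selected by the SQL query.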
Enable multi-modal retrieval
When ai_parse_document is called with the imageOutputPath option, rendered page images are saved to a Unity Catalog volume and the image_uri field in each chunk's pages array is populated. These image references can be passed to a vision-capable model at query time to answer questions that require visual context, such as block diagrams, charts, or tables that are not fully represented in text.
WITH parsed_documents AS (
  SELECT ai_parse_document(
    content,
    map(
      'imageOutputPath', '/Volumes/catalog/schema/volume/page_images/',
      'descriptionElementTypes', '*'
    )
  ) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  chunk.value:pages AS pages
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
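At query time, the pages array of each retrieved chunk has to be reduced to a list of image references before those images can be handed to a vision-capable model. The sketch below shows one way to do that in Python, assuming the retrieved chunks are available as dictionaries. The helper name `page_images_for_chunks` and the image file names are hypothetical, and the call to the model itself is out of scope here.

```python
def page_images_for_chunks(chunks):
    """Collect distinct page image URIs for retrieved chunks, in first-seen order."""
    seen = set()
    uris = []
    for chunk in chunks:
        for page in chunk.get("pages") or []:
            uri = page.get("image_uri")
            # image_uri is only populated when imageOutputPath was set.
            if uri and uri not in seen:
                seen.add(uri)
                uris.append(uri)
    return uris


# Made-up retrieval results; the file names are illustrative only.
retrieved = [
    {
        "chunk_id": "report-3",
        "pages": [
            {"page_id": 2, "image_uri": "/Volumes/catalog/schema/volume/page_images/page_2.png"},
            {"page_id": 3, "image_uri": "/Volumes/catalog/schema/volume/page_images/page_3.png"},
        ],
    },
    {
        "chunk_id": "report-4",
        "pages": [
            {"page_id": 3, "image_uri": "/Volumes/catalog/schema/volume/page_images/page_3.png"},
            {"page_id": 4, "image_uri": None},
        ],
    },
]

image_uris = page_images_for_chunks(retrieved)
print(image_uris)
```

Deduplicating by URI avoids sending the same page image twice when adjacent chunks share a page.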
Limitations
- The ai_prep_search function requires valid ai_parse_document output as input. Passing other VARIANT data or an unsupported schema version might produce unexpected results or errors.
- The maximum input size is consistent with the maximum output size of ai_parse_document.