ai_prep_search function
Applies to: Databricks SQL and Databricks Runtime
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The ai_prep_search() function transforms the structured output of ai_parse_document into a format optimized for RAG vector search and information retrieval systems. For each input document, the function splits content into semantic chunks, enriches each chunk with document-level context such as the document title, section headers, and page references, and produces an embedding-ready representation.
Requirements
- Databricks Runtime 18.2 or above.
- If you are using serverless compute, the following is also required:
  - The serverless environment version must be set to 3 or above, which enables features such as VARIANT.
  - You must use either Python or SQL. For additional serverless features and limitations, see Serverless compute limitations.
- The ai_prep_search function is available in Databricks notebooks, the SQL editor, Databricks workflows, jobs, and Lakeflow Spark Declarative Pipelines.
Syntax
ai_prep_search(
parsed VARIANT,
[options MAP<STRING, STRING>]
) RETURNS VARIANT
Arguments
- parsed: A VARIANT expression representing the structured output of ai_parse_document.
- options: An optional MAP<STRING, STRING>. Supported keys:
  - 'version': The version of the output schema to use.
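As a brief illustration, the following shows both invocation forms. The version value is a hypothetical placeholder, since supported version strings depend on your runtime:

```sql
-- Default invocation: uses the current default output schema version.
SELECT ai_prep_search(parsed) AS result FROM parsed_documents;

-- With options: pin the output schema version explicitly.
-- '1.0' is a hypothetical placeholder; use a version supported by your runtime.
SELECT ai_prep_search(parsed, map('version', '1.0')) AS result FROM parsed_documents;
```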
Returns
A VARIANT containing document chunks formatted for vector search indexing. Each row in the output represents one input document.
The output schema is:
{
"document": {
"contents": [
{
"chunk_id": STRING, // Unique identifier composed of the document ID and chunk position
"chunk_position": INT, // 0-based position of the chunk within the document
"chunk_content": STRING, // Raw text content of the chunk
"chunk_to_embed": STRING, // Context-enriched text prepared for embedding; see chunk_to_embed format
"pages": [
{
"page_id": INT, // Page index that this chunk appears on
"image_uri": STRING // Path to the page image for multi-modal retrieval
}
]
}
],
"pages": [
{
"id": INT, // 0-based page index
"image_uri": STRING // Path to the rendered page image, populated when
// imageOutputPath is set in ai_parse_document
}
],
"source_uri": STRING // Source document URI
},
"error_status": {...}
}
The function output schema is versioned using a major.minor format. Databricks might upgrade the supported or default version to reflect improved representations based on ongoing research.
- Minor version upgrades are backward-compatible and might only introduce new fields.
- Major version upgrades might include breaking changes such as field additions, removals, or renamings.
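Because each output row carries an error_status field, a common pattern is to filter out documents that failed processing before downstream use. This is a sketch that assumes error_status is null for successfully processed documents; verify this behavior against the actual error_status values your runtime produces:

```sql
-- Keep only documents that processed successfully.
-- Assumes error_status is null for successful rows; confirm against your runtime.
SELECT result
FROM prepped_documents
WHERE result:error_status IS NULL;
```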
chunk_to_embed format
The chunk_to_embed field combines document-level context with the chunk content to improve retrieval quality during semantic search. Fields without a value for a given chunk are included but left empty, and the exact composition might be updated in future versions to improve retrieval quality. The format is:
The following passage represents a chunk of content from a document.
- 'Content' contains raw document text
- All other fields describe document context and hierarchical information
- For visual elements like images/charts, a summary is generated as part of 'Content'
Document Title: {doc_title}
Page Header: {page_header}
Page Footer: {page_footer}
Section Header: {section_header}
Caption: {caption}
Footnote: {footnote}
Page Number: {page_number}
Content:
{chunk_content}
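To see how this template is rendered for real documents, you can extract the chunk_to_embed field for a few chunks. This sketch assumes a prepped_documents relation like the ones used in the examples that follow:

```sql
-- Peek at the first chunk's enriched embedding text for each document.
SELECT result:document.contents[0]:chunk_to_embed::STRING AS sample_chunk_to_embed
FROM prepped_documents
LIMIT 5;
```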
Examples
Chain with ai_parse_document
The following example chains ai_prep_search with ai_parse_document to produce search-ready chunks from raw documents stored in a Unity Catalog volume:
WITH parsed_documents AS (
SELECT ai_parse_document(content) AS parsed
FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
)
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents;
Build a vector search source table
The following example flattens the output into individual chunk rows and writes them to a Delta table. The table can then be used as a source for a Databricks Vector Search index, using chunk_to_embed as the embedding column and chunk_id as the primary key.
WITH parsed_documents AS (
SELECT ai_parse_document(content) AS parsed
FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents
)
SELECT
chunk.value:chunk_id::STRING AS chunk_id,
chunk.value:chunk_position::INT AS chunk_position,
chunk.value:chunk_content::STRING AS chunk_content,
chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
prepped_documents.result:document.source_uri::STRING AS source_uri
FROM
prepped_documents,
LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
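To persist the flattened rows as the Delta source table described above, you can wrap the query in a CREATE TABLE statement. The table name here is a placeholder, and the change data feed property reflects the usual requirement for Delta Sync vector search indexes; confirm both against your workspace and the Vector Search documentation:

```sql
-- Persist chunks to a Delta table usable as a Vector Search source.
-- The table name is a placeholder; enableChangeDataFeed is typically
-- required for Delta Sync indexes.
CREATE OR REPLACE TABLE mydata.search.document_chunks
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
AS
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_position::INT AS chunk_position,
  chunk.value:chunk_content::STRING AS chunk_content,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  prepped_documents.result:document.source_uri::STRING AS source_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
```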
The resulting rows have the following schema:
| Column name | Type |
|---|---|
| chunk_id | STRING |
| chunk_position | INT |
| chunk_content | STRING |
| chunk_to_embed | STRING |
| source_uri | STRING |
Enable multi-modal retrieval
When ai_parse_document is called with the imageOutputPath option, rendered page images are saved to a Unity Catalog volume and the image_uri field in each chunk's pages array is populated. These image references can be passed to a vision-capable model at query time to answer questions that require visual context, such as block diagrams, charts, or tables that are not fully represented in text.
WITH parsed_documents AS (
SELECT ai_parse_document(
content,
map(
'imageOutputPath', '/Volumes/catalog/schema/volume/page_images/',
'descriptionElementTypes', '*'
)
) AS parsed
FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents
)
SELECT
chunk.value:chunk_id::STRING AS chunk_id,
chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
chunk.value:pages AS pages
FROM
prepped_documents,
LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
Limitations
- The ai_prep_search function requires valid ai_parse_document output as input. Passing other VARIANT data or an unsupported schema version might produce unexpected results or errors.
- The maximum input size is consistent with the maximum output size of ai_parse_document.