ai_prep_search function

Applies to: Databricks SQL and Databricks Runtime

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

The ai_prep_search() function transforms the structured output of ai_parse_document into a format optimized for RAG vector search and information retrieval systems. For each input document, the function splits content into semantic chunks, enriches each chunk with document-level context such as the document title, section headers, and page references, and produces an embedding-ready representation.

Requirements

  • Databricks Runtime 18.2 or above.
  • If you are using Serverless compute, the following is also required:
    • The serverless environment version must be set to 3 or above, as this enables features like VARIANT.
    • You must use either Python or SQL. For additional serverless features and limitations, see Serverless compute limitations.
  • The ai_prep_search function is available in Databricks notebooks, the SQL editor, Databricks workflows and jobs, and Lakeflow Spark Declarative Pipelines.

Syntax

SQL
ai_prep_search(
  parsed VARIANT,
  [options MAP<STRING, STRING>]
) RETURNS VARIANT

Arguments

  • parsed: A VARIANT expression representing the structured output of ai_parse_document.
  • options: An optional MAP<STRING, STRING>. Supported keys:
    • 'version': The version of the output schema to use.
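For example, to pin the output to a specific schema version (the '1.0' value below is a placeholder; see the versioning note under Returns for how versions evolve):

```sql
SELECT ai_prep_search(
  parsed,
  map('version', '1.0')  -- placeholder value; pin to a supported schema version
) AS result
FROM parsed_documents;
```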

Returns

A VARIANT containing document chunks formatted for vector search indexing. Each row in the output represents one input document.

The output schema is:

JSON
{
  "document": {
    "contents": [
      {
        "chunk_id": STRING,       // Unique identifier composed of the document ID and chunk position
        "chunk_position": INT,    // 0-based position of the chunk within the document
        "chunk_content": STRING,  // Raw text content of the chunk
        "chunk_to_embed": STRING, // Context-enriched text prepared for embedding; see chunk_to_embed format
        "pages": [
          {
            "page_id": INT,       // Page index that this chunk appears on
            "image_uri": STRING   // Path to the page image for multi-modal retrieval
          }
        ]
      }
    ],
    "pages": [
      {
        "id": INT,                // 0-based page index
        "image_uri": STRING       // Path to the rendered page image, populated when
                                  // imageOutputPath is set in ai_parse_document
      }
    ],
    "source_uri": STRING          // Source document URI
  },
  "error_status": {...}
}
important

The function output schema is versioned using a major.minor format. Databricks might upgrade the supported or default version to reflect improved representations based on ongoing research.

  • Minor version upgrades are backward-compatible and might only introduce new fields.
  • Major version upgrades might include breaking changes such as field removals or renamings.
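
Per-document errors are reported in the error_status field rather than failing the whole query. As a sketch, assuming error_status is unset for successfully processed documents, failed documents can be isolated like this:

```sql
WITH prepped AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents  -- CTE holding ai_parse_document output, as in the Examples section
)
SELECT
  result:document.source_uri::STRING AS source_uri,
  result:error_status AS error_status
FROM prepped
WHERE result:error_status IS NOT NULL;
```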

chunk_to_embed format

The chunk_to_embed field combines document-level context with the chunk content to improve retrieval quality during semantic search. The format is:

note

Fields without a value for a given chunk are included but left empty. The exact composition might be updated in future versions to improve retrieval quality.

The following passage represents a chunk of content from a document.
- 'Content' contains raw document text
- All other fields describe document context and hierarchical information
- For visual elements like images/charts, a summary is generated as part of 'Content'

Document Title: {doc_title}
Page Header: {page_header}
Page Footer: {page_footer}
Section Header: {section_header}
Caption: {caption}
Footnote: {footnote}
Page Number: {page_number}

Content:
{chunk_content}
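
For illustration only, the composition above can be sketched in Python. The template text mirrors the format shown, but the helper below is a hypothetical client-side reconstruction, not part of the Databricks API; the actual string is produced by ai_prep_search.

```python
# Illustrative sketch of how chunk_to_embed composes document context with
# chunk content. Hypothetical helper, not a Databricks API.

TEMPLATE = """The following passage represents a chunk of content from a document.
- 'Content' contains raw document text
- All other fields describe document context and hierarchical information
- For visual elements like images/charts, a summary is generated as part of 'Content'

Document Title: {doc_title}
Page Header: {page_header}
Page Footer: {page_footer}
Section Header: {section_header}
Caption: {caption}
Footnote: {footnote}
Page Number: {page_number}

Content:
{chunk_content}"""

def build_chunk_to_embed(chunk_content, **context):
    # Fields without a value are included but left empty, per the note above.
    fields = ["doc_title", "page_header", "page_footer", "section_header",
              "caption", "footnote", "page_number"]
    values = {f: context.get(f, "") for f in fields}
    return TEMPLATE.format(chunk_content=chunk_content, **values)

example = build_chunk_to_embed(
    "Quarterly revenue grew 12% year over year.",
    doc_title="FY24 Annual Report",
    section_header="Financial Highlights",
    page_number="3",
)
print(example)
```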

Examples

Chain with ai_parse_document

The following example chains ai_prep_search with ai_parse_document to produce search-ready chunks from raw documents stored in a Unity Catalog volume:

SQL
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
)
SELECT ai_prep_search(parsed) AS result
FROM parsed_documents;

Build a vector search source table

The following example flattens the output into individual chunk rows and writes them to a Delta table. The table can then be used as a source for a Databricks Vector Search index, using chunk_to_embed as the embedding column and chunk_id as the primary key.

SQL
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_position::INT AS chunk_position,
  chunk.value:chunk_content::STRING AS chunk_content,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  prepped_documents.result:document.source_uri::STRING AS source_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;

The resulting rows have the following schema:

Column name      Type
chunk_id         STRING
chunk_position   INT
chunk_content    STRING
chunk_to_embed   STRING
source_uri       STRING
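
To persist the flattened chunks, the query above can be wrapped in a CREATE TABLE statement. This is a sketch: the table name is a placeholder, and Change Data Feed is enabled here on the assumption that the table will back a Vector Search delta sync index, which requires it.

```sql
CREATE TABLE mycatalog.myschema.document_chunks  -- placeholder table name
TBLPROPERTIES (delta.enableChangeDataFeed = true)
AS
WITH parsed_documents AS (
  SELECT ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_position::INT AS chunk_position,
  chunk.value:chunk_content::STRING AS chunk_content,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  prepped_documents.result:document.source_uri::STRING AS source_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
```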

Enable multi-modal retrieval

When ai_parse_document is called with the imageOutputPath option, rendered page images are saved to a Unity Catalog volume and the image_uri field in each chunk's pages array is populated. These image references can be passed to a vision-capable model at query time to answer questions that require visual context, such as block diagrams, charts, or tables that are not fully represented in text.

SQL
WITH parsed_documents AS (
  SELECT ai_parse_document(
    content,
    map(
      'imageOutputPath', '/Volumes/catalog/schema/volume/page_images/',
      'descriptionElementTypes', '*'
    )
  ) AS parsed
  FROM READ_FILES('/Volumes/mydata/documents/', format => 'binaryFile')
),
prepped_documents AS (
  SELECT ai_prep_search(parsed) AS result
  FROM parsed_documents
)
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  chunk.value:chunk_to_embed::STRING AS chunk_to_embed,
  chunk.value:pages AS pages
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk;
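
To get one row per chunk-page pair, for example to fetch each referenced page image for a vision-capable model, the pages array inside each chunk can be exploded a second time. This is a sketch that reuses the prepped_documents CTE from the query above:

```sql
SELECT
  chunk.value:chunk_id::STRING AS chunk_id,
  page.value:page_id::INT AS page_id,
  page.value:image_uri::STRING AS image_uri
FROM
  prepped_documents,
  LATERAL variant_explode(prepped_documents.result:document.contents) AS chunk,
  LATERAL variant_explode(chunk.value:pages) AS page;
```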

Limitations

  • The ai_prep_search function requires valid ai_parse_document output as input. Passing other VARIANT data or an unsupported schema version might produce unexpected results or errors.
  • The maximum input size is consistent with the maximum output size of ai_parse_document.