ai_extract function
Applies to: Databricks SQL
Databricks Runtime
This functionality is in Public Preview and HIPAA compliant.
During the preview:
- The underlying language model can handle several languages, but this AI Function is tuned for English.
- See Features with limited regional availability for AI Functions region availability.
The ai_extract() function extracts structured data from text and documents according to a schema you provide. You can use simple field names for basic extraction, or define complex schemas with nested objects, arrays, type validation, and field descriptions for business documents like invoices, contracts, and financial filings.
The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows for end-to-end document processing.
For a visual UI to validate and iterate on the results of ai_extract, see Information Extraction.
Requirements
Apache 2.0 license
The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses.
Databricks recommends reviewing these licenses to ensure compliance with any applicable terms. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the model (and the list of applicable licenses provided on this page).
The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models.
If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks may change the models and update the documentation.
- This function is only available in some regions, see AI function availability.
- This function is not available on Databricks SQL Classic.
- Check the Databricks SQL pricing page.
- In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
- Batch inference workloads require Databricks Runtime 15.4 ML LTS for improved performance.
Syntax
Databricks recommends using version 2 of this function because it supports nested field extraction and descriptions.
- Version 2.1 (recommended)
- Version 2
- Version 1
ai_extract(
content VARIANT | STRING,
schema STRING,
[options MAP<STRING, STRING>]
) RETURNS VARIANT
ai_extract(
content VARIANT | STRING,
schema STRING,
[options MAP<STRING, STRING>]
) RETURNS VARIANT
ai_extract(
content STRING,
labels ARRAY<STRING>,
[options MAP<STRING, STRING>]
) RETURNS STRUCT
Arguments
- Version 2.1 (recommended)
- Version 2
- Version 1
-
content: AVARIANTorSTRINGexpression. Accepts either:- Raw text as a
STRING - A
VARIANTproduced by another AI function (such asai_parse_document)
- Raw text as a
-
schema: ASTRINGliteral defining the JSON schema for extraction. The schema can be:- Simple schema: A JSON array of field names (assumed to be strings)
JSON
"[\"vendor_name\", \"invoice_id\", \"total_amount\"]" - Advanced schema: A JSON object with type information, descriptions, and nested structures
- Supports
string,integer,number,boolean, andenumtypes. Performs type validation. Values that are not valid result in an error. Maximum of 500 enum values. - Supports nested objects using
"type": "object"with"properties" - Supports arrays of primitives or objects using
"type": "array"with"items" - Optional
"description"field for each property to guide extraction quality
- Supports
- Simple schema: A JSON array of field names (assumed to be strings)
-
options: An optionalMAP<STRING, STRING>containing configuration options:version: Version switch to support migration ("2.1","2.0","1.0"). Default is based on input types.instructions: Global description of the task and domain to improve extraction quality. Must be less than 20,000 characters.enableCitations: Whentrue, the output for each field in the extraction schema includes a list of zero or more citations, which indicates in the document the output extracted.enableConfidenceScores: Whentrue, the output for each field in the extraction schema includes a confidence score between 0 and 1, indicating how certain the model is about that value. The appropriate confidence threshold depends on your specific use case, and you should choose a cutoff that aligns with your tolerance for risk and error.
-
content: AVARIANTorSTRINGexpression. Accepts either:- Raw text as a
STRING - A
VARIANTproduced by another AI function (such asai_parse_document)
- Raw text as a
-
schema: ASTRINGliteral defining the JSON schema for extraction. The schema can be:- Simple schema: A JSON array of field names (assumed to be strings)
JSON
"[\"vendor_name\", \"invoice_id\", \"total_amount\"]" - Advanced schema: A JSON object with type information, descriptions, and nested structures
- Supports
string,integer,number,boolean, andenumtypes. Performs type validation. Values that are not valid result in an error. Maximum of 500 enum values. - Supports nested objects using
"type": "object"with"properties" - Supports arrays of primitives or objects using
"type": "array"with"items" - Optional
"description"field for each property to guide extraction quality
- Supports
- Simple schema: A JSON array of field names (assumed to be strings)
-
options: An optionalMAP<STRING, STRING>containing configuration options:version: Version switch to support migration ("1.0"for v1 behavior,"2.0"for v2 behavior). Default is based on input types, but falls back to"1.0".instructions: Global description of the task and domain to improve extraction quality. Must be less than 20,000 characters.
-
content: ASTRINGexpression containing the raw text. -
labels: AnARRAY<STRING>literal. Each element is a type of entity to be extracted. -
options: An optionalMAP<STRING, STRING>containing configuration options:version: Version switch to support migration ("1.0"for v1 behavior,"2.0"for v2 behavior). Default is based on input types, but will fall back to"1.0".
Returns
- Version 2.1 (recommended)
- Version 2
- Version 1
Returns a VARIANT containing:
{
"response": {...}, // Extracted data matching the provided schema. Each leaf is returned as a Field object (see below).
"error_message": null, // null on success, or error message on failure
"metadata": { ... } // Metadata about the response, including unfurled citation ids.
}
The response field contains the structured data extracted according to the schema:
- Field names and types match the schema definition
- The schema's structure is preserved in the response: nested objects and arrays keep their original shape. Each “scalar” field in the extraction schema has an output object with the following fields:
value: The extracted value, typed according to the schema.Nullif the field can not be extracted.citation_ids: Only present whenenableCitationsistrue. An array of IDs indexed intometadata.citations.confidence_score: Only present whenenableConfidenceScoresistrue. A float between 0 and 1.
- Type validation is enforced for integer, number, boolean, and enum types
- If
contentisNULL, the result isNULL.
The metadata field contains the metadata about the response. When enableCitations is true, the metadata field contains details about each citation ID in the response field that trace the extracted value back to its location in the input.
Depending on the type of input, citations can be of one of two types:
- For raw text (STRING) inputs, a citation is a span of text in the original input. Each object in metadata.citations contains:
id: Integer matching a citation_ids entry on a field.start: Inclusive 0-based character offset into the input string.stop: Exclusive 0-based character offset into the input string.
- For PDF documents and images (when using
ai_extractdownstream ofai_parse_document), a citation is a bounding box in the original input. Each object inmetadata.citationscontains:id: Integer matching acitation_idsentry on a field.bbox: Array of{coord, page_id}objects, identical in shape to element.bbox inai_parse_documentoutput.coordis pixel coordinates on the page image as[x0, y0, x1, y1]; page_idis a 0-based page index.
Returns a VARIANT containing:
{
"response": { ... }, // Extracted data matching the provided schema
"error_message": null // null on success, or error message on failure
}
The response field contains the structured data extracted according to the schema:
- Field names and types match the schema definition
- Nested objects and arrays are preserved in the structure
- Fields may be
nullif not found - Type validation is enforced for
integer,number,boolean, andenumtypes
If content is NULL, the result is NULL.
Returns a STRUCT where each field corresponds to an entity type specified in labels. Each field contains a string representing the extracted entity. If the function finds more than one candidate for any entity type, it returns only one.
Examples
- Version 2.1 (recommended)
- Version 2
- Version 1
Simple schema - field names only
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'["invoice_id", "vendor_name", "total_amount", "invoice_date"]',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": "1250.00"},
"invoice_date": {"value": "2024-01-15"}
},
"error_message": null
}
Advanced schema - with types and descriptions
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": 1250.00},
"invoice_date": {"value": "2024-01-15"}
},
"error_message": null
}
Nested objects and arrays
> SELECT ai_extract(
'Invoice #12345 from Acme Corp
Line 1: Widget A, qty 10, $50.00 each
Line 2: Widget B, qty 5, $100.00 each
Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
'{
"invoice_header": {
"type": "object",
"properties": {
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"}
}
},
"line_items": {
"type": "array",
"description": "List of invoiced products",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"}
}
}
},
"totals": {
"type": "object",
"properties": {
"subtotal": {"type": "number"},
"tax_amount": {"type": "number"},
"total_amount": {"type": "number"}
}
}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_header": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"}
},
"line_items": [
{"description": {"value": "Widget A"}, "quantity": {"value": 10}, "unit_price": {"value": 50.00}},
{"description": {"value": "Widget B"}, "quantity": {"value": 5}, "unit_price": {"value": 100.00}}
],
"totals": {
"subtotal": {"value": 1000.00},
"tax_amount": {"value": 80.00},
"total_amount": {"value": 1080.00}
}
},
"error_message": null
}
Composability with ai_parse_document
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
)
SELECT
path,
ai_extract(
parsed_content,
'["invoice_id", "vendor_name", "total_amount"]',
MAP('version', '2.1', 'instructions', 'These are vendor invoices.')
) AS invoice_data
FROM parsed_docs;
Using enums
> SELECT ai_extract(
'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
'{
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"},
"total_amount": {"type": "number"},
"currency": {
"type": "enum",
"labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
"description": "Currency code"
},
"payment_terms": {"type": "string"}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": 1250.00},
"currency": {"value": "USD"},
"payment_terms": {"value": null}
},
"error_message": null
}
Citations (STRING input, SPAN citations)
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map(
'version', '2.1',
'enableCitations', 'true'
)
);
{
"response": {
"invoice_id": {"citation_ids": [0], "value": "12345"},
"vendor_name": {"citation_ids": [0], "value": "Acme Corp"},
"total_amount": {"citation_ids": [1], "value": 1250.00},
"invoice_date": {"citation_ids": [1], "value": "2024-01-15"}
},
"metadata": {
"chunk_type": "span",
"citations": [
{"id": 0, "start": 0, "stop": 29},
{"id": 1, "start": 29, "stop": 60}
]
},
"error_message": null
}
Citations (VARIANT from ai_parse_document, BBOX citations)
> WITH parsed AS (
SELECT ai_parse_document(
content,
map('imageOutputPath', '/Volumes/main/default/parsed_images/') // necessary for rendering bboxes
) AS doc
FROM READ_FILES('/Volumes/main/default/invoices/invoice.pdf', format => 'binaryFile')
)
SELECT ai_extract(
doc,
'{"invoice_id":{"type":"string"}, "total_amount":{"type":"number"}}',
options => map('version','2.1','enableCitations','true')
) AS extracted
FROM parsed;
{
"response": {
"invoice_id": {"citation_ids": [0], "value": "12345"},
"total_amount": {"citation_ids": [1], "value": 1250.00}
},
"metadata": {
"chunk_type": "bbox",
"citations": [
{"id": 0, "bbox": [{"coord": [120, 80, 240, 110], "page_id": 0}]},
{"id": 1, "bbox": [{"coord": [400, 500, 560, 530], "page_id": 0}]}
],
"pages": [{"id": 0, "image_uri": "/Volumes/main/default/parsed_images/6077ca79...f8efdb2ed05.jpg"}]
},
"error_message": null
}
Confidence scores
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map(
'version', '2.1',
'enableConfidenceScores', 'true'
)
);
{
"response": {
"invoice_id": {"confidence_score": 0.95, "value": "12345"},
"vendor_name": {"confidence_score": 0.62, "value": "Acme Corp"},
"total_amount": {"confidence_score": 1.0, "value": 1250.00},
"invoice_date": {"confidence_score": 0.99, "value": "2024-01-15"}
},
"error_message": null
}
Notebook example
The following notebook provides a visual debugging interface for analyzing the citation outputs of the ai_extract function. It demonstrates how to render citation metadata as substring snippets (STRING input) or bounding-box overlays (VARIANT input), and join ai_extract citations back to ai_parse_document elements in SQL so you can flag low-confidence extractions for manual review.
Citation rendering notebook
Simple schema - field names only
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'["invoice_id", "vendor_name", "total_amount", "invoice_date"]'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": "1250.00",
"invoice_date": "2024-01-15"
},
"error_message": null
}
Advanced schema - with types and descriptions
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": 1250.00,
"invoice_date": "2024-01-15"
},
"error_message": null
}
Nested objects and arrays
> SELECT ai_extract(
'Invoice #12345 from Acme Corp
Line 1: Widget A, qty 10, $50.00 each
Line 2: Widget B, qty 5, $100.00 each
Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
'{
"invoice_header": {
"type": "object",
"properties": {
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"}
}
},
"line_items": {
"type": "array",
"description": "List of invoiced products",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"}
}
}
},
"totals": {
"type": "object",
"properties": {
"subtotal": {"type": "number"},
"tax_amount": {"type": "number"},
"total_amount": {"type": "number"}
}
}
}'
);
{
"response": {
"invoice_header": {
"invoice_id": "12345",
"vendor_name": "Acme Corp"
},
"line_items": [
{"description": "Widget A", "quantity": 10, "unit_price": 50.00},
{"description": "Widget B", "quantity": 5, "unit_price": 100.00}
],
"totals": {
"subtotal": 1000.00,
"tax_amount": 80.00,
"total_amount": 1080.00
}
},
"error": null
}
Composability with ai_parse_document
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
)
SELECT
path,
ai_extract(
parsed_content,
'["invoice_id", "vendor_name", "total_amount"]',
MAP('instructions', 'These are vendor invoices.')
) AS invoice_data
FROM parsed_docs;
Using enums
> SELECT ai_extract(
'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
'{
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"},
"total_amount": {"type": "number"},
"currency": {
"type": "enum",
"labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
"description": "Currency code"
},
"payment_terms": {"type": "string"}
}'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": 1250.00,
"currency": "USD",
"payment_terms": null
},
"error": null
}
> SELECT ai_extract(
'John Doe lives in New York and works for Acme Corp.',
array('person', 'location', 'organization')
);
{"person": "John Doe", "location": "New York", "organization": "Acme Corp."}
> SELECT ai_extract(
'Send an email to jane.doe@example.com about the meeting at 10am.',
array('email', 'time')
);
{"email": "jane.doe@example.com", "time": "10am"}
Limitations
- Version 2.1 (recommended)
- Version 2
- Version 1
- This function is not available on Databricks SQL Classic.
- This function cannot be used with views.
- The schema supports a maximum of 128 fields.
- Field names can contain up to 150 characters.
- Schemas support up to seven levels of nesting for nested fields.
- Enum fields support a maximum of 500 values.
- Type validation is enforced for
integer,number,boolean, andenumtypes. If a value does not match the specified type, the function returns an error. - The maximum total context size is 128,000 tokens.
- This function is not available on Databricks SQL Classic.
- This function cannot be used with views.
- The schema supports a maximum of 128 fields.
- Field names can contain up to 150 characters.
- Schemas support up to seven levels of nesting for nested fields.
- Enum fields support a maximum of 500 values.
- Type validation is enforced for
integer,number,boolean, andenumtypes. If a value does not match the specified type, the function returns an error. - The maximum total context size is 128,000 tokens.
- This function is not available on Databricks SQL Classic.
- This function cannot be used with Views.
- If more than one candidate for an entity type is found in the content, only one value is returned.