Skip to main content

ai_extract function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Preview

This functionality is in Public Preview and HIPAA compliant.

During the preview:

The ai_extract() function extracts structured data from text and documents according to a schema you provide. You can use simple field names for basic extraction, or define complex schemas with nested objects, arrays, type validation, and field descriptions for business documents like invoices, contracts, and financial filings.

The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows for end-to-end document processing.

For a visual UI to validate and iterate on the results of ai_extract, see Information Extraction.

Requirements

Apache 2.0 license

The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses.

Databricks recommends reviewing these licenses to ensure compliance with any applicable terms. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the model (and the list of applicable licenses provided on this page).

The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models.

If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks may change the models and update the documentation.

  • This function is only available in some regions, see AI function availability.
  • This function is not available on Databricks SQL Classic.
  • Check the Databricks SQL pricing page.
  • In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
  • Batch inference workloads require Databricks Runtime 15.4 ML LTS for improved performance.

Syntax

Databricks recommends using version 2 of this function because it supports nested field extraction and descriptions.

SQL
ai_extract(
content VARIANT | STRING,
schema STRING,
[options MAP<STRING, STRING>]
) RETURNS VARIANT

Arguments

  • content: A VARIANT or STRING expression. Accepts either:

    • Raw text as a STRING
    • A VARIANT produced by another AI function (such as ai_parse_document)
  • schema: A STRING literal defining the JSON schema for extraction. The schema can be:

    • Simple schema: A JSON array of field names (assumed to be strings)
      JSON
      "[\"vendor_name\", \"invoice_id\", \"total_amount\"]"
    • Advanced schema: A JSON object with type information, descriptions, and nested structures
      • Supports string, integer, number, boolean, and enum types. Performs type validation. Values that are not valid result in an error. Maximum of 500 enum values.
      • Supports nested objects using "type": "object" with "properties"
      • Supports arrays of primitives or objects using "type": "array" with "items"
      • Optional "description" field for each property to guide extraction quality
  • options: An optional MAP<STRING, STRING> containing configuration options:

    • version: Version switch to support migration ("2.1", "2.0", "1.0"). Default is based on input types.
    • instructions: Global description of the task and domain to improve extraction quality. Must be less than 20,000 characters.
    • enableCitations: When true, the output for each field in the extraction schema includes a list of zero or more citations, which indicates in the document the output extracted.
    • enableConfidenceScores: When true, the output for each field in the extraction schema includes a confidence score between 0 and 1, indicating how certain the model is about that value. The appropriate confidence threshold depends on your specific use case, and you should choose a cutoff that aligns with your tolerance for risk and error.

Returns

Returns a VARIANT containing:

JSON
{
"response": {...}, // Extracted data matching the provided schema. Each leaf is returned as a Field object (see below).
"error_message": null, // null on success, or error message on failure
"metadata": { ... } // Metadata about the response, including unfurled citation ids.
}

The response field contains the structured data extracted according to the schema:

  • Field names and types match the schema definition
  • The schema's structure is preserved in the response: nested objects and arrays keep their original shape. Each “scalar” field in the extraction schema has an output object with the following fields:
    • value: The extracted value, typed according to the schema. Null if the field can not be extracted.
    • citation_ids: Only present when enableCitations is true. An array of IDs indexed into metadata.citations.
    • confidence_score: Only present when enableConfidenceScores is true. A float between 0 and 1.
  • Type validation is enforced for integer, number, boolean, and enum types
  • If content is NULL, the result is NULL.

The metadata field contains the metadata about the response. When enableCitations is true, the metadata field contains details about each citation ID in the response field that trace the extracted value back to its location in the input.

Depending on the type of input, citations can be of one of two types:

  • For raw text (STRING) inputs, a citation is a span of text in the original input. Each object in metadata.citations contains:
    • id: Integer matching a citation_ids entry on a field.
    • start: Inclusive 0-based character offset into the input string.
    • stop: Exclusive 0-based character offset into the input string.
  • For PDF documents and images (when using ai_extract downstream of ai_parse_document), a citation is a bounding box in the original input. Each object in metadata.citations contains:
    • id: Integer matching a citation_ids entry on a field.
    • bbox: Array of {coord, page_id} objects, identical in shape to element.bbox in ai_parse_document output. coord is pixel coordinates on the page image as [x0, y0, x1, y1]; page_id is a 0-based page index.

Examples

Simple schema - field names only

SQL
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'["invoice_id", "vendor_name", "total_amount", "invoice_date"]',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": "1250.00"},
"invoice_date": {"value": "2024-01-15"}
},
"error_message": null
}

Advanced schema - with types and descriptions

SQL
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": 1250.00},
"invoice_date": {"value": "2024-01-15"}
},
"error_message": null
}

Nested objects and arrays

SQL
> SELECT ai_extract(
'Invoice #12345 from Acme Corp
Line 1: Widget A, qty 10, $50.00 each
Line 2: Widget B, qty 5, $100.00 each
Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
'{
"invoice_header": {
"type": "object",
"properties": {
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"}
}
},
"line_items": {
"type": "array",
"description": "List of invoiced products",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"}
}
}
},
"totals": {
"type": "object",
"properties": {
"subtotal": {"type": "number"},
"tax_amount": {"type": "number"},
"total_amount": {"type": "number"}
}
}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_header": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"}
},
"line_items": [
{"description": {"value": "Widget A"}, "quantity": {"value": 10}, "unit_price": {"value": 50.00}},
{"description": {"value": "Widget B"}, "quantity": {"value": 5}, "unit_price": {"value": 100.00}}
],
"totals": {
"subtotal": {"value": 1000.00},
"tax_amount": {"value": 80.00},
"total_amount": {"value": 1080.00}
}
},
"error_message": null
}

Composability with ai_parse_document

SQL
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
)
SELECT
path,
ai_extract(
parsed_content,
'["invoice_id", "vendor_name", "total_amount"]',
MAP('version', '2.1', 'instructions', 'These are vendor invoices.')
) AS invoice_data
FROM parsed_docs;

Using enums

SQL
> SELECT ai_extract(
'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
'{
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"},
"total_amount": {"type": "number"},
"currency": {
"type": "enum",
"labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
"description": "Currency code"
},
"payment_terms": {"type": "string"}
}',
options => map('version', '2.1')
);
{
"response": {
"invoice_id": {"value": "12345"},
"vendor_name": {"value": "Acme Corp"},
"total_amount": {"value": 1250.00},
"currency": {"value": "USD"},
"payment_terms": {"value": null}
},
"error_message": null
}

Citations (STRING input, SPAN citations)

SQL
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map(
'version', '2.1',
'enableCitations', 'true'
)
);
{
"response": {
"invoice_id": {"citation_ids": [0], "value": "12345"},
"vendor_name": {"citation_ids": [0], "value": "Acme Corp"},
"total_amount": {"citation_ids": [1], "value": 1250.00},
"invoice_date": {"citation_ids": [1], "value": "2024-01-15"}
},
"metadata": {
"chunk_type": "span",
"citations": [
{"id": 0, "start": 0, "stop": 29},
{"id": 1, "start": 29, "stop": 60}
]
},
"error_message": null
}

Citations (VARIANT from ai_parse_document, BBOX citations)

SQL
> WITH parsed AS (
SELECT ai_parse_document(
content,
map('imageOutputPath', '/Volumes/main/default/parsed_images/') // necessary for rendering bboxes
) AS doc
FROM READ_FILES('/Volumes/main/default/invoices/invoice.pdf', format => 'binaryFile')
)
SELECT ai_extract(
doc,
'{"invoice_id":{"type":"string"}, "total_amount":{"type":"number"}}',
options => map('version','2.1','enableCitations','true')
) AS extracted
FROM parsed;
{
"response": {
"invoice_id": {"citation_ids": [0], "value": "12345"},
"total_amount": {"citation_ids": [1], "value": 1250.00}
},
"metadata": {
"chunk_type": "bbox",
"citations": [
{"id": 0, "bbox": [{"coord": [120, 80, 240, 110], "page_id": 0}]},
{"id": 1, "bbox": [{"coord": [400, 500, 560, 530], "page_id": 0}]}
],
"pages": [{"id": 0, "image_uri": "/Volumes/main/default/parsed_images/6077ca79...f8efdb2ed05.jpg"}]
},
"error_message": null
}

Confidence scores

SQL

> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}',
options => map(
'version', '2.1',
'enableConfidenceScores', 'true'
)
);
{
"response": {
"invoice_id": {"confidence_score": 0.95, "value": "12345"},
"vendor_name": {"confidence_score": 0.62, "value": "Acme Corp"},
"total_amount": {"confidence_score": 1.0, "value": 1250.00},
"invoice_date": {"confidence_score": 0.99, "value": "2024-01-15"}
},
"error_message": null
}

Notebook example

The following notebook provides a visual debugging interface for analyzing the citation outputs of the ai_extract function. It demonstrates how to render citation metadata as substring snippets (STRING input) or bounding-box overlays (VARIANT input), and join ai_extract citations back to ai_parse_document elements in SQL so you can flag low-confidence extractions for manual review.

Citation rendering notebook

Open notebook in new tab

Limitations

  • This function is not available on Databricks SQL Classic.
  • This function cannot be used with views.
  • The schema supports a maximum of 128 fields.
  • Field names can contain up to 150 characters.
  • Schemas support up to seven levels of nesting for nested fields.
  • Enum fields support a maximum of 500 values.
  • Type validation is enforced for integer, number, boolean, and enum types. If a value does not match the specified type, the function returns an error.
  • The maximum total context size is 128,000 tokens.