ai_classify function
Applies to: Databricks SQL and Databricks Runtime
This functionality is in Public Preview and is HIPAA compliant.
During the preview:
- The underlying language model can handle several languages, but this AI Function is tuned for English.
- See Features with limited regional availability for AI Functions region availability.
The ai_classify() function classifies text content according to custom labels you provide. You can use simple label names for basic classification, or add label descriptions and instructions to improve accuracy for use cases like customer support routing, document categorization, and content analysis.
The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows.
To iterate on ai_classify from a UI, see Classification.
Requirements
Apache 2.0 license
The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses, and Databricks recommends reviewing those licenses to confirm compliance with any applicable terms.
The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model developer terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks may change the models and update this page, including the list of applicable licenses.
- This function is available only in some regions. See AI function availability.
- This function is not available on Databricks SQL Classic.
- For pricing information, see the Databricks SQL pricing page.
- In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
- For batch inference workloads, Databricks recommends Databricks Runtime 15.4 LTS ML or above for improved performance.
Syntax
- Version 2 (recommended)
- Version 1

Version 2 syntax:

ai_classify(
  content VARIANT | STRING,
  labels STRING,
  [options MAP<STRING, STRING>]
) RETURNS VARIANT

Version 1 syntax:

ai_classify(
  content STRING,
  labels ARRAY<STRING>,
  [options MAP<STRING, STRING>]
) RETURNS STRING
Arguments
- Version 2 (recommended)
- Version 1

Version 2 arguments:

- content: A VARIANT or STRING expression. Accepts either:
  - Raw text as a STRING
  - A VARIANT produced by another AI function (such as ai_parse_document or ai_extract)
- labels: A STRING literal defining the classification labels. The labels can be:
  - Simple labels: a JSON array of label names. For example:

    ["urgent", "not_urgent"]

  - Labels with descriptions: a JSON object mapping label names to descriptions. Label descriptions must be 0-1,000 characters. For example:

    {
      "billing_error": "Payment, invoice, or refund issues",
      "product_defect": "Any malfunction, bug, or breakage",
      "account_issue": "Login failures, password resets"
    }

  Each label must be 1-100 characters, and labels must contain at least 2 labels and no more than 500 labels.
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types, falling back to "1.0".
  - instructions: A global description of the task and domain to improve classification quality. Must be less than 20,000 characters.
  - multilabel: Set to "true" to return multiple labels when multiple categories apply. The default is "false" (single-label classification).

Version 1 arguments:

- content: A STRING expression containing the text to be classified.
- labels: An ARRAY<STRING> literal with the expected output classification labels. Must contain at least 2 elements and no more than 20 elements. Each label must be 1-50 characters.
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types, falling back to "1.0".
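The version option can be set explicitly to pin behavior during migration, which is useful when the input types would otherwise cause the default to fall back to v1. A minimal sketch (the text and labels here are illustrative, not from a real schema):

> SELECT ai_classify(
    'Order arrived two weeks late.',
    '["shipping_issue", "billing_issue", "other"]',
    MAP('version', '2.0')
  );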
Returns
- Version 2 (recommended)
- Version 1

Version 2 returns:

Returns a VARIANT containing:

{
  "response": ["label_name"],   // Array with a single label (or multiple labels if multilabel is "true")
  "error_message": null         // null on success, or an error message on failure
}

The response field contains:
- Single-label mode (default): an array with one element containing the best matching label.
- Multi-label mode (multilabel: "true"): an array with multiple labels when multiple categories apply.
- Label names exactly match those provided in the labels parameter.

Returns NULL if content is NULL or if the content cannot be classified.

Version 1 returns:

Returns a STRING. The value matches one of the strings provided in the labels argument. Returns NULL if content is NULL or if the content cannot be classified.
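Because the version 2 result is a VARIANT, downstream queries typically extract fields from it. A sketch, assuming Databricks VARIANT path-extraction (:) and cast (::) syntax; the subquery is illustrative:

> SELECT
    result:response[0]::STRING AS label,
    result:error_message::STRING AS error_message
  FROM (
    SELECT ai_classify(
      'My password is leaked.',
      '["urgent", "not_urgent"]'
    ) AS result
  ) AS t;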
Examples
- Version 2 (recommended)
- Version 1
Version 2 examples:

Simple labels - label names only
> SELECT ai_classify(
'My password is leaked.',
'["urgent", "not_urgent"]'
);
{
"response": ["urgent"],
"error": null
}
Labels with descriptions
> SELECT ai_classify(
'Customer cannot complete checkout due to payment processing error.',
'{
"billing_error": "Payment, invoice, or refund issues",
"product_defect": "Any malfunction, bug, or breakage",
"account_issue": "Login failures, password resets",
"feature_request": "Customer suggestions for improvements"
}'
);
{
"response": ["billing_error"],
"error": null
}
Using global instructions
> SELECT ai_classify(
'User reports app crashes on startup after update.',
'["critical", "high", "medium", "low"]',
MAP('instructions', 'Classify bug severity based on user impact and frequency.')
);
{
"response": ["critical"],
"error": null
}
Multi-label classification
> SELECT ai_classify(
'Customer wants refund and reports product arrived broken.',
'{
"billing_issue": "Payment or refund requests",
"product_defect": "Damaged or malfunctioning items",
"shipping_issue": "Delivery problems"
}',
MAP('multilabel', 'true')
);
{
"response": ["billing_issue", "product_defect"],
"error": null
}
Composability with ai_parse_document
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/support/tickets/', format => 'binaryFile')
)
SELECT
path,
ai_classify(
parsed_content,
'["billing_error", "product_defect", "account_issue", "feature_request"]',
MAP('instructions', 'Customer support ticket classification.')
) AS ticket_category
FROM parsed_docs;
Batch classification
> SELECT
description,
ai_classify(
description,
'["clothing", "shoes", "accessories", "furniture", "electronics"]'
) AS category
FROM products
LIMIT 10;
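For larger batch jobs, it can help to persist results and then separate failures (rows with a non-null error_message) from successes. A sketch, assuming a hypothetical sink table named product_categories and Databricks VARIANT path-extraction syntax:

> CREATE OR REPLACE TABLE product_categories AS
  SELECT
    description,
    ai_classify(
      description,
      '["clothing", "shoes", "accessories", "furniture", "electronics"]'
    ) AS result
  FROM products;

> -- Inspect rows that failed to classify
  SELECT * FROM product_categories
  WHERE result:error_message::STRING IS NOT NULL;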
Version 1 examples:

> SELECT ai_classify("My password is leaked.", ARRAY("urgent", "not urgent"));
urgent
> SELECT
description,
ai_classify(description, ARRAY('clothing', 'shoes', 'accessories', 'furniture')) AS category
FROM
products
LIMIT 10;
Limitations
- Version 2 (recommended)
- Version 1
Version 2 limitations:

- This function is not available on Databricks SQL Classic.
- This function cannot be used with Views.
- Label names must be 1–100 characters each.
- The labels parameter must contain between 2 and 500 unique labels.
- Label descriptions must be 0–1,000 characters each.
- The maximum total context size is 128,000 tokens.

Version 1 limitations:

- This function is not available on Databricks SQL Classic.
- This function cannot be used with Views.
- Label names must be 1–50 characters each.
- The labels array must contain between 2 and 20 labels.
- The content input must be less than 128,000 tokens (about 300,000 characters).