ai_classify function
Applies to: Databricks SQL and Databricks Runtime
This functionality is in Public Preview and is HIPAA compliant.
During the preview:
- The underlying language model can handle several languages, but this AI Function is tuned for English.
- See Features with limited regional availability for AI Functions region availability.
The ai_classify() function classifies text content according to custom labels you provide. You can use simple label names for basic classification, or add label descriptions and instructions to improve accuracy for use cases like customer support routing, document categorization, and content analysis.
The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows.
To iterate on ai_classify from a UI, see Classification.
Requirements
Apache 2.0 license
The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses, and Databricks recommends reviewing those licenses to confirm compliance with any applicable terms.
The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model developer terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks may change the models and update this page, including the list of applicable licenses.
- This function is available only in some regions. See AI function availability.
- This function is not available on Databricks SQL Classic.
- For pricing information, see the Databricks SQL pricing page.
- In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
- For batch inference workloads, Databricks recommends Databricks Runtime 15.4 LTS ML or above for improved performance.
Syntax
- Version 2 (recommended)
- Version 1

Version 2 syntax:

ai_classify(
  content VARIANT | STRING,
  labels STRING,
  [options MAP<STRING, STRING>]
) RETURNS VARIANT

Version 1 syntax:

ai_classify(
  content STRING,
  labels ARRAY<STRING>,
  [options MAP<STRING, STRING>]
) RETURNS STRING
Arguments
- Version 2 (recommended)
- Version 1

Version 2 arguments:

- content: A VARIANT or STRING expression. Accepts either:
  - Raw text as a STRING
  - A VARIANT produced by another AI function (such as ai_parse_document or ai_extract)
- labels: A STRING literal defining the classification labels. The labels can be:
  - Simple labels: a JSON array of label names. For example:

    ["urgent", "not_urgent"]

  - Labels with descriptions: a JSON object mapping label names to descriptions. Label descriptions must be 0-1,000 characters. For example:

    {
      "billing_error": "Payment, invoice, or refund issues",
      "product_defect": "Any malfunction, bug, or breakage",
      "account_issue": "Login failures, password resets"
    }

  Each label must be 1-100 characters, and labels must contain at least 2 labels and no more than 500 labels.
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types, falling back to "1.0".
  - instructions: A global description of the task and domain to improve classification quality. Must be less than 20,000 characters.
  - multilabel: Set to "true" to return multiple labels when multiple categories apply. The default is "false" (single-label classification).

Version 1 arguments:

- content: A STRING expression containing the text to be classified.
- labels: An ARRAY<STRING> literal with the expected output classification labels. Must contain at least 2 elements and no more than 20 elements. Each label must be 1-50 characters.
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types, falling back to "1.0".
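The version option can be set explicitly to pin behavior during migration, which is useful when the input types would otherwise cause the default to fall back to v1. A minimal sketch (the text and labels here are illustrative, not from a real schema):

> SELECT ai_classify(
    'Order arrived two weeks late.',
    '["shipping_issue", "billing_issue", "other"]',
    MAP('version', '2.0')
  );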
Returns
- Version 2 (recommended)
- Version 1

Version 2 returns:

Returns a VARIANT containing:

{
  "response": ["label_name"],   // Array with a single label (or multiple labels if multilabel is "true")
  "error_message": null         // null on success, or an error message on failure
}

The response field contains:
- Single-label mode (default): an array with one element containing the best matching label.
- Multi-label mode (multilabel: "true"): an array with multiple labels when multiple categories apply.
- Label names exactly match those provided in the labels parameter.

Returns NULL if content is NULL or if the content cannot be classified.

Version 1 returns:

Returns a STRING. The value matches one of the strings provided in the labels argument. Returns NULL if content is NULL or if the content cannot be classified.
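Because the version 2 result is a VARIANT, downstream queries typically extract fields from it. A sketch, assuming Databricks VARIANT path-extraction (:) and cast (::) syntax; the subquery is illustrative:

> SELECT
    result:response[0]::STRING AS label,
    result:error_message::STRING AS error_message
  FROM (
    SELECT ai_classify(
      'My password is leaked.',
      '["urgent", "not_urgent"]'
    ) AS result
  ) AS t;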
Examples
- Version 2 (recommended)
- Version 1
Version 2 examples:

Simple labels - label names only
> SELECT ai_classify(
'My password is leaked.',
'["urgent", "not_urgent"]'
);
{
"response": ["urgent"],
"error": null
}
Labels with descriptions
> SELECT ai_classify(
'Customer cannot complete checkout due to payment processing error.',
'{
"billing_error": "Payment, invoice, or refund issues",
"product_defect": "Any malfunction, bug, or breakage",
"account_issue": "Login failures, password resets",
"feature_request": "Customer suggestions for improvements"
}'
);
{
"response": ["billing_error"],
"error": null
}
Using global instructions
> SELECT ai_classify(
'User reports app crashes on startup after update.',
'["critical", "high", "medium", "low"]',
MAP('instructions', 'Classify bug severity based on user impact and frequency.')
);
{
"response": ["critical"],
"error": null
}
Multi-label classification
> SELECT ai_classify(
'Customer wants refund and reports product arrived broken.',
'{
"billing_issue": "Payment or refund requests",
"product_defect": "Damaged or malfunctioning items",
"shipping_issue": "Delivery problems"
}',
MAP('multilabel', 'true')
);
{
"response": ["billing_issue", "product_defect"],
"error": null
}
Composability with ai_parse_document
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/support/tickets/', format => 'binaryFile')
)
SELECT
path,
ai_classify(
parsed_content,
'["billing_error", "product_defect", "account_issue", "feature_request"]',
MAP('instructions', 'Customer support ticket classification.')
) AS ticket_category
FROM parsed_docs;
Batch classification
> SELECT
description,
ai_classify(
description,
'["clothing", "shoes", "accessories", "furniture", "electronics"]'
) AS category
FROM products
LIMIT 10;
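For larger batch jobs, it can help to persist results and then separate failures (rows with a non-null error_message) from successes. A sketch, assuming a hypothetical sink table named product_categories and Databricks VARIANT path-extraction syntax:

> CREATE OR REPLACE TABLE product_categories AS
  SELECT
    description,
    ai_classify(
      description,
      '["clothing", "shoes", "accessories", "furniture", "electronics"]'
    ) AS result
  FROM products;

> -- Inspect rows that failed to classify
  SELECT * FROM product_categories
  WHERE result:error_message::STRING IS NOT NULL;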
Version 1 examples:

> SELECT ai_classify("My password is leaked.", ARRAY("urgent", "not urgent"));
urgent
> SELECT
description,
ai_classify(description, ARRAY('clothing', 'shoes', 'accessories', 'furniture')) AS category
FROM
products
LIMIT 10;
Limitations
- Version 2 (recommended)
- Version 1
Version 2 limitations:

- This function is not available on Databricks SQL Classic.
- This function cannot be used with Views.
- Label names must be 1–100 characters each.
- The labels parameter must contain between 2 and 500 unique labels.
- Label descriptions must be 0–1,000 characters each.
- The maximum total context size is 128,000 tokens.

Version 1 limitations:

- This function is not available on Databricks SQL Classic.
- This function cannot be used with Views.
- Label names must be 1–50 characters each.
- The labels array must contain between 2 and 20 labels.
- The content input must be less than 128,000 tokens (about 300,000 characters).