
ai_classify function

Applies to: Databricks SQL and Databricks Runtime

Preview

This functionality is in Public Preview and is HIPAA compliant.

The ai_classify() function classifies text content according to custom labels you provide. You can use simple label names for basic classification, or add label descriptions and instructions to improve accuracy for use cases like customer support routing, document categorization, and content analysis.

The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows.

To iterate on ai_classify in a UI, see Classification.

Requirements

Apache 2.0 license

The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses, and Databricks recommends reviewing those licenses to confirm compliance with any applicable terms.

The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model developer terms for information about which models are available on Databricks and the licenses and policies that govern their use.

If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the models and update this documentation, including the list of applicable licenses on this page.

  • This function is only available in some regions, see AI function availability.
  • This function is not available on Databricks SQL Classic.
  • For pricing, see the Databricks SQL pricing page.
  • In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
  • Batch inference workloads require Databricks Runtime 15.4 ML LTS or above for improved performance.

Syntax

SQL
ai_classify(
  content VARIANT | STRING,
  labels STRING,
  [options MAP<STRING, STRING>]
) RETURNS VARIANT

Arguments

  • content: A VARIANT or STRING expression. Accepts either plain text, or VARIANT output from another AI function such as ai_parse_document.

  • labels: A STRING literal defining the classification labels. The labels can be:

    • Simple labels: A JSON array of label names.
      JSON
      ["urgent", "not_urgent"]
    • Labels with descriptions: A JSON object mapping label names to descriptions. Label descriptions must be 0-1000 characters.
      JSON
      {
        "billing_error": "Payment, invoice, or refund issues",
        "product_defect": "Any malfunction, bug, or breakage",
        "account_issue": "Login failures, password resets"
      }

    Each label must be 1-100 characters. labels must contain at least 2 labels, and no more than 500 labels.

  • options: An optional MAP<STRING, STRING> containing configuration options:

    • version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is chosen based on the input types, falling back to "1.0" otherwise.
    • instructions: Global description of the task and domain to improve classification quality. Must be less than 20,000 characters.
    • multilabel: Set to "true" to return multiple labels when multiple categories apply. Default is "false" (single-label classification).
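The options above can be combined in a single MAP. As a sketch, the following pins the version, supplies task instructions, and enables multi-label output (the input text and label set here are illustrative):

SQL
> SELECT ai_classify(
    'Order arrived late and the box was damaged.',
    '["shipping_issue", "product_defect"]',
    MAP(
      'version', '2.0',
      'instructions', 'Classify customer complaints about recent orders.',
      'multilabel', 'true'
    )
  );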

Returns

Returns a VARIANT containing:

JSON
{
  "response": ["label_name"],  // Array with a single label (or multiple if multilabel=true)
  "error_message": null        // null on success, or an error message on failure
}

The response field contains:

  • Single-label mode (default): An array with one element containing the best matching label
  • Multi-label mode (multilabel: "true"): An array with multiple labels when multiple categories apply
  • Label names exactly match those provided in the labels parameter

Returns NULL if content is NULL or if the content cannot be classified.
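Because the result is a VARIANT, individual fields can be extracted with the : path operator and cast to a SQL type. A minimal sketch, assuming a hypothetical tickets table with a ticket_text column:

SQL
> WITH classified AS (
    SELECT ai_classify(ticket_text, '["urgent", "not_urgent"]') AS result
    FROM tickets
  )
  SELECT
    result:response[0]::STRING AS label,          -- first (or only) label
    result:error_message::STRING AS error_message -- null on success
  FROM classified;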

Examples

Simple labels - label names only

SQL
> SELECT ai_classify(
    'My password is leaked.',
    '["urgent", "not_urgent"]'
  );
{
  "response": ["urgent"],
  "error_message": null
}

Labels with descriptions

SQL
> SELECT ai_classify(
    'Customer cannot complete checkout due to payment processing error.',
    '{
      "billing_error": "Payment, invoice, or refund issues",
      "product_defect": "Any malfunction, bug, or breakage",
      "account_issue": "Login failures, password resets",
      "feature_request": "Customer suggestions for improvements"
    }'
  );
{
  "response": ["billing_error"],
  "error_message": null
}

Using global instructions

SQL
> SELECT ai_classify(
    'User reports app crashes on startup after update.',
    '["critical", "high", "medium", "low"]',
    MAP('instructions', 'Classify bug severity based on user impact and frequency.')
  );
{
  "response": ["critical"],
  "error_message": null
}

Multi-label classification

SQL
> SELECT ai_classify(
    'Customer wants refund and reports product arrived broken.',
    '{
      "billing_issue": "Payment or refund requests",
      "product_defect": "Damaged or malfunctioning items",
      "shipping_issue": "Delivery problems"
    }',
    MAP('multilabel', 'true')
  );
{
  "response": ["billing_issue", "product_defect"],
  "error_message": null
}

Composability with ai_parse_document

SQL
> WITH parsed_docs AS (
    SELECT
      path,
      ai_parse_document(
        content,
        MAP('version', '2.0')
      ) AS parsed_content
    FROM READ_FILES('/Volumes/support/tickets/', format => 'binaryFile')
  )
  SELECT
    path,
    ai_classify(
      parsed_content,
      '["billing_error", "product_defect", "account_issue", "feature_request"]',
      MAP('instructions', 'Customer support ticket classification.')
    ) AS ticket_category
  FROM parsed_docs;

Batch classification

SQL
> SELECT
    description,
    ai_classify(
      description,
      '["clothing", "shoes", "accessories", "furniture", "electronics"]'
    ) AS category
  FROM products
  LIMIT 10;
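When classifying at scale, it can help to extract the label and filter out rows that failed before persisting results. A sketch under the same assumptions as the batch example above (the product_categories target table is illustrative):

SQL
> CREATE OR REPLACE TABLE product_categories AS
  WITH classified AS (
    SELECT
      description,
      ai_classify(
        description,
        '["clothing", "shoes", "accessories", "furniture", "electronics"]'
      ) AS result
    FROM products
  )
  SELECT
    description,
    result:response[0]::STRING AS category
  FROM classified
  WHERE result:error_message IS NULL;  -- keep only successful classifications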

Limitations

Version 2 limitations:

  • This function is not available on Databricks SQL Classic.

  • This function cannot be used with Views.

  • Label names must be 1–100 characters each.

  • The labels parameter must contain between 2 and 500 unique labels.

  • Label descriptions must be 0–1,000 characters each.

  • The maximum total context size is 128,000 tokens.