Data Classification

Beta

This feature is in Beta.

Data catalogs can have a vast amount of data, often containing known and unknown sensitive data. It is critical for data teams to understand what kind of sensitive data exists in each table so that they can both govern and democratize access to this data.

To address this problem, Databricks Data Classification automatically classifies and tags tables in your catalog. This allows you to discover sensitive data and to apply governance controls over the results, using tools such as attribute-based access control (ABAC) policies in Unity Catalog.

Using this feature, you can:

Classify data: The engine uses a compound AI system to automatically classify (and tag) any tables in Unity Catalog.
Optimize cost through smart scanning: The system intelligently determines when to scan your data by leveraging Unity Catalog and the Data Intelligence Engine. This means that scanning is incremental and optimized to ensure all new data is classified without manual configuration.
Review classifications: This preview provides an AI/BI dashboard to assist you in viewing classification results and downstream impact across your catalog.

For feedback or questions, contact us at data-classification-feedback@databricks.com.

note

Databricks Data Classification uses a Databricks-hosted large language model (LLM) to assist with classification. Databricks implements security controls to protect your data. For details, see Data protection in Model Serving and Databricks AI features trust and safety.

Requirements

You must have serverless compute enabled. See Connect to serverless compute.
To enable data classification, you must have MANAGE, CREATE SCHEMA, and SELECT privileges on the catalog.
Data classification is only supported on standard catalogs.
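
As a rough sketch of the privilege requirement above, a catalog owner or metastore admin could grant the three privileges with SQL. The catalog name `main` and the group `data_stewards` are hypothetical placeholders, not names taken from this page.

```sql
-- Hypothetical catalog (main) and group (data_stewards); substitute your own.
GRANT MANAGE ON CATALOG main TO `data_stewards`;
GRANT CREATE SCHEMA ON CATALOG main TO `data_stewards`;
GRANT SELECT ON CATALOG main TO `data_stewards`;
```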

Enable data classification

Navigate to any catalog and click on the Details tab.
Click the Data Classification toggle to enable it.
(Optionally) Select the schemas you want to include for classification. By default, all schemas are included.

This creates a background job that incrementally scans all tables in the catalog or selected schemas.

View classification results

To view classification results, click See results next to the toggle. A dashboard opens, showing the classification results for all tables in the catalog.

See results button for Data Classification.

Overview

The Overview section shows the number of tables that have been classified, as well as the distribution of sensitive data classes across the catalog. You can filter the results by schema, table, or classification.

Overview section of the Data Classification dashboard.

The dashboard is powered by views that provide access-controlled results: each user sees only the rows for tables that they have read access to (see the FAQ for details).

Classification log

The Classification Log section shows a time-series chart of classifications over time. This enables you to see the latest classifications and drill down by sensitive data class.

Classification time-series section of the Data Classification dashboard.

It also provides a table with details for each classification, including:

Rationale: The reason the classification was made. This can be a detection on the metadata or column name, a detection on the values, or a combination of both.
Match score: The approximate proportion of rows that matched the classification.
Sample values: A sample of the values that matched the classification. This is useful for understanding the context of the classification and verifying its accuracy.
Downstream assets: A list of downstream assets that are impacted by the classification, including jobs, notebooks, queries, and dashboards.
Active users: The number of active users for the table in the provided time range.

Classification Log section of the Data Classification dashboard.

Scan failures

The Scan Failures section shows which tables failed to be classified. This can happen for a variety of reasons, and each table failure is accompanied by a detailed error message. For help resolving these errors, see the FAQ.

Tagging and governance controls

Data classification results can enable governance controls in multiple ways, including:

Sensitive data discovery: Classification results can be queried to discover sensitive data in your catalog and take appropriate action.
Row-level and column-level security: Classifications can produce tags that can be used by downstream policies to enforce row-level and column-level security using attribute-based access control (ABAC).
Table-level security: Classification results can be used to set up user groups and permissions to restrict access to sensitive tables and schemas.

Discover sensitive data

The results view in the dashboard helps you understand where sensitive data exists and how it is being used in your catalog. You can use this information to take appropriate action, such as automatically notifying table owners with a request to remove or remediate personally identifiable information (PII) from their table.
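
Outside the dashboard, one possible way to list where sensitive data was found is to query Unity Catalog's tag metadata. The sketch below assumes that automatic tagging is enabled, that the tags are applied at the column level, and that you can read `system.information_schema.column_tags`. The `class.` prefix comes from this page; everything else is an illustrative assumption.

```sql
-- List columns that carry a Data Classification system tag.
-- Assumes tags were applied automatically and begin with the class. prefix.
SELECT
  catalog_name,
  schema_name,
  table_name,
  column_name,
  tag_name
FROM
  system.information_schema.column_tags
WHERE
  tag_name LIKE 'class.%'
ORDER BY
  catalog_name,
  schema_name,
  table_name,
  column_name;
```

The output could feed a notification workflow, for example by joining it to table ownership information.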

Row-level and column-level security

Data classification can automatically tag sensitive data using system tags. To do so:

You must have ASSIGN privileges on system governed tags (any tags beginning with the class. prefix).
You must have APPLY TAG privileges on the catalog, schemas, and tables where the tags will be applied.

If you have enabled the ABAC beta, you can use the class. tags and masking functions in an ABAC policy to automatically mask any tagged data.

For example, you can create a policy that masks social security numbers for all users who do not belong to certain user groups.
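
A policy of this kind might look like the following sketch. It assumes the ABAC beta is enabled and that the `CREATE POLICY` syntax matches the current beta documentation, which may change. The catalog `main`, schema `governance`, function `mask_ssn`, group `hr_admins`, and the exact tag name `class.us_ssn` are all hypothetical.

```sql
-- Hypothetical masking function: always returns a redacted value.
CREATE OR REPLACE FUNCTION main.governance.mask_ssn(ssn STRING)
RETURNS STRING
RETURN '***-**-****';

-- Hypothetical ABAC policy: mask every column tagged class.us_ssn
-- for all account users except members of hr_admins.
CREATE POLICY mask_us_ssn
ON CATALOG main
COMMENT 'Mask columns classified as US Social Security Numbers'
COLUMN MASK main.governance.mask_ssn
TO `account users`
EXCEPT `hr_admins`
FOR TABLES
MATCH COLUMNS hasTag('class.us_ssn') AS ssn
ON COLUMN ssn;
```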

To enable the ABAC beta, see Enable ABAC.

Another option for enforcing column-level security is to apply column masks to a tagged column.
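
As a minimal sketch of that alternative, a column mask can be attached directly to a single column. The table `main.hr.employees`, its `ssn` column, the function name, and the group `hr_admins` are hypothetical.

```sql
-- Hypothetical mask: members of hr_admins see the real value, everyone else sees a redacted one.
CREATE OR REPLACE FUNCTION main.governance.ssn_mask(ssn STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('hr_admins') THEN ssn
  ELSE '***-**-****'
END;

-- Attach the mask to one tagged column.
ALTER TABLE main.hr.employees
  ALTER COLUMN ssn SET MASK main.governance.ssn_mask;
```

Unlike an ABAC policy, this mask applies only to the column it is attached to, so it has to be repeated for each tagged column.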

Table-level security

You can use classification results to implement table-level security using user groups and permissions. For example, you can create a user group called confidential and assign it to all tables that contain name classifications, and you can create a group called restricted and assign it to all tables that contain us_ssn.
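
The grants behind that example could be sketched as follows, assuming the groups already exist at the account level. The table names are hypothetical; `confidential` and `restricted` are the group names used in the example above.

```sql
-- Hypothetical table that contains name classifications.
GRANT SELECT ON TABLE main.crm.contacts TO `confidential`;

-- Hypothetical table that contains us_ssn classifications.
GRANT SELECT ON TABLE main.hr.employees TO `restricted`;

-- Remove broader access that would bypass the restriction (hypothetical group).
REVOKE SELECT ON TABLE main.hr.employees FROM `all_analysts`;
```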

How to handle false positives

If data is incorrectly tagged, you can manually delete the tag. The tag will not be reapplied in future scans.
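
One way to delete a column-level tag is with SQL, as in this sketch. The table, the column, and the exact tag name `class.name` are hypothetical, and the statement assumes you hold the tag privileges described earlier.

```sql
-- Remove an incorrect classification tag from a single column.
ALTER TABLE main.crm.products
  ALTER COLUMN product_name UNSET TAGS ('class.name');
```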

View Data Classification expenses

To understand how Data Classification is billed, visit the pricing page. You can view expenses related to Data Classification either by running a query or viewing the usage dashboard.

View usage from the system table `system.billing.usage`

To check Data Classification expenses, use a query similar to the following:

SQL
SELECT
   usage_date,
   identity_metadata.run_as AS run_as_user,
   SUM(usage_quantity) AS dbus
FROM
   system.billing.usage
WHERE
   usage_date >= DATE_SUB(CURRENT_DATE(), 30)
  AND billing_origin_product = 'DATA_CLASSIFICATION'
GROUP BY
   usage_date,
   identity_metadata.run_as
ORDER BY
   usage_date DESC,
   run_as_user;
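
To turn DBUs into an approximate cost, the usage rows can be joined to list prices. This is a sketch that assumes you can read `system.billing.list_prices`; it reflects list prices only, not any negotiated discounts.

```sql
-- Estimate the list-price cost of Data Classification over the last 30 days.
SELECT
  u.usage_date,
  SUM(u.usage_quantity * lp.pricing.default) AS estimated_list_cost
FROM
  system.billing.usage AS u
  JOIN system.billing.list_prices AS lp
    ON u.cloud = lp.cloud
    AND u.sku_name = lp.sku_name
    AND u.usage_start_time >= lp.price_start_time
    AND (u.usage_end_time <= lp.price_end_time OR lp.price_end_time IS NULL)
WHERE
  u.usage_date >= DATE_SUB(CURRENT_DATE(), 30)
  AND u.billing_origin_product = 'DATA_CLASSIFICATION'
GROUP BY
  u.usage_date
ORDER BY
  u.usage_date DESC;
```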

View usage from the usage dashboard

If you already have a usage dashboard configured in your workspace, you can filter the usage by selecting the Billing Origin Product labeled 'Data Classification'. If you do not have a usage dashboard configured, you can import one and apply the same filter. For details, see Usage dashboards.

Frequently asked questions

How long does data classification take to run?

The classification engine relies on smart scanning to determine when to scan a table. You can expect that new tables and columns in your catalog will be scanned within 24 hours of being created.

If you're experiencing more than 24 hours of delay, contact us at data-classification-feedback@databricks.com.

What are the permissions on the result tables created?

Data classification creates tables to store results and errors (_result and _errors, respectively). By default, these tables are accessible only to the user who set up classification.

Dynamic views with row-level access controls are also created over these tables, so users reading results from these views see only the entries for tables that they already own or have read access to.

Some tables failed to be classified; how do I figure out what went wrong?

By default, tables that fail to classify are skipped and retried the following day. You can use the errors view to see the exact error message that caused classification to fail.

SELECT *
FROM <catalog_name>._data_classification.errors
WHERE schema_name = '<schema_name>'
  AND table_name = '<table_name>';
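
To see where failures are concentrated before drilling into a single table, a summary along these lines may help. It uses only the `schema_name` and `table_name` columns shown in the query above; `<catalog_name>` is a placeholder.

```sql
-- Count distinct failed tables per schema in the errors view.
SELECT
  schema_name,
  COUNT(DISTINCT table_name) AS failed_tables
FROM
  <catalog_name>._data_classification.errors
GROUP BY
  schema_name
ORDER BY
  failed_tables DESC;
```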

Does data classification support views?

Views and metric views are not supported. If the view is based on existing tables, Databricks recommends classifying the underlying tables to see if they contain sensitive data.

Materialized views and streaming tables are supported.

Does data classification support Delta Sharing catalogs?

Catalogs shared using Delta Sharing are not supported. Instead, Databricks recommends sharing schemas and tables inside an existing catalog to classify sensitive data.

Supported classes

The following table lists the classes supported by Data Classification:

Class	Description
"credit_card"	Credit card number
"email_address"	Email address
"iban_code"	International Bank Account Number (IBAN)
"ip_address"	Internet Protocol Address (IPv4 or IPv6)
"location"	Location
"name"	Name of a person
"phone_number"	Phone number
"us_bank_number"	US bank number
"us_driver_license"	US driver license
"us_itin"	US Individual Taxpayer Identification Number
"us_passport"	US Passport
"us_ssn"	US Social Security Number
