Data Classification

This feature is in Beta.

Data catalogs can hold vast amounts of data, often including both known and unknown sensitive data. Data teams need to understand what kind of sensitive data exists in each table so that they can both govern and democratize access to it.

To address this problem, Databricks Data Classification automatically classifies and tags tables in your catalog. This allows you to discover sensitive data, as well as apply governance controls over the results, using tools such as role-based access control (RBAC) and attribute-based access control (ABAC) policies in Unity Catalog.

With this feature, you can:

  • Classify data: The engine uses a compound AI system to automatically classify (and tag) any tables in Unity Catalog.
  • Optimize cost through smart scanning: The system intelligently determines when to scan your data by leveraging Unity Catalog and the Data Intelligence Engine. This means that scanning is incremental and optimized to ensure all new data is classified without manual configuration.
  • Review classifications: This preview provides an AI/BI dashboard to assist you in viewing classification results and downstream impact across your catalog.

For feedback or questions, contact us at data-classification-feedback@databricks.com.

Disclaimer

  • Databricks Data Classification uses a Databricks-hosted large language model (LLM) to assist with classification. Databricks implements security controls to protect your data. For details, see Data protection in Model Serving and Databricks AI features trust and safety.
  • Databricks Data Classification is available free of charge for a limited time for catalogs containing up to 1,000 tables. After this period, you will be charged for the compute used to run the classification engine.

Requirements

  • You must have serverless compute enabled. See Connect to serverless compute.
  • To enable data classification, you must have MANAGE, CREATE SCHEMA, and SELECT privileges on the catalog.
  • Data classification is only supported on standard catalogs.

Start data classification

To enable the feature:

  1. Navigate to any catalog and click the Details tab.

    [Image: Details tab for the catalog page in Catalog Explorer.]

  2. Click the Data Classification toggle to enable it.

  3. (Optional) Select the schemas you want to include in classification. By default, all schemas are included.

    [Image: Settings modal for Data Classification.]

This creates a background job that incrementally scans all tables in the catalog or selected schemas.

View classification results

To view classification results, click See results next to the toggle. A dashboard opens, showing the classification results for all tables in the catalog.

[Image: See results button for Data Classification.]

Overview

The Overview section shows the number of tables that have been classified, as well as the distribution of sensitive data classes across the catalog. You can filter the results by schema, table, or classification.

[Image: Overview section of the Data Classification dashboard.]

The dashboard is powered by views that provide access-controlled results: each user sees only the rows for tables that they have read access to (see the FAQ for more details).

Classification log

The Classification Log section shows a time-series chart of classifications. Use it to see the latest classifications and drill down by sensitive data class.

[Image: Classification time-series section of the Data Classification dashboard.]

It also provides a table with details for each classification, including:

  • Rationale: The reason why the classification was made. This can be due to a detection on the metadata or column name, a detection on the values, or a combination of both.
  • Match score: The approximate proportion of rows that matched the classification.
  • Sample values: A sample of the values that matched the classification. This is useful for understanding the context of the classification and verifying its accuracy.
  • Downstream assets: A list of downstream assets that are impacted by the classification, including jobs, notebooks, queries, and dashboards.
  • Active users: The number of active users for the table in the provided time range.

[Image: Classification Log section of the Data Classification dashboard.]
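
To make the match score concrete, here is a minimal Python sketch of "the approximate proportion of rows that matched". It is only an illustration: the regular expression is a hypothetical stand-in for the engine's detection logic, which this page does not describe, and excluding nulls from the denominator is an assumption.

```python
import re

# Hypothetical, simplified email pattern used only for this illustration.
# It is NOT the detection logic used by Data Classification.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def match_score(values, pattern=EMAIL_PATTERN):
    """Return the fraction of non-null values that fully match the pattern."""
    # Assumption: null values are excluded from the denominator.
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    matched = sum(1 for v in non_null if pattern.fullmatch(str(v)))
    return matched / len(non_null)


sample = ["ana@example.com", "bob@example.org", "n/a", None, "carol@example.net"]
score = match_score(sample)
print(f"match score: {score:.2f}")  # 3 of 4 non-null values match -> 0.75
```

A score well below 1.0 can mean the column mixes sensitive values with placeholders, which is one reason to check the sample values before acting on a classification.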

Scan failures

The Scan Failures section shows which tables failed to be classified. This can happen for a variety of reasons, and each table failure is accompanied by a detailed error message. For help resolving these errors, see the FAQ.

Tagging and governance controls

Data classification results can enable governance controls in multiple ways, including:

  • Sensitive data discovery: Classification results can be queried to discover sensitive data in your catalog and take appropriate action.
  • Row-level and column-level security: Classifications can produce tags that can be used by downstream policies to enforce row-level and column-level security using attribute-based access control (ABAC).
  • Table-level security: Classification results can be used to set up user groups and permissions to restrict access to sensitive tables and schemas.

Discover sensitive data

The results view in the dashboard helps you understand where sensitive data exists and how it is being used in your catalog. You can use this information to take appropriate action, such as automatically notifying table owners with a request to remove or remediate personally identifiable information (PII) from their table.

Row-level and column-level security

Data classification can automatically tag sensitive data using system tags. To do so:

  • You must be enrolled in the Tag Policies preview.
  • You must have ASSIGN privileges over system Tag Policies (any tags beginning with the class. prefix).
  • You must have APPLY TAG privileges on the catalog, schemas, and tables where the tags will be applied.

If you are enrolled in the ABAC preview, you can use the class. tags and masking functions in an ABAC policy to automatically mask any tagged data.

For example, you can create a policy that masks social security numbers for all users who do not belong to certain user groups.
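
The decision such a policy encodes can be sketched in plain Python. This models only the logic (mask unless the user belongs to an exempt group); it is not ABAC policy syntax or a Databricks API, and the group name `hr_admins` and the mask format are hypothetical.

```python
# Hypothetical group whose members may see unmasked values.
EXEMPT_GROUPS = frozenset({"hr_admins"})

# Hypothetical mask format; a real masking function may keep some digits.
MASKED_SSN = "***-**-****"


def mask_ssn(value, user_groups, exempt_groups=EXEMPT_GROUPS):
    """Return the value unchanged for exempt users, otherwise a fixed mask."""
    if value is None:
        return None  # nothing to mask
    if exempt_groups & set(user_groups):
        return value  # user belongs to at least one exempt group
    return MASKED_SSN


print(mask_ssn("123-45-6789", ["analysts"]))               # masked
print(mask_ssn("123-45-6789", ["analysts", "hr_admins"]))  # unmasked
```

In an actual policy, the tagged columns would be selected by their class. tag rather than passed in explicitly, so newly classified columns are covered without further changes.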

To learn more about enrolling in the Tag Policies or ABAC previews, reach out to your account representative or Databricks support.

Another option for enforcing column-level security is to apply column masks to a tagged column.

Table-level security

You can use classification results to implement table-level security using user groups and permissions. For example, you can create a user group called confidential and assign it to all tables that contain name classifications, and you can create a group called restricted and assign it to all tables that contain us_ssn.
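
As a rough sketch of the example above, the following Python turns classification results into GRANT statements for review. The input shape (pairs of table name and class), the table names, and the class-to-group mapping are assumptions made for illustration; the script only prints statements and does not run them.

```python
from collections import defaultdict

# Mapping taken from the example in the text: class -> user group.
CLASS_TO_GROUP = {"name": "confidential", "us_ssn": "restricted"}


def plan_grants(classifications, class_to_group=CLASS_TO_GROUP):
    """Map each table to the set of groups implied by its detected classes."""
    groups_by_table = defaultdict(set)
    for table, data_class in classifications:
        group = class_to_group.get(data_class)
        if group is not None:  # classes without a mapping are ignored
            groups_by_table[table].add(group)
    return dict(groups_by_table)


def grant_statements(groups_by_table):
    """Render one GRANT per (table, group) pair, sorted for stable output."""
    return [
        f"GRANT SELECT ON TABLE {table} TO `{group}`;"
        for table in sorted(groups_by_table)
        for group in sorted(groups_by_table[table])
    ]


# Hypothetical classification results: (fully qualified table, class).
results = [
    ("main.sales.customers", "name"),
    ("main.sales.customers", "email_address"),
    ("main.hr.payroll", "us_ssn"),
    ("main.hr.payroll", "name"),
]

for statement in grant_statements(plan_grants(results)):
    print(statement)
```

Granting access to a group does not by itself restrict anyone else, so a real rollout would also need to review and revoke broader existing privileges on the same tables.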

Frequently asked questions

How long does data classification take to run?

The classification engine relies on smart scanning to determine when to scan a table. You can expect that new tables and columns in your catalog will be scanned within 24 hours of being created.

If you're experiencing more than 24 hours of delay, contact us at data-classification-feedback@databricks.com.

What are the permissions on the result tables created?

Data classification creates tables to store results and errors (_result and _errors, respectively). By default, these tables are accessible only to the user who set up classification.

Dynamic views are also created over these tables with row-level access controls applied, so users reading results from these views see only entries for tables that they already own or have read access to.

Some tables failed to be classified; how do I figure out what went wrong?

By default, tables that fail classification are skipped and retried the following day. You can use the errors view to see the exact error message that caused classification to fail.

SELECT * FROM <catalog_name>._data_classification.errors
WHERE schema_name = '<schema_name>' AND table_name = '<table_name>';
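
If you check failures for many tables, it can help to generate this query instead of editing it by hand. The helper below is a minimal sketch: the function name is hypothetical, and it assumes simple identifiers (letters, digits, underscores), so names that need backtick quoting are rejected rather than escaped.

```python
import re

# Assumption: only plain identifiers are accepted, to avoid building
# a query from unexpected input.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    """Raise ValueError unless the name is a plain identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def errors_query(catalog, schema=None, table=None):
    """Build the errors-view query, optionally filtered by schema and table."""
    query = f"SELECT * FROM {_check(catalog)}._data_classification.errors"
    filters = [
        f"{column} = '{_check(value)}'"
        for column, value in (("schema_name", schema), ("table_name", table))
        if value is not None
    ]
    if filters:
        query += "\nWHERE " + " AND ".join(filters)
    return query


print(errors_query("main", "sales", "customers"))
```

The resulting string can then be run wherever you normally run SQL against the catalog, for example in a notebook or the SQL editor.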

Does data classification support views?

Views are not supported. If the view is based on existing tables, Databricks recommends classifying the underlying tables to see if they contain sensitive data.

Materialized views and streaming tables are supported.

Does data classification support Delta Sharing catalogs?

Catalogs shared using Delta Sharing are not supported. Instead, Databricks recommends sharing schemas and tables inside an existing catalog to classify sensitive data.

Supported classes

The following table lists the classes supported by Data Classification:

| Class | Description |
| --- | --- |
| "credit_card" | Credit card number |
| "email_address" | Email address |
| "iban_code" | International Bank Account Number (IBAN) |
| "ip_address" | Internet Protocol Address (IPv4 or IPv6) |
| "location" | Location |
| "name" | Name of a person |
| "phone_number" | Phone number |
| "us_bank_number" | US bank number |
| "us_driver_license" | US driver license |
| "us_itin" | US Individual Taxpayer Identification Number |
| "us_passport" | US Passport |
| "us_ssn" | US Social Security Number |