Data governance with Databricks

This page gives an overview of how to govern data using Unity Catalog in Databricks.

Note

This page focuses on the governance of data. Related security topics, such as the following, are covered in Security and compliance:

  • Authentication and access control
  • Network configuration
  • Data security and encryption
  • Privacy and compliance

What is Unity Catalog?

Unity Catalog is a centralized data catalog that provides fine-grained access control for tabular and unstructured data in multiple formats on multiple platforms, along with governance of AI assets like machine learning models. It also includes the tools you need to discover data, track usage, capture lineage, and monitor data quality.

Unity Catalog is open-source and supports multiple platforms. It is deeply integrated into Databricks.

See What is Unity Catalog?

The Unity Catalog data governance model

Data governance with Unity Catalog provides the following:

  • Data unification: a unified view of all data and AI assets, across platforms, reducing duplication and sprawl.
  • Data access control: tools to ensure that data is easy to access, but only for the right users.
  • Data discoverability: tools that make it easy to find the data you need.
  • Data quality: tools to ensure that data is accurate, complete, consistent, and secure throughout its lifecycle.
  • Data collaboration and sharing: the ability to share data securely not just within your organization but across organizational and platform boundaries.
  • Auditing: tools that capture who uses the data and how.

This page explains how your organization can meet these needs using Unity Catalog in Databricks.

Data access control

To make sure that users access only the data they should, Unity Catalog provides a hierarchical privilege model. You can grant users, groups, and service principals access to data and AI assets at any level, from the account down to individual table rows and columns. You can control access to assets that are stored in dedicated Unity Catalog storage or in other platforms, such as cloud storage or database systems. The key is that Unity Catalog gives your users potential access to all of your data from within Databricks, no matter where the data is stored, and that it controls their access and tracks their data usage.
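As a minimal sketch of the hierarchy described above, the following Python helper builds the three Unity Catalog SQL GRANT statements that a group typically needs in order to read a single table: USE CATALOG on the parent catalog, USE SCHEMA on the parent schema, and SELECT on the table. The catalog, schema, table, and group names are hypothetical, and `read_grants` is an illustrative helper, not part of any Databricks library. It only formats strings that you could then run in a notebook or the SQL editor.

```python
def quote(identifier: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def read_grants(catalog: str, schema: str, table: str, principal: str) -> list[str]:
    """Return the GRANT statements needed to read one table.

    Hypothetical helper: reading a table requires a privilege at each
    level of the hierarchy, not only on the table itself.
    """
    cat = quote(catalog)
    sch = f"{cat}.{quote(schema)}"
    tbl = f"{sch}.{quote(table)}"
    who = quote(principal)
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {sch} TO {who}",
        f"GRANT SELECT ON TABLE {tbl} TO {who}",
    ]


if __name__ == "__main__":
    # Placeholder names, not objects defined on this page.
    for statement in read_grants("main", "sales", "orders", "analysts"):
        print(statement)
```

Granting to a group rather than to individual users keeps the number of grants small as people join or leave a team.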

  • Manage privileges: Learn about the securable objects that Unity Catalog manages and how to control access to them.
  • Manage identities: Learn how to manage identities in the context of Unity Catalog.
  • Fine-grained access control: Learn how to control access to table data using row filters and column masks.
  • Manage access to external storage and data platforms: Learn how to control access to cloud storage, external data platforms, and external non-data services using Unity Catalog.
  • Manage access from external platforms: Learn how Unity Catalog can manage access to your data from external platforms that use the Apache Iceberg or open-source Unity Catalog APIs.
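To make the row filter and column mask tasks listed above more concrete, here is a hedged sketch that builds the SQL involved. A row filter is a SQL function attached to a table with `ALTER TABLE ... SET ROW FILTER`, and a column mask is a SQL function attached to a column with `ALTER TABLE ... ALTER COLUMN ... SET MASK`. The helper names, table names, function names, and group names below are all hypothetical, and the sketch assumes trusted inputs (it does not escape them).

```python
def row_filter_sql(table: str, filter_fn: str, column: str,
                   admin_group: str, visible_value: str) -> list[str]:
    """Statements that hide rows unless `column` equals `visible_value`.

    Members of `admin_group` see every row. Illustrative only: inputs
    are interpolated as-is, so pass trusted values.
    """
    return [
        f"CREATE OR REPLACE FUNCTION {filter_fn}({column} STRING) "
        f"RETURN is_account_group_member('{admin_group}') "
        f"OR {column} = '{visible_value}'",
        f"ALTER TABLE {table} SET ROW FILTER {filter_fn} ON ({column})",
    ]


def column_mask_sql(table: str, mask_fn: str, column: str,
                    admin_group: str, masked_value: str) -> list[str]:
    """Statements that show `masked_value` to everyone outside `admin_group`."""
    return [
        f"CREATE OR REPLACE FUNCTION {mask_fn}({column} STRING) "
        f"RETURN CASE WHEN is_account_group_member('{admin_group}') "
        f"THEN {column} ELSE '{masked_value}' END",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_fn}",
    ]


if __name__ == "__main__":
    # Placeholder objects and groups, chosen only for the example.
    statements = (
        row_filter_sql("main.sales.orders", "main.sales.region_filter",
                       "region", "admins", "US")
        + column_mask_sql("main.hr.users", "main.hr.ssn_mask",
                          "ssn", "hr", "***-**-****")
    )
    for statement in statements:
        print(statement + ";")
```

Because the filter and mask are ordinary SQL functions, the same function can be reused across several tables that share a column.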

Data discoverability

Databricks and Unity Catalog provide the following tools to help users find the data they need:

  • Catalog Explorer: Browse and search for data and AI assets using asset names and metadata such as comments and tags.
  • Catalog browsers: Find data and AI assets using browsers that are built into the notebook and SQL query editors. See Navigate the Databricks notebook and file editor and Write queries and explore data in the SQL editor.
  • AI-generated comments: Automatically generate documentation of data and AI assets to assist discoverability.
  • Table insights: Use a UI built into Catalog Explorer to view the most frequent users and queries of any table in Unity Catalog.
  • Data lineage: Capture and visualize the way data flows through your organization. For feature and model lineage, see Feature governance and lineage.
  • Entity relationship diagrams (ERD): Display relationships for tables that have foreign keys defined.

See also Discover data.

Data quality monitoring

Tools for ensuring data quality and data integrity are deeply integrated into Delta Lake, Apache Spark, and Databricks. You can learn about them throughout the Databricks documentation.

Unity Catalog adds the following:

  • Lakehouse Monitoring: A data monitoring tool that captures the statistical properties and quality of the data in all of the tables in your account. You can also use it to track the performance of machine learning models and model-serving endpoints by monitoring inference tables that contain model inputs and predictions.
  • Certified and deprecated system tags: Label securable objects, such as catalogs, schemas, and tables, with indicators of data quality or lifecycle status. These system tags help organizations enforce governance, improve data discoverability, and increase trust in analytics and AI applications.

Data collaboration and sharing

Unity Catalog lets your users collaborate on the same data across all of your account's workspaces in the same region. When you require collaboration across workspace regions, across organizations, and across platforms, Unity Catalog provides the foundation for the following sharing tools.

  • Delta Sharing: A secure data sharing platform that lets you share data and AI assets in Databricks with users outside your organization, whether those users use Databricks or not.
  • Clean Rooms: A Databricks-managed environment where multiple participants on Databricks and non-Databricks platforms can collaborate on projects without sharing underlying data with each other.
  • Databricks Marketplace: An open forum for exchanging data and AI products. It also provides a private data exchange.
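As a rough illustration of how Delta Sharing builds on Unity Catalog, a provider typically creates a share, adds tables to it, creates a recipient, and grants the recipient access to the share. The sketch below only assembles those SQL statements. The share, table, and recipient names are hypothetical, `share_table_sql` is an illustrative helper and not a Databricks API, and the sketch assumes trusted, unescaped inputs.

```python
def share_table_sql(share: str, table: str, recipient: str) -> list[str]:
    """Return the statements a provider might run to share one table.

    Illustrative only. Creating a recipient this way assumes the
    recipient is identified by name alone; other recipient types
    need extra clauses that this sketch leaves out.
    """
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


if __name__ == "__main__":
    # Placeholder names, chosen only for the example.
    for statement in share_table_sql("quarterly_sales", "main.sales.orders", "partner_co"):
        print(statement + ";")
```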

Auditing

Audit logs capture fine-grained details about who accessed a given dataset and the actions that they performed. Unity Catalog adds system tables, the easiest way to access and query your account's audit logs.

See Audit log reference and Monitor account activity with system tables.
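For a sense of what querying audit logs through system tables can look like, the following sketch builds a SQL query that counts recent Unity Catalog events per user and action. It assumes the audit system table is available as `system.access.audit` with `event_date`, `service_name`, `action_name`, and `user_identity.email` columns. Check these names against the Audit log reference before relying on them. The helper `audit_summary_query` is hypothetical.

```python
def audit_summary_query(days: int = 7) -> str:
    """Build a query summarizing Unity Catalog audit events.

    Assumes the `system.access.audit` table and the column names used
    below; both should be verified against the audit log reference.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return (
        "SELECT user_identity.email AS user_email,\n"
        "       action_name,\n"
        "       COUNT(*) AS events\n"
        "FROM system.access.audit\n"
        "WHERE service_name = 'unityCatalog'\n"
        f"  AND event_date >= date_sub(current_date(), {days})\n"
        "GROUP BY user_identity.email, action_name\n"
        "ORDER BY events DESC"
    )


if __name__ == "__main__":
    print(audit_summary_query(30))
```

In a Databricks notebook, the returned string could be passed to `spark.sql(...)` by a user who has been granted access to the system tables.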

Legacy Databricks data governance tools

Databricks also provides these legacy governance features. Databricks recommends that you use Unity Catalog instead.

  • Table access control: A legacy data governance model that lets you programmatically grant and revoke access to objects managed by your workspace’s built-in Hive metastore.
  • IAM role credential passthrough: A legacy data governance feature that allows users to authenticate automatically to S3 buckets from Databricks clusters using the identity that they use to log in to Databricks.

Next steps