Data governance with Databricks

Data governance is a framework of policies, processes, roles, and technical controls that ensures your organization's data is secure, trustworthy, and used responsibly throughout its lifecycle. Effective data governance enables you to maintain data quality, protect sensitive information, meet regulatory requirements, and maximize the value of your data assets.

Key components of data governance include:

Access control and security: Implementing fine-grained permissions and security measures to protect data from unauthorized access while enabling appropriate use.
Data lineage and observability: Tracking data flows and transformations to understand data origins, dependencies, and usage patterns.
Data quality management: Ensuring data is accurate, complete, consistent, and reliable for decision-making and analytics.
Metadata management: Capturing and maintaining information about data assets to improve discoverability and understanding.
Compliance enforcement: Meeting regulatory requirements and organizational policies for data privacy, retention, and usage.

This page focuses on the governance of data using Unity Catalog in Databricks. Related security topics, such as authentication, network configuration, data encryption, and privacy compliance, are covered in Security and compliance and Compliance overview.

The Unity Catalog data governance model

Unity Catalog is a centralized data catalog that governs both structured and unstructured data in multiple formats. It also provides fine-grained access control and governance for AI assets such as machine learning models. Unity Catalog is open source, supports multiple platforms, and is deeply integrated into Databricks.

Unity Catalog is a complete data governance solution that provides the following:

Data unification: a unified view of all data and AI assets, across platforms, reducing duplication and sprawl.
Data access control: tools to ensure that data is accessible, but only for the right users.
Data discoverability: tools that make it easy to find the data you need.
Data quality: tools to ensure that data is accurate, complete, consistent, and secure throughout its lifecycle.
Data collaboration and sharing: tools to share data securely not just within your organization but across organizational and platform boundaries.
Auditing: tools that capture who uses the data and how.

This page explains how your organization can meet these needs using Unity Catalog in Databricks.

Data access control

To ensure that users access only the data they should, Unity Catalog provides a hierarchical privilege model. You can grant users, groups, and service principals access to data and AI assets at any level, from the account down to individual table rows and columns. You can control access to assets stored in dedicated Unity Catalog storage or in other platforms, such as cloud storage or database systems. Wherever your data resides, Unity Catalog can make it available to your users from within Databricks while controlling their access and tracking their data usage.
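
As a rough illustration of the hierarchical privilege model, the following sketch renders the SQL statements that give a group read-only access to one table. A `SELECT` grant on the table is paired with `USE CATALOG` and `USE SCHEMA` on its parent containers. The catalog, schema, table, and group names, and both helper functions, are hypothetical and exist only for this example.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_only_grants(catalog: str, schema: str, table: str, principal: str) -> List[str]:
    """Return GRANT statements that let a principal query a single table.

    The principal needs USE CATALOG and USE SCHEMA on the parent
    containers in addition to SELECT on the table itself.
    """
    p = quote_ident(principal)
    c = quote_ident(catalog)
    s = f"{c}.{quote_ident(schema)}"
    t = f"{s}.{quote_ident(table)}"
    return [
        f"GRANT USE CATALOG ON CATALOG {c} TO {p}",
        f"GRANT USE SCHEMA ON SCHEMA {s} TO {p}",
        f"GRANT SELECT ON TABLE {t} TO {p}",
    ]


# Hypothetical names, for illustration only.
statements = read_only_grants("main", "sales", "orders", "data-analysts")
for stmt in statements:
    print(stmt)  # in a Databricks notebook you might run spark.sql(stmt) instead
```

Granting to a group rather than to individual users keeps the number of grants small as people join or leave a team.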

Task	Description
Manage privileges	Learn about the securable objects that Unity Catalog manages and how to control access to them.
Manage attribute-based access control (ABAC)	Learn how to control access to data using ABAC in Unity Catalog.
Manage identities	Learn how to manage identities in the context of Unity Catalog.
Fine-grained access control	Learn how to control access to table data using row filters and column masks.
Manage access to external storage and data platforms	Learn how to control access to cloud storage, external data platforms, and external non-data services using Unity Catalog.
Manage access from external platforms	Learn how Unity Catalog can manage access to your data from external platforms that use the Apache Iceberg or open-source Unity Catalog APIs.
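
To make the idea of row filters and column masks concrete, here is a small conceptual model in plain Python. It is not how Unity Catalog implements these features, which are evaluated inside the query engine. It only shows the effect: the same table returns different rows and different column values depending on the caller's group membership. The group names, the `region` and `email` columns, and all function names are assumptions made for this sketch.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def row_filter(row: Row, groups: Set[str]) -> bool:
    """Admins see every row; everyone else sees only US rows."""
    return "admins" in groups or row["region"] == "US"


def mask_email(value: str, groups: Set[str]) -> str:
    """Only members of pii-readers see the real email address."""
    return value if "pii-readers" in groups else "***"


def query(rows: List[Row], groups: Set[str]) -> List[Row]:
    """Keep the rows the caller may see, masking the email column in each."""
    return [
        {**row, "email": mask_email(row["email"], groups)}
        for row in rows
        if row_filter(row, groups)
    ]


# Hypothetical table contents.
orders = [
    {"id": "1", "region": "US", "email": "ann@example.com"},
    {"id": "2", "region": "EU", "email": "bo@example.com"},
]

analyst_view = query(orders, {"analysts"})
admin_view = query(orders, {"admins", "pii-readers"})
print(analyst_view)
print(admin_view)
```

Because the filter and mask are attached to the table rather than to each query, users cannot bypass them by writing a different query.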

Data discoverability

Databricks and Unity Catalog provide the following tools to help users find the data they need:

Feature	Description
Catalog Explorer	Browse and search for data and AI assets using asset names and metadata such as comments and tags.
Catalog browsers	Find data and AI assets using browsers that are built into the notebook and SQL query editors. See Navigate the Databricks notebook and file editor and Write queries and explore data in the new SQL editor.
AI-generated comments	Automatically generate documentation of data and AI assets to assist discoverability.
Table insights	Use a UI built into Catalog Explorer to view the most frequent users and queries of any table in Unity Catalog.
Data lineage	Capture and visualize the way data flows through your organization. For feature and model lineage, see Feature governance and lineage.
Entity relationship diagrams (ERD)	Display relationships for tables that have foreign keys defined.
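
Data lineage can be thought of as a directed graph whose edges point from a source table to the table derived from it. The sketch below walks such a graph to list everything upstream of a given table, which is the kind of question a lineage view answers. The edge list and table names are hypothetical. In practice the edges would come from captured lineage metadata rather than a hard-coded list.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def upstream_tables(edges: Iterable[Tuple[str, str]], target: str) -> List[str]:
    """Return every table that feeds `target`, directly or indirectly, sorted by name."""
    # Map each table to the set of tables it reads from.
    parents: Dict[str, Set[str]] = {}
    for source, dest in edges:
        parents.setdefault(dest, set()).add(source)

    # Breadth-first walk from the target back toward its sources.
    seen: Set[str] = set()
    queue = deque([target])
    while queue:
        table = queue.popleft()
        for parent in parents.get(table, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    seen.discard(target)  # a cycle could otherwise list the target as its own ancestor
    return sorted(seen)


# Hypothetical (source, target) lineage edges.
edges = [
    ("main.raw.orders", "main.silver.orders"),
    ("main.raw.customers", "main.silver.customers"),
    ("main.silver.orders", "main.gold.revenue"),
    ("main.silver.customers", "main.gold.revenue"),
    ("main.raw.clicks", "main.silver.clicks"),
]
print(upstream_tables(edges, "main.gold.revenue"))
```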

Data quality monitoring

Tools for ensuring data quality and data integrity are deeply integrated into Delta Lake, Apache Spark, and Databricks. You can learn about them throughout the Databricks documentation.

Unity Catalog adds the following:

Feature	Description
Data quality monitoring	Data quality monitoring helps you ensure the quality of all of your data assets in Unity Catalog. It includes anomaly detection to monitor the data quality of all of the tables in a catalog or schema and data profiling to monitor the statistical properties and quality of the data of an individual table.
Certified and deprecated system tags (Private Preview)	Label securable objects, such as catalogs, schemas, and tables, with indicators of data quality or lifecycle status. These system tags help organizations enforce governance, improve data discoverability, and increase trust in analytics and AI applications.

Data collaboration and sharing

Unity Catalog lets your users collaborate on the same data across all of your account's workspaces in the same region. When you require collaboration across workspace regions, across organizations, and across platforms, Unity Catalog provides the foundation for the following sharing tools.

Feature	Description
Delta Sharing	A secure data sharing platform that lets you share data and AI assets in Databricks with users outside your organization, whether those users use Databricks or not.
Clean Rooms	A Databricks-managed environment where multiple participants on Databricks and non-Databricks platforms can collaborate on projects without sharing underlying data with each other.
Databricks Marketplace	An open forum for exchanging data and AI products. It also provides a private data exchange.

Auditing

Audit logs capture fine-grained details about who accessed a given dataset and the actions that they performed. Unity Catalog adds system tables, the easiest way to access and query your account's audit logs.

See Audit log reference and Monitor account activity with system tables.
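
As a sketch of what querying audit logs through system tables might look like, the helper below builds a SQL query that counts recent Unity Catalog events per user and action. The table name `system.access.audit` and the columns `user_identity.email`, `service_name`, `action_name`, and `event_date` are assumptions to check against the audit log reference for your workspace. The function name and its defaults are hypothetical.

```python
def recent_uc_activity_query(days: int = 7, limit: int = 100) -> str:
    """Build a query that summarizes Unity Catalog audit events by user and action.

    Table and column names are assumptions; verify them against the
    audit log reference before relying on this query.
    """
    if days < 1 or limit < 1:
        raise ValueError("days and limit must be positive integers")
    return (
        "SELECT user_identity.email AS user_email, action_name, COUNT(*) AS events\n"
        "FROM system.access.audit\n"
        "WHERE service_name = 'unityCatalog'\n"
        f"  AND event_date >= date_sub(current_date(), {int(days)})\n"
        "GROUP BY user_identity.email, action_name\n"
        "ORDER BY events DESC\n"
        f"LIMIT {int(limit)}"
    )


audit_sql = recent_uc_activity_query(days=30, limit=10)
print(audit_sql)  # in a Databricks notebook you might run spark.sql(audit_sql)
```

Filtering on a date column first keeps the scan small, which matters because audit tables grow continuously.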

Legacy Databricks data governance tools

Databricks also provides these legacy governance features. Databricks recommends that you use Unity Catalog instead.

Feature	Description
Table access control	A legacy data governance model that lets you programmatically grant and revoke access to objects managed by your workspace’s built-in Hive metastore.
IAM role credential passthrough	A legacy data governance feature that allows users to authenticate automatically to S3 buckets from Databricks clusters using the identity that they use to log in to Databricks.

Next steps

Learn more about Unity Catalog: What is Unity Catalog?
Get started with Unity Catalog: Get started with Unity Catalog
Review best practices: Unity Catalog best practices
