Best practices for data governance

This article covers best practices for data governance, organized by the architectural principles listed in the following sections.

1. Unify data management

Manage metadata for all data assets in one place

As a best practice, run the lakehouse in a single account with one Unity Catalog. The top-level container of objects in Unity Catalog is a metastore. It stores data assets (such as tables and views) and the permissions that govern access to them. Use a single metastore per cloud region, and do not access metastores across regions, to avoid latency issues.

The metastore provides a three-level namespace (catalog.schema.table) for organizing data assets.

Databricks recommends using catalogs to provide segregation across your organization’s information architecture. Often this means that catalogs correspond to a software development environment scope, a team, or a business unit.
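As a minimal sketch of this layout, the following SQL creates a catalog per environment and a schema per team, then addresses a table with the full three-level name (all object names here are hypothetical examples, not prescribed conventions):

```sql
-- One catalog per environment, one schema per team or domain (example names).
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.sales;

-- Objects are always addressable with the three-level namespace catalog.schema.object:
SELECT * FROM prod.sales.orders;
```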

Track data lineage to drive visibility of the data

Data lineage is a powerful tool that helps data leaders drive greater visibility and understanding of the data in their organizations. It describes the transformation and refinement of data from source to insight. Lineage includes the capture of all relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, what other data sets use it, and many other events and attributes. Data lineage can be used for many data-related use cases:

  • Compliance and audit readiness: Data lineage helps organizations trace the source of tables and fields. This is important for meeting the requirements of many compliance regulations, such as General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Basel Committee on Banking Supervision (BCBS) 239, and Sarbanes-Oxley Act (SOX).

  • Impact analysis/change management: Data goes through multiple transformations from the source to the final business-ready table. Understanding the potential impact of data changes on downstream users becomes important from a risk-management perspective. This impact can be easily determined using the data lineage collected by Unity Catalog.

  • Data quality assurance: Understanding where a data set came from and what transformations have been applied provides much better context for data scientists and analysts, enabling them to gain better and more accurate insights.

  • Debugging and diagnostics: In the event of an unexpected result, data lineage helps data teams perform root cause analysis by tracing the error back to its source. This dramatically reduces debugging time.

Unity Catalog captures runtime data lineage across queries run on Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in Data Explorer in near real-time and retrieved with the Databricks Data Lineage REST API.

2. Unify data security

Centralize access control

The Databricks Lakehouse Platform provides methods for data access control: mechanisms that describe which groups or individuals can access which data. These policy statements can be extremely granular and specific, down to defining exactly which records each individual can access, or broad and expressive, such as letting all finance users see all financial data.

Unity Catalog centralizes access controls for files, tables, and views. Each securable object in Unity Catalog has an owner. An object’s owner has all privileges on the object, as well as the permission to grant privileges on the securable object to other principals. Unity Catalog lets you manage privileges and configure access control by using SQL DDL statements.
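For example, granting a group read access to a table with SQL DDL might look like the following (the catalog, schema, table, and group names are hypothetical):

```sql
-- A principal needs USE privileges on the containing catalog and schema,
-- plus SELECT on the table itself, to query it.
GRANT USE CATALOG ON CATALOG prod TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `data-analysts`;
GRANT SELECT ON TABLE prod.sales.orders TO `data-analysts`;

-- Review and revoke privileges the same way:
SHOW GRANTS ON TABLE prod.sales.orders;
REVOKE SELECT ON TABLE prod.sales.orders FROM `data-analysts`;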

Unity Catalog uses dynamic views for fine-grained access controls so that you can restrict access to rows and columns to the users and groups who are authorized to query them. See Create a dynamic view.
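A dynamic view can combine column masking and row filtering in one definition. The following sketch (with hypothetical table, column, and group names) redacts a column for everyone outside an auditors group and limits non-managers to rows for a single region:

```sql
CREATE VIEW prod.sales.customers_redacted AS
SELECT
  customer_id,
  -- Column masking: only auditors see raw email addresses.
  CASE WHEN is_account_group_member('auditors') THEN email
       ELSE 'REDACTED' END AS email,
  region
FROM prod.sales.customers
-- Row filtering: managers see all rows, everyone else only the US region.
WHERE CASE WHEN is_account_group_member('managers') THEN TRUE
           ELSE region = 'US' END;
```

Grant users SELECT on the view rather than on the underlying table, so the masking and filtering cannot be bypassed.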

For further information see Security, compliance & privacy - Manage identity and access using least privilege.

Configure audit logging

Databricks provides access to audit logs of activities performed by Databricks users, allowing your enterprise to monitor detailed Databricks usage patterns. There are two types of logs: workspace-level audit logs, which capture workspace-level events, and account-level audit logs, which capture account-level events.

Audit Unity Catalog events

Unity Catalog captures an audit log of actions performed against the metastore. This enables admins to access fine-grained details about who accessed a given dataset and what actions they performed.
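Assuming system tables are enabled for the account, these audit records can be queried directly from the `system.access.audit` table. A sketch of a typical question, "who did what against the metastore in the last week":

```sql
-- Recent Unity Catalog actions, newest first.
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC;
```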

Audit data sharing events

For secure sharing with Delta Sharing, Databricks provides audit logs to monitor Delta Sharing events, including:

  • When someone creates, modifies, or deletes a share or a recipient.

  • When a recipient accesses an activation link and downloads the credential.

  • When a recipient accesses shares or data in shared tables.

  • When a recipient’s credential is rotated or expires.
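If system tables are enabled, these events can also be filtered out of the audit log. The sketch below assumes that Delta Sharing action names share a common `deltaSharing` prefix in `system.access.audit`; verify the exact action names captured in your account before relying on this filter:

```sql
-- Delta Sharing activity, e.g. share access by recipients (prefix is an assumption).
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE action_name LIKE 'deltaSharing%'
ORDER BY event_time DESC;
```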

3. Manage data quality

The Databricks Lakehouse Platform provides robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.

See Reliability - Manage data quality.

4. Share data securely and in real-time

Use the open Delta Sharing protocol for sharing data with partners

Delta Sharing provides an open solution for securely sharing live data from your lakehouse to any computing platform. Recipients do not need to be on the Databricks platform, on the same cloud, or on any cloud at all. Delta Sharing is natively integrated with Unity Catalog, enabling organizations to centrally manage and audit shared data across the enterprise and confidently share data assets while meeting security and compliance requirements.

Data providers can share live data from where it resides in their cloud storage without replicating or moving it to another system. This approach reduces the operational costs of data sharing because data providers don’t have to replicate data multiple times across clouds, geographies, or data platforms to each of their data consumers.
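On the provider side, a share is a named collection of tables that recipients are granted access to. A minimal sketch (the share, table, and recipient names are hypothetical):

```sql
-- Create a share and add a live table to it -- no data is copied or moved.
CREATE SHARE sales_share COMMENT 'Live sales data for external partners';
ALTER SHARE sales_share ADD TABLE prod.sales.orders;

-- For an open (non-Databricks) recipient, create the recipient and grant access;
-- the recipient then downloads a credential via an activation link.
CREATE RECIPIENT partner_co;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co;
```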

Use Databricks-to-Databricks Delta Sharing between Databricks users

If you want to share data with users who don’t have access to your Unity Catalog metastore, you can use Databricks-to-Databricks Delta Sharing, as long as the recipients have access to a Databricks workspace that is enabled for Unity Catalog. Databricks-to-Databricks sharing lets you share data with users in other Databricks accounts, across cloud regions, and across cloud providers. It’s a great way to securely share data across different Unity Catalog metastores in your own Databricks account.
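In the Databricks-to-Databricks flow, the recipient is identified by the sharing identifier of their Unity Catalog metastore instead of an activation link, and the recipient mounts the share as a catalog. A sketch with hypothetical names and a placeholder identifier:

```sql
-- Provider side: key the recipient to the consumer's metastore sharing identifier.
CREATE RECIPIENT partner_dbx USING ID '<sharing-identifier>';
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_dbx;

-- Consumer side: mount the share as a read-only catalog in their own metastore.
CREATE CATALOG sales_shared USING SHARE provider_name.sales_share;
```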