This article describes the need for data governance and shares best practices and strategies you can use to implement these techniques across your organization.
Data governance is the oversight that ensures data brings value and supports your business strategy. Data governance encapsulates the policies and practices implemented to securely manage the data assets within an organization. As the amount and complexity of data grow, more and more organizations look to data governance to ensure core business outcomes:
Consistent and high data quality as a foundation for analytics and machine learning.
Reduced time to insight.
Data democratization, that is, enabling everyone in an organization to make data-driven decisions.
Support for risk and compliance for industry regulations such as HIPAA, FedRAMP, GDPR, or CCPA.
Cost optimization, for example by preventing users from starting up large clusters and by creating guardrails around the use of expensive GPU instances.
Data-driven companies typically build their data architectures for analytics on the lakehouse. A data lakehouse is an architecture that enables efficient and secure data engineering, machine learning, data warehousing, and business intelligence directly on vast amounts of data stored in data lakes. Data governance for a data lakehouse provides the following key capabilities:
Unified catalog: A unified catalog stores all your data, ML models, and analytics artifacts, in addition to metadata for each data object. The unified catalog also blends in data from other catalogs such as an existing Hive metastore.
Unified data access controls: A single, unified permissions model across all data assets and all clouds. This includes attribute-based access control (ABAC) for personally identifiable information (PII).
Data auditing: Data access is centrally audited with alerts and monitoring capabilities to promote accountability.
Data quality management: Robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.
Data lineage: Data lineage to get end-to-end visibility into how data flows through the lakehouse from source to consumption.
Data discovery: Easy data discovery to enable data scientists, data analysts, and data engineers to quickly discover and reference relevant data and accelerate time to value.
Data sharing: Data can be shared across clouds and platforms.
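The attribute-based access control idea above can be illustrated with a small sketch. This is a toy model only, not a Databricks API: the column tags, the `pii_reader` attribute, and the masking helper are all assumptions made for this example.

```python
# Illustrative sketch of ABAC for PII (not a Databricks API):
# columns tagged as PII are masked unless the requesting user
# carries the "pii_reader" attribute.

PII_TAGS = {"email", "ssn"}  # assumed column tags for this sketch

def mask_row(row: dict, user_attributes: set) -> dict:
    """Return a copy of the row, masking PII columns for users
    without the pii_reader attribute."""
    if "pii_reader" in user_attributes:
        return dict(row)
    return {col: ("***MASKED***" if col in PII_TAGS else val)
            for col, val in row.items()}

row = {"name": "Ada", "email": "ada@example.com"}
print(mask_row(row, user_attributes=set()))
print(mask_row(row, user_attributes={"pii_reader"}))
```

In a real lakehouse, this policy would be enforced centrally by the governance layer rather than in application code, so the same rule applies no matter which engine reads the data.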
Databricks provides centralized governance for data and AI with Unity Catalog and Delta Sharing.
Unity Catalog is a fine-grained governance solution for data and AI on the Databricks Lakehouse. It helps simplify security and governance of your data by providing a central place to administer and audit data access.
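To make the "central place to administer and audit data access" concrete, here is a toy model, not the Unity Catalog API: it mimics a central catalog that both enforces grants on three-level names (`catalog.schema.table`) and records every access check. The class, principals, and table names are assumptions for this sketch; in Databricks itself, privileges are managed with SQL statements such as `GRANT SELECT ON TABLE main.sales.orders TO analysts`.

```python
# Toy model of centralized grants plus auditing (illustration only,
# not the real Unity Catalog implementation).

class CentralGovernance:
    def __init__(self):
        self.grants = set()   # (principal, privilege, securable)
        self.audit_log = []   # every access check is recorded

    def grant(self, principal, privilege, securable):
        self.grants.add((principal, privilege, securable))

    def check(self, principal, privilege, securable):
        allowed = (principal, privilege, securable) in self.grants
        # Auditing happens centrally: every check is logged,
        # whether it was allowed or denied.
        self.audit_log.append((principal, privilege, securable, allowed))
        return allowed

gov = CentralGovernance()
gov.grant("analysts", "SELECT", "main.sales.orders")
print(gov.check("analysts", "SELECT", "main.sales.orders"))  # True
print(gov.check("interns", "SELECT", "main.sales.orders"))   # False
```

The point of the design is that both the permission decision and the audit record come from one place, rather than being re-implemented per engine or per workspace.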
Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations, or with other teams within your organization, regardless of which computing platforms they use.
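As a sketch of what consuming a share looks like, the open-source Python connector (`pip install delta-sharing`) addresses a shared table as `<profile-file>#<share>.<schema>.<table>`. The profile path and the share, schema, and table names below are placeholders; the actual load call is shown in a comment because it requires a provider-supplied profile file and credentials.

```python
# Sketch of addressing a Delta Sharing table with the open-source
# Python connector. All names below are placeholders.
profile_path = "config.share"  # profile file supplied by the data provider
table_url = f"{profile_path}#my_share.my_schema.my_table"
print(table_url)

# With the connector installed and a valid profile, the shared table
# can be loaded as a pandas DataFrame:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(table_url)
```

Because Delta Sharing is an open protocol, the recipient does not need to run Databricks; any platform with a connector can read the shared data.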
For best practices on adopting Unity Catalog and Delta Sharing, see Unity Catalog best practices.
Every good data governance story starts with a strong identity foundation. To learn how to best configure identity in Databricks, see Identity best practices.
Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs:
Get started using Unity Catalog: learn how to set up and begin using Unity Catalog.
Share data securely using Delta Sharing: learn how to share data securely with other organizations.
The Databricks Security and Trust Center, which provides information about the ways in which security is built into every layer of the Databricks Lakehouse Platform.
Secret management, for information on how to use Databricks secrets to store your credentials and reference them in notebooks and jobs. You should never hard-code secrets or store them in plain text.
Table access control (legacy) lets you apply data governance controls to your data.
IAM role credential passthrough (legacy) allows users to authenticate automatically to S3 buckets from Databricks clusters using the identity that they use to log in to Databricks.
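The guidance above about never hard-coding secrets can be sketched as follows. In a Databricks notebook you would read credentials from a secret scope, for example `dbutils.secrets.get(scope="my-scope", key="api-token")`; outside Databricks, an environment variable plays the same role in this sketch. The variable name `API_TOKEN` and its dummy value are assumptions made for illustration.

```python
import os

# Sketch only: read a credential from the environment instead of
# hard-coding it. In a Databricks notebook, prefer a secret scope:
#   token = dbutils.secrets.get(scope="my-scope", key="api-token")
os.environ.setdefault("API_TOKEN", "dummy-value-for-illustration")

def get_api_token() -> str:
    """Fetch the credential at runtime; fail loudly if it is missing."""
    token = os.environ.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is not set")
    return token

print(get_api_token())
```

Failing loudly when the credential is absent is deliberate: a missing secret should stop the job early rather than let it proceed with an empty or hard-coded fallback value.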