Unity Catalog best practices

This document provides an opinionated perspective on how to best adopt Databricks Unity Catalog and Delta Sharing to meet your data governance needs.

Configure a Unity Catalog metastore

Unity Catalog is a fine-grained governance solution for data and AI on the Databricks Lakehouse. It helps simplify security and governance of your data by providing a central place to administer and audit data access. The following diagram illustrates the main securable objects in Unity Catalog:

Unity Catalog object model diagram

A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. Databricks account admins can create metastores and assign them to Databricks workspaces to control which workloads use each metastore. You create a single metastore in each region you operate and link it to all workspaces in that region. Therefore, if you have multiple regions using Databricks, you will have multiple metastores. To share data between metastores, see Delta Sharing.

Each metastore is configured with a root storage location that can be used for managed tables. You need to ensure that no users have direct access to this managed storage location. Giving access to this storage location could allow a user to bypass access controls in a Unity Catalog metastore and disrupt auditability. For these reasons, you should not reuse a bucket that is your current DBFS root file system or has previously been a DBFS root file system for the root storage location in your Unity Catalog metastore.

See Create a Unity Catalog metastore.

External locations and storage credentials

External locations and storage credentials allow Unity Catalog to read and write data on your cloud tenant on behalf of users.

A storage credential encapsulates a long-term cloud credential that provides access to cloud storage. For example, in AWS you can configure an IAM role to access S3 buckets.

An external location is an object that combines a cloud storage path with a storage credential in order to authorize access to the cloud storage path.

Important

External locations, by combining storage credentials and storage paths, provide strong control and auditability of storage access. To prevent users from bypassing the access control provided by Unity Catalog, you should ensure that you limit the number of users with direct access to any bucket that is being used as an external location. For the same reason, you should not mount storage accounts to DBFS if they are also being used as external locations. Databricks recommends that you migrate mounts on cloud storage locations to external locations in Unity Catalog using Data Explorer.

See Manage external locations and storage credentials.

Organize your data

Databricks recommends using catalogs to provide segregation across your organization’s information architecture. Often this means that catalogs can correspond to software development environment scope, team, or business unit.

Unity Catalog catalogs

A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace and organizes tables and views. A table can be managed or external.

Managed tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Databricks to manipulate files in these tables directly. By default, managed tables are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed table storage locations at the catalog or schema levels, overriding the root storage location. Managed tables always use the Delta table format.

External tables are tables whose data lifecycle and file layout are not managed by Unity Catalog. Use external tables to register large amounts of existing data in Unity Catalog, or if you require direct access to the data using tools outside of Databricks clusters or Databricks SQL warehouses. External tables support Delta Lake and many other data formats, including Parquet, JSON, and CSV.

For more information about creating tables, see Create tables.

Manage external locations and external tables

The diagram below represents the filesystem hierarchy of a single cloud storage bucket.

External locations

There are four external locations created and one storage credential used by them all. Users and groups can be granted access to the different storage locations within a Unity Catalog metastore. This allows you to provide specific groups access to different part of the cloud storage bucket.

You can create external tables using a storage location in a Unity Catalog metastore. Those external tables can then be secured independently. Databricks recommends that you create external tables from one storage location within one schema.

Databricks strongly recommends against registering common tables as external tables in more than one metastore due to the risk of consistency issues. For example, a change to the schema in one metastore will not register in the second metastore. Use Delta Sharing for sharing data between metastores. See Delta Sharing.

Configure access control

Each securable object in Unity Catalog has an owner. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges on the securable object to other principals. Only owners of a securable object have the permission to grant privileges on that object to other principals. Therefore, it is best practice to configure ownership on all objects to the group responsible for administration of grants on the object. Both the owner and metastore admins can transfer ownership of a securable object to a group. Additionally, if the object is contained within a catalog (like a table or view), the catalog and schema owner can change the ownership of the object.

Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For more information, see Inheritance model.

In order to read data from a table or view a user must have the following privileges:

  • SELECT on the table or view

  • USE SCHEMA on the schema that owns the table

  • USE CATALOG on the catalog that owns the schema

USE CATALOG enables the grantee to traverse the catalog in order to access its child objects and USE SCHEMAenables the grantee to traverse the schema in order to access its child objects. For example, to select data from a table, users need to have the SELECT privilege on that table and the USE CATALOG privilege on its parent catalog, along with the USE SCHEMA privilege on its parent schema. Therefore, you can use this privilege to restrict access to sections of your data namespace to specific groups. A common scenario is to set up a schema per team where only that team has USE SCHEMA and CREATE on the schema. This means that any tables produced by team members can only be shared within the team.

You can secure access to a table using the following SQL syntax:

GRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;
GRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >
TO < group_name >;
GRANT
SELECT
  ON < catalog_name >.< schema_name >.< table_name >;
TO < group_name >;

You can secure access to columns using a dynamic view in a secondary schema as shown in the following SQL syntax:

CREATE VIEW < catalog_name >.< schema_name >.< view_name > as
SELECT
  id,
  CASE WHEN is_member(< group_name >) THEN email ELSE 'REDACTED' END AS email,
  country,
  product,
  total
FROM
  < catalog_name >.< schema_name >.< table_name >;
GRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;
GRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >.< view_name >;
TO < group_name >;
GRANT
SELECT
  ON < catalog_name >.< schema_name >.< view_name >;
TO < group_name >;

You can secure access to rows using a dynamic view in a secondary schema as shown in the following SQL syntax:

CREATE VIEW < catalog_name >.< schema_name >.< view_name > as
SELECT
  *
FROM
  < catalog_name >.< schema_name >.< table_name >
WHERE
  CASE WHEN is_member(managers) THEN TRUE ELSE total <= 1000000 END;
GRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;
GRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >.< table_name >;
TO < group_name >;
GRANT
SELECT
  ON < catalog_name >.< schema_name >.< table_name >;
TO < group_name >;

See Manage privileges in Unity Catalog.

Manage cluster configurations

Databricks recommends using cluster policies to limit the ability to configure clusters based on a set of rules. Cluster policies let you restrict access to only create clusters which are Unity Catalog-enabled. Using cluster policies reduces available choices, which will greatly simplify the cluster creation process for users and ensure that they are able to access data seamlessly. Cluster policies also enable you to control cost by limiting per cluster maximum cost.

To ensure the integrity of access controls and enforce strong isolation guarantees, Unity Catalog imposes security requirements on compute resources. For this reason, Unity Catalog introduces the concept of a cluster’s access mode. Unity Catalog is secure by default; if a cluster is not configured with an appropriate access mode, the cluster can’t access data in Unity Catalog. See Cluster access modes for Unity Catalog.

Databricks recommends using the User Isolation access mode when sharing a cluster and the Single User access mode for automated jobs and machine learning workloads.

The JSON below provides a policy definition for a shared cluster with the User Isolation security mode:

{
"spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9]*\\.x-scala.*",
    "defaultValue": "10.4.x-scala2.12"
},
"access_mode": {
    "type": "fixed",
    "value": "USER_ISOLATION",
    "hidden": true
}
}

The JSON below provides a policy definition for an automated job cluster with the Single User security mode:

{
"spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9].*",
    "defaultValue": "10.4.x-scala2.12"
},
"access_mode": {
    "type": "fixed",
    "value": "SINGLE_USER",
    "hidden": true
},
"single_user_name": {
    "type": "regex",
    "pattern": ".*",
    "hidden": true
}
}

Audit access

A complete data governance solution requires auditing access to data and providing alerting and monitoring capabilities. Unity Catalog captures an audit log of actions performed against the metastore and these logs are delivered as part of Databricks audit logs.

Make sure you configure audit logging. This involves configuring the right access policy so that Databricks can deliver audit logs to an S3 bucket that you provide. Audit logs are typically logged within 15 minutes. Databricks recommends sharing the same audit log configuration for all workspaces in the account.

See Monitoring Your Databricks Lakehouse Platform with Audit Logs for details on how to get complete visibility into critical events relating to your Databricks Lakehouse Platform.

Delta Sharing

Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations or other departments within your organization, regardless of which computing platforms they use. When Delta Sharing is enabled on a metastore, Unity Catalog runs a Delta Sharing server.

To share data between metastores, you can leverage Databricks-to-Databricks Delta Sharing. This allows you to register tables from metastores in different regions. These tables will appear as read-only objects in the consuming metastore. These tables can be granted access like any other object within Unity Catalog.

When you use Databricks-to-Databricks Delta Sharing to share between metastores, keep in mind that access control is limited to one metastore. If a securable object, like a table, has grants on it and that resource is shared to an intra-account metastore, then the grants from the source will not apply to the destination share. The destination share will have to set its own grants.