Get started using Unity Catalog
This article provides step-by-step instructions for setting up Unity Catalog for your organization. It describes how to enable your Databricks account to use Unity Catalog and how to create your first tables in Unity Catalog.
Overview of Unity Catalog setup
This section provides a high-level overview of how to set up your Databricks account to use Unity Catalog and create your first tables. For detailed step-by-step instructions, see the sections that follow this one.
Set up your Databricks account for Unity Catalog
To enable your Databricks account to use Unity Catalog, you do the following:
Configure an S3 bucket and IAM role that Unity Catalog can use to store and access managed table data in your AWS account.
Create a metastore for each region in which your organization operates. This metastore functions as the top-level container for all of your data in Unity Catalog.
Assign workspaces to the metastore. Each workspace has the same view of the data that you manage in Unity Catalog.
Add users, groups, and service principals to your Databricks account.
For existing Databricks accounts, these identities are already present.
(Optional) Transfer your metastore admin role to a group.
Set up data access for your users
To set up data access for your users, you do the following:
In a workspace, create at least one compute resource: either a cluster or SQL warehouse.
You will use this compute resource when you run queries and commands, including grant statements on data objects that are secured in Unity Catalog.
Create at least one catalog.
Catalogs hold the schemas (databases) that in turn hold the tables that your users work with.
Create at least one schema.
Create tables.
For each level in the data hierarchy (catalogs, schemas, tables), you grant privileges to users, groups, or service principals. You can also grant row- or column-level privileges using dynamic views.
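As a sketch of what those grants look like in SQL (the object and group names here are examples, not objects this guide has created yet):

-- Catalog, schema, and table privileges, granted top-down.
GRANT USE CATALOG ON CATALOG main TO `data-consumers`;
GRANT USE SCHEMA ON SCHEMA main.default TO `data-consumers`;
GRANT SELECT ON TABLE main.default.department TO `data-consumers`;

-- Row-level security via a dynamic view: members of the example
-- group see all rows; everyone else sees only one location.
CREATE VIEW main.default.department_restricted AS
SELECT * FROM main.default.department
WHERE is_account_group_member('data-consumers') OR location = 'EDINBURGH';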
Requirements
You must be a Databricks account admin.
Your Databricks account must be on the Premium plan or above.
In AWS, you must have the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships.
You must have at least one workspace that you want to use with Unity Catalog. See Create a workspace using the account console.
Configure a storage bucket and IAM role in AWS
In this step, you create the AWS objects required by Unity Catalog to store and access managed table data in your AWS account.
Find your Databricks account ID.
Log in to the Databricks account console.
Click your username.
From the menu, copy the Account ID value.
In AWS, create an S3 bucket.
This S3 bucket will be the root storage location for managed tables in Unity Catalog. Use a dedicated S3 bucket for each metastore and locate it in the same region as the workspaces you want to access the data from. Make a note of the S3 bucket path, which starts with s3://. This default storage location can be overridden at the catalog and schema levels.
Important
The bucket name cannot include dot notation (for example, incorrect.bucket.name.notation). For more bucket naming guidance, see the AWS bucket naming rules.
If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.
Create an IAM role that will allow access to the S3 bucket.
Role creation is a two-step process. In this step you simply create the role, adding a temporary trust relationship policy that you then modify in the next step. You must modify the trust policy after you create the role because your role must be self-assuming—that is, it must be configured to trust itself. The role must therefore exist before you add the self-assumption statement. For information about self-assuming roles, see this Amazon blog article.
Create the IAM role with a Custom Trust Policy.
In the Custom Trust Policy field, paste the following policy JSON, replacing <DATABRICKS-ACCOUNT-ID> with the Databricks account ID you found in step 1 (not your AWS account ID).

This policy establishes a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. This is specified by the ARN in the Principal section. It is a static value that references a role created by Databricks. Do not modify it.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": [
        "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
      ]
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "<DATABRICKS-ACCOUNT-ID>"
      }
    }
  }]
}
Skip the permissions policy configuration. You’ll go back to add that in a later step.
Save the IAM role.
Modify the trust relationship policy to make it “self-assuming.”
Return to your saved IAM role and go to the Trust Relationships tab.
Edit the trust relationship policy, adding the following ARN to the “Allow” statement.
Replace <YOUR-AWS-ACCOUNT-ID> and <THIS-ROLE-NAME> with your actual IAM role values.

"arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"

Your policy should now look like this (with replacement text updated to use your Databricks account ID and IAM role values):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
          "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<DATABRICKS-ACCOUNT-ID>"
        }
      }
    }
  ]
}
In AWS, create an IAM policy in the same AWS account as the S3 bucket.
To avoid unexpected issues, you must use the following sample policy, replacing the following values:
<BUCKET>: The name of the S3 bucket you created in the previous step.
<KMS-KEY>: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. If encryption is disabled, remove the entire KMS section of the IAM policy.
<AWS-ACCOUNT-ID>: The account ID of the current AWS account (not your Databricks account).
<AWS-IAM-ROLE-NAME>: The name of the AWS IAM role that you created in the previous step.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetLifecycleConfiguration",
        "s3:PutLifecycleConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET>/*",
        "arn:aws:s3:::<BUCKET>"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey*"
      ],
      "Resource": [
        "arn:aws:kms:<KMS-KEY>"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<AWS-IAM-ROLE-NAME>"
      ],
      "Effect": "Allow"
    }
  ]
}
Note
If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.
Attach the IAM policy to the IAM role.
On the IAM role’s Permissions tab, attach the IAM policy you just created.
Create your first metastore and attach a workspace
To use Unity Catalog, you must create a metastore. A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized.
You create a metastore for each region in which your organization operates. You can link each of these regional metastores to any number of workspaces in that region. Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces. You can access data in other metastores using Delta Sharing.
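As a sketch of what the three-level namespace means in practice, you can address any table with a fully qualified name, or set session defaults first. The example below uses the main catalog and default schema that are created automatically for every metastore; the department table is one you create later in this article:

-- Fully qualified: catalog.schema.table
SELECT * FROM main.default.department;

-- Or set the session defaults first
USE CATALOG main;
USE SCHEMA default;
SELECT * FROM department;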
The metastore will use the S3 bucket and IAM role that you created in the previous step.
To create a metastore:
Log in to the Databricks account console.
Click Data.
Click Create Metastore.
Enter the following:
A name for the metastore.
The region where you want to deploy the metastore.
This must be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the storage bucket you created earlier.
The S3 bucket path (you can omit s3://) and IAM role name for the bucket and role you created in Configure a storage bucket and IAM role in AWS.
Click Create.
When prompted, select workspaces to link to the metastore.
To learn how to assign workspaces to metastores, see Enable a workspace for Unity Catalog.
(Recommended) Transfer the metastore admin role to a group.
The user who creates a metastore is its owner, also called the metastore admin. The metastore admin can create top-level objects in the metastore such as catalogs and can manage access to tables and other objects. Databricks recommends that you reassign the metastore admin role to a group. See (Recommended) Transfer ownership of your metastore to a group.
Add users and groups
A Unity Catalog metastore can be shared across multiple Databricks workspaces. Unity Catalog takes advantage of Databricks account-level identity management to provide a consistent view of users, service principals, and groups across all workspaces. In this step, you create users and groups in the account console and then choose the workspaces these identities can access.
Note
If you have an existing account and workspaces, you probably already have existing users and groups in your account, so you can skip the user and group creation steps.
If you have a large number of users or groups in your account, or if you prefer to manage identities outside of Databricks, you can sync users and groups from your identity provider (IdP).
To add a user and group using the account console:
Log in to the account console (requires the user to be an account admin).
Click User management.
Add a user:
Click Users.
Click Add User.
Enter a name and email address for the user.
Click Add user.
Add a group:
Click Groups.
Click Add Group.
Enter a name for the group.
Click Confirm.
When prompted, add users to the group.
Add a user or group to a workspace, where they can perform data science, data engineering, and data analysis tasks using the data managed by Unity Catalog:
In the sidebar, click Workspaces and select a workspace.
On the Permissions tab, click Add permissions.
Search for and select the user or group, assign the permission level (workspace User or Admin), and click Save.
To get started, create a group called data-consumers. This group is used later in this walkthrough.
Create a cluster or SQL warehouse
Before you can start creating tables and assigning permissions, you need to create a compute resource to run your table-creation and permission-assignment workloads.
Tables defined in Unity Catalog are protected by fine-grained access controls. To ensure that access controls are enforced, Unity Catalog requires compute resources to conform to a secure configuration. Non-conforming compute resources cannot access tables in Unity Catalog.
Databricks provides two kinds of compute resources:
Clusters, which are used for executing commands in Databricks notebooks and running jobs.
SQL warehouses, which are used for executing queries in SQL Editor.
You can use either of these compute resources to work with Unity Catalog.
Create a cluster
To create a cluster that can access Unity Catalog:
Log in to your workspace as a workspace admin or user with permission to create clusters.
Click Compute.
Click Create compute.
Enter a name for the cluster.
Set the Access mode to Shared.
Only Single User and Shared access modes support Unity Catalog. See What is cluster access mode?.
Set Databricks runtime version to Runtime: 11.3 LTS (Scala 2.12, Spark 3.3.0) or higher.
Click Create Cluster.
For specific configuration options, see Create a cluster.
Create a SQL warehouse
SQL warehouses support Unity Catalog by default, and there is no special configuration required.
To create a SQL warehouse:
Log in to your workspace as a workspace admin or user with permission to create SQL warehouses.
In the sidebar, click New > SQL Warehouse.
For specific configuration options, see Configure SQL warehouses.
Create your first table and manage permissions
Unity Catalog enables you to define access to tables declaratively using SQL or the Catalog Explorer UI. It is designed to follow a “define once, secure everywhere” approach, meaning that access rules are honored by all Databricks workspaces, clusters, and SQL warehouses in your account, as long as the workspaces share the same metastore.
In this example, you’ll run a notebook that creates a table named department in the main catalog and default schema (database). This catalog and schema are created automatically for all metastores.
You can also try running an example notebook that performs the same tasks.
Permissions required: USE CATALOG permission. All users have the USE CATALOG permission on the main catalog by default. No other permissions are required to complete this example apart from those that you grant as you run it.
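If you want to confirm your access before you start, one optional quick check is to list the grants on the main catalog from a notebook or the SQL editor:

SHOW GRANTS ON CATALOG main;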
Create a notebook and attach it to the cluster you created in Create a cluster or SQL warehouse.
Select SQL as your notebook language.

Add the following commands to the notebook and run them:

GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO `<user>@<domain>.com`;

Replace <user>@<domain>.com with your Databricks username. You must enclose the username in backticks (`).

CREATE TABLE IF NOT EXISTS main.default.department
(
  deptcode INT,
  deptname STRING,
  location STRING
);
INSERT INTO main.default.department VALUES
  (10, 'FINANCE', 'EDINBURGH'),
  (20, 'SOFTWARE', 'PADDINGTON');
You now have a table in Unity Catalog.
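To verify, you can query the new table from the same notebook:

SELECT * FROM main.default.department;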
Find the new table in Catalog Explorer.
In the sidebar, click Catalog, then use the schema browser (or search) to find the main catalog and the default schema, where you’ll find the department table.

Notice that you don’t need a running cluster or SQL warehouse to browse data in Catalog Explorer.
Grant permissions on the table.
As the original table creator, you’re the table owner, and you can grant other users permission to read or write to the table. You can even transfer ownership, but we won’t do that here.
On the table page in Catalog Explorer, go to the Permissions tab and click Grant.
On the Grant on dialog:

Select the users and groups you want to give permission to. In this example, we use a group called data-consumers.

Select the privileges you want to grant. For this example, assign the SELECT privilege and click Grant.
For more information about the Unity Catalog privileges and permissions model, see Manage privileges in Unity Catalog.
You can also grant those permissions using the following SQL statement in a Databricks notebook or the Databricks SQL query editor:
GRANT SELECT ON main.default.department TO `data-consumers`;
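If you later need to remove that access, the counterpart statement is REVOKE, for example:

REVOKE SELECT ON main.default.department FROM `data-consumers`;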
Run one of the example notebooks that follow for a more detailed walkthrough that includes catalog and schema creation, a summary of available privileges, a sample query, and more.
Example notebooks: Create your first table and manage permissions
You can use the following example notebooks to create a catalog, schema, and table, as well as manage permissions on each.
(Optional) Link the metastore to additional workspaces
A key benefit of Unity Catalog is the ability to share a single metastore among multiple workspaces that are located in the same region. You can run different types of workloads against the same data without moving or copying data among workspaces. Each workspace can have only one Unity Catalog metastore assigned to it.
To learn how to link the metastore to additional workspaces, see Enable a workspace for Unity Catalog.
(Recommended) Sync account-level identities from your IdP
You can manage user access to Databricks by setting up provisioning from a third-party identity provider (IdP), like Okta. For complete instructions, see Sync users and groups from your identity provider.
(Optional) Install the Unity Catalog CLI
The Unity Catalog CLI is experimental, but it can be a convenient way to manage Unity Catalog from the command line. It is part of the Databricks CLI. To use the Unity Catalog CLI, do the following:
Optionally, create one or more connection profiles to use with the CLI.
Learn how to use the Databricks CLI in general.
Begin using the Unity Catalog CLI (legacy).
Next steps
Learn more about Unity Catalog: What is Unity Catalog?