Get started using Unity Catalog

This article provides step-by-step instructions for setting up Unity Catalog for your organization. It describes how to enable your Databricks account to use Unity Catalog and how to create your first tables in Unity Catalog.

Overview of Unity Catalog setup

This section provides a high-level overview of how to set up your Databricks account to use Unity Catalog and create your first tables. For detailed step-by-step instructions, see the sections that follow this one.

To enable your Databricks account to use Unity Catalog, you do the following:

  1. Configure an S3 bucket and IAM role that Unity Catalog can use to store and access data in your AWS account.

  2. Create a metastore for each region in which your organization operates. This metastore functions as the top-level container for all of your data in Unity Catalog.

    As the creator of the metastore, you are its owner and metastore admin.

  3. Attach workspaces to the metastore. Each workspace will have the same view of the data that you manage in Unity Catalog.

  4. Add users, groups, and service principals to your Databricks account.

    For existing Databricks accounts, these identities are already present.

  5. (Optional) Transfer your metastore admin role to a group.

To set up data access for your users, you do the following:

  1. In a workspace, create at least one compute resource: either a cluster or SQL warehouse.

    You will use this compute resource when you run queries and commands, including grant statements on data objects that are secured in Unity Catalog.

  2. Create at least one catalog.

    Catalogs hold the schemas (databases) that in turn hold the tables that your users work with.

  3. Create at least one schema.

  4. Create tables.

For each level in the data hierarchy (catalogs, schemas, tables), you grant privileges to users, groups, or service principals. You can also grant row- or column-level privileges using dynamic views.
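
For example, a metastore admin or object owner could combine object-level grants with a dynamic view along the following lines. This is a minimal SQL sketch: the data-consumers group and the main.default.department table are created later in this article, while the admins group name and the filter condition are purely illustrative.

    -- Object-level grants down the hierarchy: catalog, schema, table.
    GRANT USE CATALOG ON CATALOG main TO `data-consumers`;
    GRANT USE SCHEMA ON SCHEMA main.default TO `data-consumers`;
    GRANT SELECT ON main.default.department TO `data-consumers`;

    -- Row-level filtering with a dynamic view: members of the admins group
    -- see every row; everyone else sees only the EDINBURGH rows.
    CREATE VIEW main.default.department_filtered AS
    SELECT * FROM main.default.department
    WHERE CASE WHEN is_account_group_member('admins') THEN TRUE
               ELSE location = 'EDINBURGH' END;

    GRANT SELECT ON main.default.department_filtered TO `data-consumers`;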

Requirements

  • You must be a Databricks account admin.

  • Your Databricks account must be on the Premium plan or above.

  • In AWS, you must have the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships.

  • You must have at least one workspace that you want to use with Unity Catalog. See Create and manage workspaces.

Configure a storage bucket and IAM role in AWS

In this step, you create the AWS objects required by Unity Catalog to store and access data in your AWS account.

  1. Find your Databricks account ID.

    1. Log in to the Databricks account console.

    2. Click User Profile.

    3. From the pop-up, copy the Account ID value.

  2. In AWS, create an S3 bucket.

    This S3 bucket will be the root storage location for managed tables in Unity Catalog. Use a dedicated S3 bucket for each metastore and locate it in the same region as the workspaces you want to access the data from. Make a note of the S3 bucket path, which starts with s3://.

    This default storage location can be overridden at the catalog and schema levels.

    Important

    The bucket name cannot include dot notation (for example, incorrect.bucket.name.notation). For more bucket naming guidance, see the AWS bucket naming rules.

    If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.

  3. Create an IAM role that allows access to the S3 bucket.

    Set up a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. Paste the following policy JSON into the Trust Relationship tab.

    • Do not modify the role ARN in the Principal section. This is a static value that references a role created by Databricks.

    • In the sts:ExternalId section, replace <DATABRICKS_ACCOUNT_ID> with the Databricks account ID you found in step 1 (not your AWS account ID).

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<DATABRICKS_ACCOUNT_ID>"
            }
          }
        }
      ]
    }
    
  4. In AWS, create an IAM policy in the same AWS account as the S3 bucket.

    To avoid unexpected issues, you must use the following sample policy, replacing these placeholder values:

    • <BUCKET>: The name of the S3 bucket you created in the previous step.

    • <KMS_KEY>: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. If encryption is disabled, remove the entire KMS section of the IAM policy.

    • <AWS_ACCOUNT_ID>: The Account ID of the current AWS account (not your Databricks account).

    • <AWS_IAM_ROLE_NAME>: The name of the AWS IAM role that you created in the previous step.

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Action": [
                 "s3:GetObject",
                 "s3:PutObject",
                 "s3:DeleteObject",
                 "s3:ListBucket",
                 "s3:GetBucketLocation",
                 "s3:GetLifecycleConfiguration",
                 "s3:PutLifecycleConfiguration"
             ],
             "Resource": [
                 "arn:aws:s3:::<BUCKET>/*",
                 "arn:aws:s3:::<BUCKET>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "kms:Decrypt",
                 "kms:Encrypt",
                 "kms:GenerateDataKey*"
             ],
             "Resource": [
                 "arn:aws:kms:<KMS_KEY>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "sts:AssumeRole"
             ],
             "Resource": [
                 "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>"
             ],
             "Effect": "Allow"
         }
       ]
    }
    

    Note

    • If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.

    • Databricks uses GetLifecycleConfiguration and PutLifecycleConfiguration to manage lifecycle policies for the personal staging locations used by Partner Connect and the upload data UI.

  5. Attach the IAM policy to the IAM role.

    On the IAM role’s Permission tab, attach the IAM policy you just created.

Create your first metastore and attach a workspace

To use Unity Catalog, you must create a metastore. A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized.

You create a metastore for each region in which your organization operates. You can link each of these regional metastores to any number of workspaces in that region.

Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces.

You can access data across metastores using Delta Sharing.

To create a metastore:

  1. Log in to the Databricks account console.

  2. Click Data.

  3. Click Create Metastore.

    Enter the following:

    • A name for the metastore.

    • The region where you want to deploy the metastore.

      This must be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the storage bucket you created earlier.

    • The S3 bucket path (you can omit s3://) and IAM role name for the bucket and role you created in Configure a storage bucket and IAM role in AWS.

  4. Click Create.

  5. When prompted, select workspaces to link to the metastore.

    For more information about assigning workspaces to metastores, see Enable a workspace for Unity Catalog.

The user who creates a metastore is its owner, also called the metastore admin. The metastore admin can create top-level objects in the metastore such as catalogs and can manage access to tables and other objects. Databricks recommends that you reassign the metastore admin role to a group. See (Recommended) Transfer ownership of your metastore to a group.
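
For example, after a workspace is attached, the metastore admin (or a user who has been granted the CREATE CATALOG privilege) can create catalogs and schemas from a notebook or SQL editor. This is a minimal SQL sketch; the sales catalog, reporting schema, and the grants to data-consumers are illustrative.

    -- Create a top-level catalog and a schema inside it.
    CREATE CATALOG IF NOT EXISTS sales COMMENT 'Catalog for the sales team';
    CREATE SCHEMA IF NOT EXISTS sales.reporting;

    -- Let a group use the new objects and create tables in the schema.
    GRANT USE CATALOG ON CATALOG sales TO `data-consumers`;
    GRANT USE SCHEMA, CREATE TABLE ON SCHEMA sales.reporting TO `data-consumers`;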

Add users and groups

A Unity Catalog metastore can be shared across multiple Databricks workspaces. Unity Catalog takes advantage of Databricks account-level identity management to provide a consistent view of users, service principals, and groups across all workspaces. In this step, you create users and groups in the account console and then choose the workspaces these identities can access.

Note

  • If you have an existing account and workspaces, you probably already have users and groups in your account, so you can skip this step.

  • If you have a large number of users or groups in your account, or if you prefer to manage identities outside of Databricks, you can sync users and groups from your identity provider (IdP).

To add a user and group using the account console:

  1. Log in to the account console (requires the user to be an account admin).

  2. Click User management.

  3. Add a user:

    1. Click Users.

    2. Click Add User.

    3. Enter a name and email address for the user.

    4. Click Send Invite.

  4. Add a group:

    1. Click Groups.

    2. Click Add Group.

    3. Enter a name for the group.

    4. Click Confirm.

    5. When prompted, add users to the group.

  5. Add a user or group to a workspace, where they can perform data science, data engineering, and data analysis tasks using the data managed by Unity Catalog:

    1. In the sidebar, click Workspaces.

    2. On the Permissions tab, click Add permissions.

    3. Search for and select the user or group, assign the permission level (workspace User or Admin), and click Save.

To get started, create a group called data-consumers. This group is used later in this walk-through.

Create a cluster or SQL warehouse

Tables defined in Unity Catalog are protected by fine-grained access controls. To ensure that access controls are enforced, Unity Catalog requires compute resources to conform to a secure configuration. Unity Catalog is secure by default, meaning that non-conforming compute resources cannot access tables in Unity Catalog.

Databricks provides two kinds of compute resources:

  • Clusters, which are used for workloads in the Data Science & Engineering and Databricks Machine Learning persona-based environments.

  • SQL warehouses, which are used for executing queries in Databricks SQL.

You can use either of these compute resources to work with Unity Catalog, depending on the environment you are using: SQL warehouses for Databricks SQL or clusters for the Data Science & Engineering and Databricks Machine Learning environments.

Create a cluster

To create a cluster that can access Unity Catalog:

  1. Log in to your workspace as a workspace admin or user with permission to create clusters.

  2. Click Compute.

  3. Click Create cluster.

    1. Enter a name for the cluster.

    2. Set the Access mode to Single user.

      Only Single user and Shared access modes support Unity Catalog. See What is cluster access mode?.

    3. Set Databricks runtime version to Runtime: 11.1 (Scala 2.12, Spark 3.3.0) or higher.

  4. Click Create Cluster.

For specific configuration options, see Create a cluster.

Create a SQL warehouse

SQL warehouses support Unity Catalog by default, and there is no special configuration required.

To create a SQL warehouse:

  1. Log in to your workspace as a workspace admin or user with permission to create clusters.

  2. From the persona switcher, select SQL.

  3. Click Create and select SQL Warehouse.

For specific configuration options, see Create a SQL warehouse.

Create your first table

In Unity Catalog, metastores contain catalogs that contain schemas (databases), and you always create a table in a schema.

You can refer to a table using three-level notation:

<catalog>.<schema>.<table>

A newly-created metastore contains a catalog named main with an empty schema named default. In this example, you will create a table named department in the default schema in the main catalog.

To create a table, you must have the CREATE TABLE permission on the parent schema, the USE CATALOG permission on the parent catalog, and the USE SCHEMA permission on the parent schema. Metastore admins have these permissions by default.

The main catalog and main.default schema are unique in that all users begin with the USE CATALOG privilege on the main catalog and the USE SCHEMA privilege on the main.default schema. If you are not a metastore admin, either a metastore admin or the owner of the schema can grant you the CREATE TABLE privilege on the main.default schema.
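
If you are not sure which privileges are already in place, a metastore admin or the object owner can review them before granting anything new, as in this brief SQL sketch:

    -- List the principals and privileges granted on the catalog and schema.
    SHOW GRANTS ON CATALOG main;
    SHOW GRANTS ON SCHEMA main.default;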

Follow these steps to create a table manually. You can also import an example notebook and run it to create a catalog, schema, and table, along with managing permissions on each.

  1. Create a notebook and attach it to the cluster you created in Create a cluster or SQL warehouse.

    For the notebook language, select SQL, Python, R, or Scala, depending on the language you want to use.

  2. Grant permission to create tables on the default schema.

    To create tables, users require the CREATE TABLE and USE SCHEMA permissions on the schema in addition to the USE CATALOG permission on the catalog. All users receive the USE CATALOG privilege on the main catalog and the USE SCHEMA privilege on the main.default schema when a metastore is created.

    Metastore admins and the owner of the schema main.default can use the following command to grant the CREATE TABLE privilege to a user or group (shown in SQL, Python, R, and Scala):

    SQL:

    GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`;

    Python:

    spark.sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    R:

    library(SparkR)

    sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    Scala:

    spark.sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    For example, to allow members of the group data-consumers to create tables in main.default:

    SQL:

    GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`;

    Python:

    spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")

    R:

    library(SparkR)

    sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")

    Scala:

    spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")

    Run the cell.

  3. Create a new table called department.

    Add a new cell to the notebook. Paste in the following code (shown in SQL, Python, R, and Scala), which creates the department table, defines its columns, and inserts five rows.

    SQL:

    CREATE TABLE main.default.department
    (
      deptcode   INT,
      deptname  STRING,
      location  STRING
    );
    
    INSERT INTO main.default.department VALUES
      (10, 'FINANCE', 'EDINBURGH'),
      (20, 'SOFTWARE', 'PADDINGTON'),
      (30, 'SALES', 'MAIDSTONE'),
      (40, 'MARKETING', 'DARLINGTON'),
      (50, 'ADMIN', 'BIRMINGHAM');
    
    Python:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    schema = StructType([
      StructField("deptcode", IntegerType(), True),
      StructField("deptname", StringType(), True),
      StructField("location", StringType(), True)
    ])

    spark.catalog.createTable(
      tableName = "main.default.department",
      schema = schema
    )
    
    dfInsert = spark.createDataFrame(
      data = [
        (10, "FINANCE", "EDINBURGH"),
        (20, "SOFTWARE", "PADDINGTON"),
        (30, "SALES", "MAIDSTONE"),
        (40, "MARKETING", "DARLINGTON"),
        (50, "ADMIN", "BIRMINGHAM")
      ],
      schema = schema
    )
    
    dfInsert.write.saveAsTable(
      name = "main.default.department",
      mode = "append"
    )
    
    R:

    library(SparkR)
    
    schema = structType(
      structField("deptcode", "integer", TRUE),
      structField("deptname", "string", TRUE),
      structField("location", "string", TRUE)
    )
    
    df = createDataFrame(
      data = list(),
      schema = schema
    )
    
    saveAsTable(
      df = df,
      tableName = "main.default.department"
    )
    
    data = list(
      list("deptcode" = 10L, "deptname" = "FINANCE", "location" = "EDINBURGH"),
      list("deptcode" = 20L, "deptname" = "SOFTWARE", "location" = "PADDINGTON"),
      list("deptcode" = 30L, "deptname" = "SALES", "location" = "MAIDSTONE"),
      list("deptcode" = 40L, "deptname" = "MARKETING", "location" = "DARLINGTON"),
      list("deptcode" = 50L, "deptname" = "ADMIN", "location" = "BIRMINGHAM")
    )
    
    dfInsert = createDataFrame(
      data = data,
      schema = schema
    )
    
    insertInto(
      x = dfInsert,
      tableName = "main.default.department"
    )
    
    Scala:

    import spark.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType
    
    val df = spark.createDataFrame(
      new java.util.ArrayList[Row](),
      new StructType()
        .add("deptcode", "int")
        .add("deptname", "string")
        .add("location", "string")
    )
    
    df.write
      .format("delta")
      .saveAsTable("main.default.department")
    
    val dfInsert = Seq(
      (10, "FINANCE", "EDINBURGH"),
      (20, "SOFTWARE", "PADDINGTON"),
      (30, "SALES", "MAIDSTONE"),
      (40, "MARKETING", "DARLINGTON"),
      (50, "ADMIN", "BIRMINGHAM")
    ).toDF("deptcode", "deptname", "location")
    
    dfInsert.write.insertInto("main.default.department")
    

    Run the cell.

  4. Query the table.

    Add a new cell to the notebook. Paste in the following code, then run the cell.

    SQL:

    SELECT * FROM main.default.department;

    Python:

    display(spark.table("main.default.department"))

    R:

    display(tableToDF("main.default.department"))

    Scala:

    display(spark.table("main.default.department"))
    
  5. Grant the ability to read and query the table to the data-consumers group that you created in Add users and groups.

    Add a new cell to the notebook and paste in the following code:

    SQL:

    GRANT SELECT ON main.default.department TO `data-consumers`;

    Python:

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")

    R:

    sql("GRANT SELECT ON main.default.department TO `data-consumers`")

    Scala:

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")
    

    Note

    To grant read access to all account-level users instead of only data-consumers, use the group name account users instead.

    Run the cell.
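
    For example, the broader grant described in the note would be:

    GRANT SELECT ON main.default.department TO `account users`;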

Shortcut: use an example notebook to create a catalog, schema, and table

You can use one of the following example notebooks to create a catalog, schema, and table, and to manage permissions on each.

Create and manage a Unity Catalog table with SQL

Create and manage a Unity Catalog table with Python

(Optional) Install the Unity Catalog CLI

The Unity Catalog CLI is part of the Databricks CLI. To use the Unity Catalog CLI, do the following:

  1. Set up the CLI.

  2. Set up authentication.

  3. Optionally, create one or more connection profiles to use with the CLI.

  4. Learn how to use the Databricks CLI in general.

  5. Begin using the Unity Catalog CLI.

Next steps