Get started using Unity Catalog

Preview

Unity Catalog is in Public Preview. To participate in the preview, contact your Databricks representative.

This guide helps you get started with Unity Catalog, the Databricks data governance framework.

Requirements

Configure a storage bucket and IAM role in AWS

  1. Find your Databricks account ID.

    1. Log in to the Databricks account console.

    2. Click User Profile.

    3. From the pop-up, copy the Account ID value.

  2. In AWS, create an S3 bucket.

    This S3 bucket will be the default storage location for managed tables in Unity Catalog. Use a dedicated S3 bucket for each metastore. Make a note of the S3 bucket path, which starts with s3://.

    Important

    The bucket name cannot contain dots (.). For more bucket naming guidance, see the AWS bucket naming rules.

    If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.
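
    If you prefer to script this step, the following sketch creates the bucket with boto3, the AWS SDK for Python. The bucket name and region are placeholders; substitute your own values and keep the bucket name free of dots.

    import boto3

    s3 = boto3.client("s3", region_name="us-west-2")

    # Hypothetical bucket name; it must be globally unique and must not contain dots.
    # For us-east-1, omit the CreateBucketConfiguration argument.
    s3.create_bucket(
        Bucket="my-unity-catalog-metastore-bucket",
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )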

  3. In AWS, create an IAM policy in the same AWS account as the S3 bucket.

    In the following sample policy, replace the following values:

    • <BUCKET>: The name of the S3 bucket from the previous step.

    • <KMS_KEY>: The name of the KMS key that encrypts the S3 bucket contents, if encryption is enabled. If encryption is disabled, remove the KMS section of the IAM policy.

    • <AWS_ACCOUNT_ID>: The Account ID of the current AWS account (not your Databricks account).

    • <AWS_IAM_ROLE_NAME>: The name of the AWS IAM role that will be created in the next step.

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Action": [
                 "s3:GetObject",
                 "s3:GetObjectVersion",
                 "s3:PutObject",
                 "s3:PutObjectAcl",
                 "s3:DeleteObject",
                 "s3:ListBucket",
                 "s3:GetBucketLocation"
             ],
             "Resource": [
                 "arn:aws:s3:::<BUCKET>/*",
                 "arn:aws:s3:::<BUCKET>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "kms:Decrypt",
                 "kms:Encrypt",
                 "kms:GenerateDataKey*"
             ],
             "Resource": [
                 "arn:aws:kms:<KMS_KEY>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "sts:AssumeRole"
             ],
             "Resource": [
                 "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>"
             ],
             "Effect": "Allow"
         }
       ]
    }
    

    Note

    If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.
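
    If you are scripting the setup, a minimal boto3 sketch for this step might look like the following. The policy name is hypothetical, and the snippet assumes you have saved the sample policy above, with the placeholders replaced, to a local JSON file.

    import boto3

    iam = boto3.client("iam")

    # unity-catalog-policy.json holds the sample policy above, with the
    # <BUCKET>, <KMS_KEY>, <AWS_ACCOUNT_ID>, and <AWS_IAM_ROLE_NAME> placeholders replaced.
    with open("unity-catalog-policy.json") as f:
        policy_document = f.read()

    response = iam.create_policy(
        PolicyName="unity-catalog-s3-access",  # hypothetical policy name
        PolicyDocument=policy_document,
    )
    print(response["Policy"]["Arn"])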

  4. Create an IAM role that uses the IAM policy you created in the previous step.

    1. Set EC2 as the trusted entity. This is only a placeholder; you will replace the trust relationship in a later step.

    2. In the role's Permissions tab, attach the IAM policy you just created.

    3. Set up a cross-account trust relationship so that Unity Catalog can assume the role and access the data in the bucket on behalf of Databricks users. To do this, paste the following policy JSON into the Trust Relationship tab.

      • Do not modify the role ARN in the Principal section, which is a static value that references a role created by Databricks.

      • In the sts:ExternalId section, replace <DATABRICKS_ACCOUNT_ID> with your Databricks account ID from the first step (not your AWS account ID).

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "StringEquals": {
                "sts:ExternalId": "<DATABRICKS_ACCOUNT_ID>"
              }
            }
          }
        ]
      }
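
      If you are scripting this step with boto3, you can set the trust policy directly when you create the role, which makes the EC2 placeholder from step 1 unnecessary. The role name, policy name, and file name below are hypothetical and must match the values you used in the IAM policy.

      import boto3

      iam = boto3.client("iam")

      # unity-catalog-trust-policy.json holds the trust policy above, with
      # <DATABRICKS_ACCOUNT_ID> replaced by your Databricks account ID.
      with open("unity-catalog-trust-policy.json") as f:
          trust_policy = f.read()

      iam.create_role(
          RoleName="unity-catalog-role",          # hypothetical role name
          AssumeRolePolicyDocument=trust_policy,  # sets the cross-account trust relationship
      )

      # Attach the IAM policy created in the previous step.
      iam.attach_role_policy(
          RoleName="unity-catalog-role",
          PolicyArn="arn:aws:iam::<AWS_ACCOUNT_ID>:policy/unity-catalog-s3-access",
      )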
      

Create your first metastore and attach a workspace

A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a 3-level namespace (catalog.schema.table) by which data can be organized.

A single metastore can be shared across multiple Databricks workspaces in an account. Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces. Databricks allows one metastore per region. If you have a multi-region Databricks deployment, you may want separate metastores for each region, but it is good practice to use a small number of metastores unless your organization requires hard isolation boundaries between sets of data. Data cannot easily be joined or queried across metastores.

To create a metastore:

  1. Log in to the Databricks account console.

  2. Click Data.

  3. Click Create Metastore.

    1. Enter a name for the metastore.

    2. Enter the region where the metastore will be deployed. For best performance, co-locate the workspaces, metastore and cloud storage location in the same cloud region.

    3. Enter the S3 bucket path (you can omit s3://) and IAM role name for the bucket and role you created in Configure a storage bucket and IAM role in AWS.

  4. Click Create.

    The user who creates a metastore is its owner and metastore admin. Databricks recommends that you reassign the metastore admin role to a group. See (Recommended) Sync account-level identities from your IdP.

  5. When prompted, select workspaces to link to the metastore.

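The account console is the simplest way to create a metastore, but you can also script it against the Unity Catalog REST API. The sketch below is a rough outline only: the endpoint path, payload fields, workspace URL, and token are assumptions, so verify them against the Unity Catalog API reference before relying on it.

    import requests

    # Assumed endpoint and payload; confirm against the Unity Catalog API reference.
    host = "https://<workspace-instance>"   # for example, https://dbc-a1b2c3d4-e5f6.cloud.databricks.com
    token = "<personal-access-token>"

    resp = requests.post(
        f"{host}/api/2.1/unity-catalog/metastores",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "name": "primary-metastore",    # hypothetical metastore name
            "storage_root": "s3://my-unity-catalog-metastore-bucket/metastore",
            "region": "us-west-2",
        },
    )
    resp.raise_for_status()
    print(resp.json())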

Add users and groups

A Unity Catalog metastore can be shared across multiple Databricks workspaces. So that Databricks has a consistent view of users and groups across all workspaces, you create users and groups as account-level identities. Follow these steps to create account-level identities manually.

To manually add a user:

  1. Log in to the account console (you must be an account admin).

  2. Click Users and Groups.

  3. To add a user:

    1. Click Users.

    2. Click Add User.

    3. Enter a name and email address for the user.

    4. Click Send Invite.

    To log into a workspace, a user must also be added to the workspace. To create workspace-level users, see Manage workspace-level users.

  4. To add a group:

    1. Click Groups.

    2. Click Add Group.

    3. Enter a name for the group.

    4. Click Confirm.

    5. When prompted, add users to the group.

To get started, create a group called data-consumers. This group is used later in this walk-through.
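
If you prefer to create the data-consumers group programmatically instead of through the console, a rough sketch against the account-level SCIM API might look like the following. The base path and the basic-auth credentials are assumptions; check the SCIM API documentation for the authentication options available to your account.

    import requests

    account_id = "<databricks-account-id>"
    base = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{account_id}/scim/v2"

    # Assumes an account admin identity that can authenticate with basic auth.
    resp = requests.post(
        f"{base}/Groups",
        auth=("<account-admin-email>", "<password>"),
        json={"displayName": "data-consumers"},
    )
    resp.raise_for_status()
    print(resp.json()["id"])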

Create a compute resource

Tables defined in Unity Catalog are protected by fine-grained access controls. To ensure that access controls are enforced, Unity Catalog requires compute resources to conform to a secure configuration. Unity Catalog is secure by default, meaning that non-conforming compute resources cannot access tables in Unity Catalog.

Databricks provides two kinds of compute resources:

  • Clusters, which are used for workloads in the Data Science & Engineering and Databricks Machine Learning persona-based environments.

  • SQL warehouses, which are used for executing queries in Databricks SQL.

To create a compute resource of either type that can access data in Unity Catalog, follow the steps in the appropriate section below.

Create a cluster

To create a cluster that can access Unity Catalog, the workspace must be enabled for Unity Catalog.

Important

When Unity Catalog is enabled in a workspace, the following features are not available on new Shared access mode clusters created from the UI. Databricks recommends against assigning a Unity Catalog metastore to workspaces that require these features on shared clusters.

  • Cluster-scoped and global init scripts

  • Cluster-scoped libraries

  • Python UDFs

  1. Log in to the workspace as a workspace-level admin.

  2. Click Compute.

  3. Click Create cluster.

    1. Enter a name for the cluster.

    2. Set Databricks runtime version to Runtime: 11.1 (Scala 2.12.14, Spark 3.3.0) or higher.

  4. Set Access Mode to Single User or Shared, depending on your use case.

    Shared clusters can be used by multiple users at the same time, but only SQL and Python workloads are supported.

    Single User clusters can run workloads in Scala, R, Python, and SQL. All queries execute with the privileges of the assigned user, and only that user (by default, the cluster's owner) can attach to the cluster. Databricks recommends running automated jobs on single user clusters, with the job and cluster owned by the same identity, ideally a service principal. Single user access mode does not support dynamic views.

    For more information about the features available in each access mode, see the cluster access mode documentation.

  5. Click Create Cluster.
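
If you create clusters with the Clusters API instead of the UI, the access mode is controlled by the data_security_mode field. The following is a minimal sketch under the assumption that the placeholder workspace URL, token, node type, and user name are replaced with your own values.

    import requests

    host = "https://<workspace-instance>"
    token = "<personal-access-token>"

    resp = requests.post(
        f"{host}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "cluster_name": "uc-single-user-cluster",   # hypothetical cluster name
            "spark_version": "11.1.x-scala2.12",        # Databricks Runtime 11.1 or above
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
            "data_security_mode": "SINGLE_USER",        # or "USER_ISOLATION" for a Shared cluster
            "single_user_name": "someone@example.com",  # the user allowed to attach to the cluster
        },
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])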

Create a SQL warehouse

To create a SQL warehouse that can access Unity Catalog data:

  1. Log in to the workspace as a workspace-level admin.

  2. From the persona switcher, select SQL.

  3. Click Create, then select SQL Warehouse.

  4. Under Advanced Settings, set Channel to Preview.

  5. (Optional) Configure the SQL warehouse as a Serverless SQL warehouse (Preview).

    Serverless SQL warehouses start within seconds, rather than minutes. For more information, see Serverless compute.

SQL warehouses are automatically created with the correct security mode, with no configuration required.
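
If you automate warehouse creation with the SQL Warehouses API rather than the UI, a rough equivalent of the steps above might look like the following. The field names and values are assumptions; verify them against the SQL Warehouses API reference.

    import requests

    host = "https://<workspace-instance>"
    token = "<personal-access-token>"

    resp = requests.post(
        f"{host}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "name": "uc-warehouse",                       # hypothetical warehouse name
            "cluster_size": "Small",
            "channel": {"name": "CHANNEL_NAME_PREVIEW"},  # corresponds to the Preview channel setting
            "enable_serverless_compute": False,           # set True for a Serverless SQL warehouse
        },
    )
    resp.raise_for_status()
    print(resp.json()["id"])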

Create your first table

In Unity Catalog, metastores contain catalogs that contain schemas (databases), and you always create a table in a schema.

You can refer to a table using three-level notation:

<catalog>.<schema>.<table>

A newly-created metastore contains a catalog named main with an empty schema named default. In this example, you will create a table named department in the default schema in the main catalog.

To create a table, you must be an account admin, metastore admin, or a user with the CREATE permission on the parent schema and the USAGE permission on the parent catalog and schema.

Follow these steps to create a table manually. You can also import an example notebook and run it to create a catalog, schema, and table, along with managing permissions on each.

  1. Create a notebook and attach it to the cluster you created in Create a compute resource.

    For the notebook language, select SQL, Python, R, or Scala, depending on the language you want to use.

  2. Grant permission to create tables on the default schema.

    To create tables, users require the CREATE and USAGE permissions on the schema in addition to the USAGE permission on the catalog. All users receive the USAGE privilege on the main catalog and the main.default schema when a metastore is created.

    Account admins, metastore admins, and the owner of the main.default schema can use the following command to grant the CREATE privilege to a user or group. Use the variant that matches your notebook language:

    SQL:

    GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`;

    Python:

    spark.sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    R:

    library(SparkR)

    sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    Scala:

    spark.sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

    For example, to allow members of the group data-consumers to create tables in main.default:

    SQL:

    GRANT CREATE ON SCHEMA main.default TO `data-consumers`;

    Python:

    spark.sql("GRANT CREATE ON SCHEMA main.default TO `data-consumers`")

    R:

    library(SparkR)

    sql("GRANT CREATE ON SCHEMA main.default TO `data-consumers`")

    Scala:

    spark.sql("GRANT CREATE ON SCHEMA main.default TO `data-consumers`")

    Run the cell.

  3. Create a new table called department.

    Add a new cell to the notebook. Paste in the code for your notebook language, which creates the department table with its columns and inserts five rows into it.

    SQL:

    CREATE TABLE main.default.department
    (
      deptcode   INT,
      deptname  STRING,
      location  STRING
    );
    
    INSERT INTO main.default.department VALUES
      (10, 'FINANCE', 'EDINBURGH'),
      (20, 'SOFTWARE', 'PADDINGTON'),
      (30, 'SALES', 'MAIDSTONE'),
      (40, 'MARKETING', 'DARLINGTON'),
      (50, 'ADMIN', 'BIRMINGHAM');
    
    Python:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    schema = StructType([ \
      StructField("deptcode", IntegerType(), True),
      StructField("deptname", StringType(), True),
      StructField("location", StringType(), True)
    ])
    
    spark.catalog.createTable(
      tableName = "main.default.department",
      schema = schema \
    )
    
    dfInsert = spark.createDataFrame(
      data = [
        (10, "FINANCE", "EDINBURGH"),
        (20, "SOFTWARE", "PADDINGTON"),
        (30, "SALES", "MAIDSTONE"),
        (40, "MARKETING", "DARLINGTON"),
        (50, "ADMIN", "BIRMINGHAM")
      ],
      schema = schema
    )
    
    dfInsert.write.saveAsTable(
      name = "main.default.department",
      mode = "append"
    )
    
    R:

    library(SparkR)
    
    schema = structType(
      structField("deptcode", "integer", TRUE),
      structField("deptname", "string", TRUE),
      structField("location", "string", TRUE)
    )
    
    df = createDataFrame(
      data = list(),
      schema = schema
    )
    
    saveAsTable(
      df = df,
      tableName = "main.default.department"
    )
    
    data = list(
      list("deptcode" = 10L, "deptname" = "FINANCE", "location" = "EDINBURGH"),
      list("deptcode" = 20L, "deptname" = "SOFTWARE", "location" = "PADDINGTON"),
      list("deptcode" = 30L, "deptname" = "SALES", "location" = "MAIDSTONE"),
      list("deptcode" = 40L, "deptname" = "MARKETING", "location" = "DARLINGTON"),
      list("deptcode" = 50L, "deptname" = "ADMIN", "location" = "BIRMINGHAM")
    )
    
    dfInsert = createDataFrame(
      data = data,
      schema = schema
    )
    
    insertInto(
      x = dfInsert,
      tableName = "main.default.department"
    )
    
    Scala:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType
    import spark.implicits._
    
    val df = spark.createDataFrame(
      new java.util.ArrayList[Row](),
      new StructType()
        .add("deptcode", "int")
        .add("deptname", "string")
        .add("location", "string")
    )
    
    df.write
      .format("delta")
      .saveAsTable("main.default.department")
    
    val dfInsert = Seq(
      (10, "FINANCE", "EDINBURGH"),
      (20, "SOFTWARE", "PADDINGTON"),
      (30, "SALES", "MAIDSTONE"),
      (40, "MARKETING", "DARLINGTON"),
      (50, "ADMIN", "BIRMINGHAM")
    ).toDF("deptcode", "deptname", "location")
    
    dfInsert.write.insertInto("main.default.department")
    

    Run the cell.

  4. Query the table.

    Add a new cell to the notebook. Paste in the following code, then run the cell.

    SQL:

    SELECT * FROM main.default.department;

    Python:

    display(spark.table("main.default.department"))

    R:

    display(tableToDF("main.default.department"))

    Scala:

    display(spark.table("main.default.department"))
    
  5. Grant the ability to read and query the table to the data-consumers group that you created in Add users and groups.

    Add a new cell to the notebook and paste in the following code:

    SQL:

    GRANT SELECT ON main.default.department TO `data-consumers`;

    Python:

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")

    R:

    sql("GRANT SELECT ON main.default.department TO `data-consumers`")

    Scala:

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")
    

    Note

    To grant read access to all account-level users rather than only the data-consumers group, use the group name account users.
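
    For example, a Python equivalent of that broader grant (assuming you run it in a Python notebook cell) would be:

    spark.sql("GRANT SELECT ON main.default.department TO `account users`")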

    Run the cell.

Short-cut: use an example notebook to create a catalog, schema, and table

You can use one of the following example notebooks to create a catalog, schema, and table, as well as manage permissions on each.

Create and manage a Unity Catalog table with SQL

Create and manage a Unity Catalog table with Python

(Optional) Install the Unity Catalog CLI

The Unity Catalog CLI is part of the Databricks CLI. To use the Unity Catalog CLI, do the following:

  1. Set up the CLI.

  2. Set up authentication.

  3. Optionally, create one or more connection profiles to use with the CLI.

  4. Learn how to use the Databricks CLI in general.

  5. Begin using the Unity Catalog CLI.

Next steps