Create a Unity Catalog metastore

This article shows how to create a Unity Catalog metastore and link it to workspaces.

Important

For workspaces that were enabled for Unity Catalog automatically, the instructions in this article are unnecessary. Databricks began to enable new workspaces for Unity Catalog automatically on November 8, 2023, with a rollout proceeding gradually across accounts. You must follow the instructions in this article only if you have a workspace and don’t already have a metastore in your workspace region. To determine whether a metastore already exists in your region, see Automatic enablement of Unity Catalog.

A metastore is the top-level container for data in Unity Catalog. Unity Catalog metastores register metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them. Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized. You must have one metastore for each region in which your organization operates. To work with Unity Catalog, users must be on a workspace that is attached to a metastore in their region.
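
As a small illustration of the three-level namespace, the following Python sketch splits a fully qualified name into its catalog, schema, and table parts. The helper name parse_three_level_name and the example name main.sales.orders are hypothetical and not part of any Databricks API. The sketch also assumes that no part of the name contains a dot.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_three_level_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts (hypothetical helper)."""
    parts = full_name.split(".")
    # Require exactly three non-empty parts.
    if len(parts) != 3 or any(not part for part in parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_three_level_name("main.sales.orders")
print(name.catalog, name.schema, name.table)
```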

To create a metastore, you do the following:

  1. In your AWS account, optionally create a storage location for metastore-level storage of managed tables and volumes.

    For information to help you decide whether you need metastore-level storage, see (Optional) Create metastore-level storage and Data is physically separated in storage.

  2. In your AWS account, create an IAM role that gives access to that storage location.

  3. In Databricks, create the metastore, attaching the storage location, and assign workspaces to the metastore.

Note

In addition to the approaches described in this article, you can also create a metastore by using the Databricks Terraform provider, specifically the databricks_metastore resource. To enable Unity Catalog to access the metastore, use databricks_metastore_data_access. To link workspaces to a metastore, use databricks_metastore_assignment.

Before you begin

Familiarize yourself with the basic Unity Catalog concepts, including metastores and managed storage. See What is Unity Catalog?.

You should also confirm that you meet the following requirements for all setup steps:

  • You must be a Databricks account admin.

  • Your Databricks account must be on the Premium plan or above.

  • If you want to set up metastore-level root storage, you must have the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships in your AWS account.

Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS

In this step, which is optional, you create the S3 bucket required by Unity Catalog to store managed table and volume data at the metastore level. You create the S3 bucket in your own AWS account. To determine whether you need metastore-level storage, see (Optional) Create metastore-level storage.

  1. In AWS, create an S3 bucket.

    This S3 bucket will be the metastore-level storage location for managed tables and managed volumes in Unity Catalog. This default storage location can be overridden at the catalog and schema levels. See Managed storage.

    Requirements:

    • If you have more than one metastore, you should use a dedicated S3 bucket for each one.

    • Locate the bucket in the same region as the workspaces you want to access the data from.

    • The bucket name cannot include dot notation (for example, incorrect.bucket.name.notation). For more bucket naming guidance, see the AWS bucket naming rules.

  2. Make a note of the S3 bucket path, which starts with s3://.

  3. If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.
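
The naming requirement above can be checked before you create the bucket. The following Python sketch is one possible pre-flight check, not an official validator. The function names and the example bucket name are hypothetical. It covers only the no-dots rule from this article plus the basic AWS length and character rules, not every AWS naming rule.

```python
import re


def bucket_name_problems(name):
    """Return a list of problems with a proposed bucket name (hypothetical helper)."""
    problems = []
    # Requirement from this article: no dot notation in the bucket name.
    if "." in name:
        problems.append("contains a dot")
    # Basic AWS rules: 3-63 characters; lowercase letters, digits, dots, hyphens.
    if not 3 <= len(name) <= 63:
        problems.append("length must be 3-63 characters")
    if not re.fullmatch(r"[a-z0-9.-]+", name):
        problems.append("only lowercase letters, digits, and hyphens are allowed")
    return problems


def s3_path(name):
    """Build the s3:// path that you note down in step 2."""
    return f"s3://{name}"


print(bucket_name_problems("incorrect.bucket.name.notation"))
print(s3_path("my-uc-metastore-root"))
```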

Step 2 (Optional): Create an IAM role to access the storage location

In this step, which is required only if you completed step 1, you create the IAM role required by Unity Catalog to access the S3 bucket that you created in the previous step.

Role creation is a two-step process. First you simply create the role, adding a temporary trust relationship policy that you then modify in a later step. You must modify the trust policy after you create the role because your role must be self-assuming—that is, it must be configured to trust itself. The role must therefore exist before you add the self-assumption statement. For information about self-assuming roles, see this Amazon blog article.

  1. Find your Databricks account ID.

    1. Log in to the Databricks account console.

    2. Click your username.

    3. From the menu, copy the Account ID value.

  2. In AWS, create an IAM role with a Custom Trust Policy.

  3. In the Custom Trust Policy field, paste the following policy JSON, replacing <DATABRICKS-ACCOUNT-ID> with the Databricks account ID you found in step 1 (not your AWS account ID).

    This policy establishes a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. This is specified by the ARN in the Principal section. It is a static value that references a role created by Databricks. Do not modify it.

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {
          "AWS": [
            "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
          ]
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "<DATABRICKS-ACCOUNT-ID>"
          }
        }
      }]
    }
    
  4. Skip the permissions policy configuration. You’ll go back to add that in a later step.

  5. Save the IAM role.

  6. Modify the trust relationship policy to make it “self-assuming.”

    1. Return to your saved IAM role and go to the Trust Relationships tab.

    2. Edit the trust relationship policy, adding the following ARN to the “Allow” statement.

      Replace <YOUR-AWS-ACCOUNT-ID> and <THIS-ROLE-NAME> with your actual IAM role values.

      "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"
      

    Your policy should now look like this (with replacement text updated to use your Databricks account ID and IAM role values):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
              "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"
             ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<DATABRICKS-ACCOUNT-ID>"
            }
          }
        }
      ]
    }
    
  7. In AWS, create an IAM policy in the same AWS account as the S3 bucket.

    To avoid unexpected issues, you must use the following sample policy, replacing the following values:

    • <BUCKET>: The name of the S3 bucket you created in Step 1.

    • <KMS-KEY>: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. If encryption is disabled, remove the entire KMS section of the IAM policy.

    • <AWS-ACCOUNT-ID>: The Account ID of the current AWS account (not your Databricks account).

    • <AWS-IAM-ROLE-NAME>: The name of the AWS IAM role that you created earlier in this step.

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Action": [
                 "s3:GetObject",
                 "s3:PutObject",
                 "s3:DeleteObject",
                 "s3:ListBucket",
                 "s3:GetBucketLocation"
             ],
             "Resource": [
                 "arn:aws:s3:::<BUCKET>/*",
                 "arn:aws:s3:::<BUCKET>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "kms:Decrypt",
                 "kms:Encrypt",
                 "kms:GenerateDataKey*"
             ],
             "Resource": [
                 "arn:aws:kms:<KMS-KEY>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "sts:AssumeRole"
             ],
             "Resource": [
                 "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<AWS-IAM-ROLE-NAME>"
             ],
             "Effect": "Allow"
         }
       ]
    }
    

    Note

    If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.

  8. Attach the IAM policy to the IAM role.

    On the IAM role’s Permissions tab, attach the IAM policy that you just created.
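
If you prefer to generate the policy documents rather than edit the JSON by hand, the following Python sketch shows one way to fill in the templates from this step. It only builds the JSON and does not call AWS. The function names are hypothetical, and the account IDs, role name, and bucket name in the usage lines are placeholders. The KMS statement is added only when a key is supplied, which mirrors the instruction to remove the KMS section when encryption is disabled.

```python
import json

# Static Databricks role ARN copied from the trust policy in this step. Do not modify it.
UC_MASTER_ROLE_ARN = (
    "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
)


def role_arn(aws_account_id, role_name):
    """ARN of your own IAM role, used for the self-assuming statements."""
    return f"arn:aws:iam::{aws_account_id}:role/{role_name}"


def trust_policy(databricks_account_id, self_arn=None):
    """Trust policy. Pass self_arn only after the role exists (self-assuming)."""
    principals = [UC_MASTER_ROLE_ARN]
    if self_arn:
        principals.append(self_arn)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": principals},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": databricks_account_id}},
        }],
    }


def access_policy(bucket, self_arn, kms_key=None):
    """Permissions policy from the sample above; the KMS statement is optional."""
    statements = [{
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                   "s3:ListBucket", "s3:GetBucketLocation"],
        "Resource": [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"],
        "Effect": "Allow",
    }]
    if kms_key:
        statements.append({
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*"],
            "Resource": [f"arn:aws:kms:{kms_key}"],
            "Effect": "Allow",
        })
    statements.append({
        "Action": ["sts:AssumeRole"],
        "Resource": [self_arn],
        "Effect": "Allow",
    })
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder values for illustration only.
arn = role_arn("123456789012", "uc-metastore-role")
print(json.dumps(trust_policy("<DATABRICKS-ACCOUNT-ID>", arn), indent=2))
print(json.dumps(access_policy("my-uc-metastore-root", arn), indent=2))
```

Calling trust_policy without self_arn produces the temporary policy that you paste when you first create the role. Calling it with self_arn produces the modified, self-assuming version.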

Step 3: Create the metastore and attach a workspace

Each Databricks region requires its own Unity Catalog metastore, so you create one metastore for each region in which your organization operates. You can link each of these regional metastores to any number of workspaces in that region. Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces. You can access data in other metastores using Delta Sharing.

If you chose to create metastore-level storage, the metastore will use the S3 bucket and IAM role that you created in the previous steps.

To create a metastore:

  1. Log in to the Databricks account console.

  2. Click Catalog.

  3. Click Create Metastore.

  4. Enter the following:

    • A name for the metastore.

    • The region where you want to deploy the metastore.

      This must be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the storage bucket you created earlier.

    • (Optional) The S3 bucket path (you can omit s3://) and IAM role name, for the bucket you created in Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS and the role you created in Step 2 (Optional): Create an IAM role to access the storage location.

  5. Click Create.

  6. When prompted, select workspaces to link to the metastore.

    For details, see Enable a workspace for Unity Catalog.

  7. Transfer the metastore admin role to a group.

    The user who creates a metastore is its owner, also called the metastore admin. The metastore admin can create top-level objects in the metastore such as catalogs and can manage access to tables and other objects. Databricks recommends that you reassign the metastore admin role to a group. See Assign a metastore admin.

  8. Enable Databricks management of uploads to managed volumes.

    Databricks uses cross-origin resource sharing (CORS) to upload data to managed volumes in Unity Catalog. See Configure Unity Catalog storage account for CORS.
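
Before you fill in the Create Metastore form, you can check the inputs against the region requirements above. The following Python sketch is one possible pre-flight check. The function names, workspace names, and regions are hypothetical. It only compares values that you supply and does not query Databricks or AWS.

```python
def normalize_bucket_path(path):
    """Return the bucket path without the optional s3:// prefix."""
    prefix = "s3://"
    if path.startswith(prefix):
        return path[len(prefix):]
    return path


def metastore_input_problems(metastore_region, workspace_regions, bucket_region=None):
    """List region mismatches between the metastore, workspaces, and bucket.

    workspace_regions maps a workspace name to its region. bucket_region is
    None when you are not using metastore-level storage.
    """
    problems = []
    for workspace, region in sorted(workspace_regions.items()):
        if region != metastore_region:
            problems.append(
                f"workspace {workspace} is in {region}, not {metastore_region}"
            )
    if bucket_region is not None and bucket_region != metastore_region:
        problems.append(f"bucket is in {bucket_region}, not {metastore_region}")
    return problems


print(normalize_bucket_path("s3://my-uc-metastore-root"))
print(metastore_input_problems(
    "us-west-2",
    {"analytics": "us-west-2", "finance": "us-east-1"},
    bucket_region="us-west-2",
))
```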

Add managed storage to an existing metastore

Metastore-level managed storage is optional, and it is not included for metastores that were created automatically. You might want to add metastore-level storage to your metastore if you prefer a data isolation model that stores data centrally for multiple workspaces. You need metastore-level storage if you want to share notebooks using Delta Sharing or if you are a Databricks partner who uses personal staging locations.

See also Managed storage.

Requirements

  • You must have at least one workspace attached to the Unity Catalog metastore.

  • Databricks permissions required:

    • To create an external location, you must be a metastore admin or a user with the CREATE EXTERNAL LOCATION and CREATE STORAGE CREDENTIAL privileges.

    • To add the storage location to the metastore definition, you must be an account admin.

  • AWS permissions required: the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships.

Step 1: Create the storage location

Follow the instructions in Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS to create a dedicated S3 bucket in an AWS account in the same region as your metastore.

Step 2: Create an external location in Unity Catalog

In this step, you create an external location in Unity Catalog that represents the bucket that you just created.

  1. Open a workspace that is attached to the metastore.

  2. Click Catalog to open Catalog Explorer.

  3. Click the + Add button and select Add an external location.

  4. On the Create a new external location dialog, click AWS Quickstart (Recommended) and click Next.

    The AWS Quickstart configures the external location and creates a storage credential for you. If you choose to use the Manual option, you must manually create an IAM role that gives access to the S3 bucket and create the storage credential in Databricks yourself.

  5. On the Create external location with Quickstart dialog, enter the path to the S3 bucket in the Bucket Name field.

  6. Click Generate new token to generate the personal access token that you will use to authenticate between Databricks and your AWS account.

  7. Copy the token and click Launch in Quickstart.

  8. In the AWS CloudFormation template that launches (labeled Quick create stack), paste the token into the Databricks Account Credentials field.

  9. Accept the terms at the bottom of the page (I acknowledge that AWS CloudFormation might create IAM resources with custom names).

  10. Click Create stack.

    It may take a few minutes for the CloudFormation template to finish creating the external location object in Databricks.

  11. Return to your Databricks workspace and go to the External locations pane in Catalog Explorer.

    In the left pane of Catalog Explorer, scroll down and click External Data > External Locations.

  12. Confirm that a new external location has been created.

    Automatically generated external locations use the naming syntax db_s3_external_databricks-S3-ingest-<id>.

  13. Grant yourself the CREATE MANAGED STORAGE privilege on the external location.

    1. Click the external location name to open the details pane.

    2. On the Permissions tab, click Grant.

    3. On the Grant on <external location> dialog, select yourself in the Principals field and select CREATE MANAGED STORAGE.

    4. Click Grant.
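
The grant in step 13 can also be expressed as a SQL statement. The following Python sketch builds that statement as a string, assuming the standard Unity Catalog GRANT syntax for external locations. The function names, the location suffix abc123, and the principal are hypothetical. Running the statement (for example, from a notebook or the SQL editor) is left to you. Backticks are needed because the generated location name contains hyphens.

```python
def quote_identifier(name):
    """Wrap a name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def grant_managed_storage_sql(external_location, principal):
    """Build the GRANT statement for CREATE MANAGED STORAGE (hypothetical helper)."""
    return (
        "GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION "
        f"{quote_identifier(external_location)} TO {quote_identifier(principal)}"
    )


print(grant_managed_storage_sql(
    "db_s3_external_databricks-S3-ingest-abc123",  # placeholder location name
    "someone@example.com",                         # placeholder principal
))
```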

Step 3: Add the storage location to the metastore

After you have created an external location that represents the metastore storage bucket, you can add it to the metastore.

  1. As an account admin, log in to the account console.

  2. Click Catalog.

  3. Click the metastore name.

  4. Confirm that you are the Metastore Admin.

    If you are not, click Edit and assign yourself as the metastore admin. You can unassign yourself when you are done with this procedure.

  5. On the Configuration tab, next to S3 bucket path, click Set.

  6. On the Set metastore root dialog, enter the S3 bucket path that you used to create the external location, and click Update.

    You cannot modify this path once you set it.

Delete a metastore

If you are closing your Databricks account or have another reason to delete access to data managed by your Unity Catalog metastore, you can delete the metastore.

Warning

All objects managed by the metastore will become inaccessible using Databricks workspaces. This action cannot be undone.

Managed table data and metadata will be auto-deleted after 30 days. External table data in your cloud storage is not affected by metastore deletion.

To delete a metastore:

  1. As a metastore admin, log in to the account console.

  2. Click Catalog.

  3. Click the metastore name.

  4. On the Configuration tab, click the three-button menu at the far upper right and select Delete.

  5. On the confirmation dialog, enter the name of the metastore and click Delete.