Automate Unity Catalog setup using Terraform

You can automate Unity Catalog setup by using the Databricks Terraform provider. This article shows one approach to deploying an end-to-end Unity Catalog implementation. If you already have some Unity Catalog infrastructure components in place, you can also use this article to deploy additional Unity Catalog infrastructure components as needed.

For more information, see Deploying pre-requisite resources and enabling Unity Catalog in the Databricks Terraform provider documentation.

Requirements

To automate Unity Catalog setup using Terraform, you must have the following:

  • A Databricks account on the Premium plan or above.

  • In AWS, the ability to create Amazon S3 buckets, AWS IAM roles, AWS IAM policies, and cross-account trust relationships.

  • At least one Databricks workspace that you want to use with Unity Catalog. See Create and manage workspaces.

To use the Databricks Terraform provider to configure a metastore for Unity Catalog, storage for the metastore, any external storage, and all of their related access credentials, you must have the following:

  • An AWS account.

  • A Databricks on AWS account.

  • An account-level admin user in your Databricks account.

  • The Terraform CLI. See Download Terraform on the Terraform website.

  • The following seven environment variables:

    • DATABRICKS_USERNAME, set to the value of your Databricks account-level admin username.

    • DATABRICKS_PASSWORD, set to the value of the password for your Databricks account-level admin user.

    • DATABRICKS_ACCOUNT_ID, set to the value of the ID of your Databricks account. You can find this value in the upper-right corner of your Databricks account console.

    • TF_VAR_databricks_account_id, also set to the value of the ID of your Databricks account.

    • AWS_ACCESS_KEY_ID, set to the value of your AWS user’s access key ID. See Programmatic access in the AWS General Reference.

    • AWS_SECRET_ACCESS_KEY, set to the value of your AWS user’s secret access key. See Programmatic access in the AWS General Reference.

    • AWS_REGION, set to the value of the AWS Region code for your Databricks account. See Regional endpoints in the AWS General Reference.

    Note

    As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.

    To set these environment variables, see your operating system’s documentation.
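As an illustration, on Linux or macOS you might set these variables with export statements in your shell. Every value shown below is a placeholder; substitute your own Databricks and AWS credentials:

```shell
# All values are placeholders; replace them with your own credentials.
export DATABRICKS_USERNAME="someone@example.com"
export DATABRICKS_PASSWORD="example-password"
export DATABRICKS_ACCOUNT_ID="00000000-0000-0000-0000-000000000000"

# Terraform reads TF_VAR_-prefixed environment variables as input
# variables, so this mirrors the account ID for the configurations below.
export TF_VAR_databricks_account_id="$DATABRICKS_ACCOUNT_ID"

export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_REGION="us-west-2"
```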

To use the Databricks Terraform provider to configure all other Unity Catalog infrastructure components, you must have the following:

  • A Databricks workspace.

  • On your local development machine, you must have:

    • The Terraform CLI. See Download Terraform on the Terraform website.

    • One of the following:

      • The Databricks command-line interface (Databricks CLI), configured with your Databricks workspace instance URL (for example, https://dbc-1234567890123456.cloud.databricks.com) and your Databricks personal access token by running databricks configure --token. See Set up the CLI and Set up authentication.

        Note

        As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.

      • The following two environment variables:

        • DATABRICKS_HOST, set to the value of your Databricks workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com.

        • DATABRICKS_TOKEN, set to the value of your Databricks personal access token.

        To set these environment variables, see your operating system’s documentation.

Configure Terraform authentication

This section shows how to configure Terraform authentication to deploy end-to-end Unity Catalog infrastructure. See also Provider initialization.

To configure Terraform authentication to deploy end-to-end Unity Catalog infrastructure, create a file named auth.tf.

The code that you run depends on your authentication method.

To use a Databricks CLI connection profile for workspace authentication, use the following code:

variable "databricks_connection_profile" {}

terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {}

# Use Databricks CLI authentication.
provider "databricks" {
  profile = var.databricks_connection_profile
}

# Generate a random string as the prefix for AWS resources, to ensure uniqueness.
resource "random_string" "naming" {
  special = false
  upper   = false
  length  = 6
}

locals {
  prefix = "demo${random_string.naming.result}"
  tags   = {}
}

To use environment variables for workspace authentication instead, use the following code:

terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {}

# Use environment variables for authentication.
provider "databricks" {}

# Generate a random string as the prefix for AWS resources, to ensure uniqueness.
resource "random_string" "naming" {
  special = false
  upper   = false
  length  = 6
}

locals {
  prefix = "demo${random_string.naming.result}"
  tags   = {}
}

To use a Databricks CLI connection profile for workspace authentication, also create a file named auth.auto.tfvars with the following configuration code, and replace the Databricks CLI connection profile name as needed. This enables you to reuse auth.tf in other projects without changing this value in the auth.tf file itself.

databricks_connection_profile = "DEFAULT"
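The profile name refers to an entry in your ~/.databrickscfg file, which running databricks configure --token creates. A DEFAULT profile looks similar to the following; the host and token values are placeholders:

```
[DEFAULT]
host  = https://dbc-1234567890123456.cloud.databricks.com
token = <your-personal-access-token>
```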

Configure storage for a metastore

This section shows how to configure the deployment of root storage for a metastore. This storage consists of an Amazon S3 bucket along with an IAM role that gives Unity Catalog permissions to access and manage data in the bucket. See also aws_s3_bucket, aws_s3_bucket_public_access_block, aws_iam_policy_document, aws_iam_policy, and aws_iam_role.

To configure the metastore storage deployment, create a file named metastore-storage.tf with the following configuration code:

variable "metastore_storage_label" {}
variable "databricks_account_id" {}

resource "aws_s3_bucket" "metastore" {
  bucket = "${local.prefix}-${var.metastore_storage_label}"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(local.tags, {
    Name = "${local.prefix}-${var.metastore_storage_label}"
  })
}

resource "aws_s3_bucket_public_access_block" "metastore" {
  bucket                  = aws_s3_bucket.metastore.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
  depends_on              = [aws_s3_bucket.metastore]
}

data "aws_iam_policy_document" "passrole_for_unity_catalog" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"]
      type        = "AWS"
    }
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [var.databricks_account_id]
    }
  }
}

resource "aws_iam_policy" "unity_metastore" {
  policy = jsonencode({
    Version = "2012-10-17"
    Id      = "${local.prefix}-databricks-unity-metastore"
    Statement = [
      {
        "Action" : [
          "s3:GetObject",
          "s3:GetObjectVersion",
          "s3:PutObject",
          "s3:PutObjectAcl",
          "s3:DeleteObject",
          "s3:ListBucket",
          "s3:GetBucketLocation"
        ],
        "Resource" : [
          aws_s3_bucket.metastore.arn,
          "${aws_s3_bucket.metastore.arn}/*"
        ],
        "Effect" : "Allow"
      }
    ]
  })
  tags = merge(local.tags, {
    Name = "${local.prefix}-unity-catalog IAM policy"
  })
}

# Optional: required only if you plan to use the Databricks sample datasets
# (https://docs.databricks.com/data/databricks-datasets.html).
resource "aws_iam_policy" "sample_data" {
  policy = jsonencode({
    Version = "2012-10-17"
    Id      = "${local.prefix}-databricks-sample-data"
    Statement = [
      {
        "Action" : [
          "s3:GetObject",
          "s3:GetObjectVersion",
          "s3:ListBucket",
          "s3:GetBucketLocation"
        ],
        "Resource" : [
          "arn:aws:s3:::databricks-datasets-oregon/*",
          "arn:aws:s3:::databricks-datasets-oregon"
        ],
        "Effect" : "Allow"
      }
    ]
  })
  tags = merge(local.tags, {
    Name = "${local.prefix}-unity-catalog IAM policy"
  })
}

resource "aws_iam_role" "metastore_data_access" {
  name                = "${local.prefix}-uc-access"
  assume_role_policy  = data.aws_iam_policy_document.passrole_for_unity_catalog.json
  managed_policy_arns = [aws_iam_policy.unity_metastore.arn, aws_iam_policy.sample_data.arn]
  tags = merge(local.tags, {
    Name = "${local.prefix}-unity-catalog IAM role"
  })
}

Also create a file named metastore-storage.auto.tfvars with the following configuration code, and replace the metastore storage label as needed. This enables you to reuse metastore-storage.tf in other projects without changing this value in the metastore-storage.tf file itself.

metastore_storage_label = "metastore"

Configure a metastore

This section shows how to configure the deployment of a metastore into an account. See also databricks_metastore, databricks_metastore_data_access, and databricks_metastore_assignment.

To configure the metastore deployment, create a file named metastore.tf with the following configuration code:

variable "metastore_name" {}
variable "metastore_label" {}
variable "default_metastore_workspace_id" {}
variable "default_metastore_default_catalog_name" {}

resource "databricks_metastore" "metastore" {
  name          = var.metastore_name
  storage_root  = "s3://${aws_s3_bucket.metastore.id}/${var.metastore_label}"
  force_destroy = true
}

resource "databricks_metastore_data_access" "metastore_data_access" {
  depends_on   = [ databricks_metastore.metastore ]
  metastore_id = databricks_metastore.metastore.id
  name         = aws_iam_role.metastore_data_access.name
  aws_iam_role { role_arn = aws_iam_role.metastore_data_access.arn }
  is_default   = true
}

resource "databricks_metastore_assignment" "default_metastore" {
  depends_on           = [ databricks_metastore_data_access.metastore_data_access ]
  workspace_id         = var.default_metastore_workspace_id
  metastore_id         = databricks_metastore.metastore.id
  default_catalog_name = var.default_metastore_default_catalog_name
}

Also create a file named metastore.auto.tfvars with the following configuration code, and replace the values as needed. This enables you to reuse metastore.tf in other projects without changing these values in the metastore.tf file itself.

metastore_name                         = "my_metastore"
metastore_label                        = "metastore"
default_metastore_workspace_id         = "<workspace-id>"
default_metastore_default_catalog_name = "my_catalog"

Configure a catalog

This section shows how to configure the deployment of a catalog into an existing metastore. See also databricks_catalog.

To configure the catalog deployment, create a file named catalog.tf with the following configuration code:

variable "catalog_name" {}

resource "databricks_catalog" "catalog" {
  depends_on = [ databricks_metastore_assignment.default_metastore ]
  metastore_id = databricks_metastore.metastore.id
  name         = var.catalog_name
}

Also create a file named catalog.auto.tfvars with the following configuration code, and replace the catalog’s name as needed. This enables you to reuse catalog.tf in other projects without changing this value in the catalog.tf file itself.

catalog_name = "my_catalog"

If you have an existing metastore that you want to use, replace databricks_metastore.metastore.id with the existing metastore’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/metastores operation in the Unity Catalog API 2.1 or run the databricks unity-catalog metastores list command in the Unity Catalog CLI.
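For example, you might introduce an input variable for the existing metastore’s ID instead of referencing the resource. This is a sketch, not part of the configuration above, and the variable name existing_metastore_id is a hypothetical choice:

```
variable "existing_metastore_id" {}

resource "databricks_catalog" "catalog" {
  # The depends_on line is removed because the metastore is not
  # managed by this configuration.
  metastore_id = var.existing_metastore_id
  name         = var.catalog_name
}
```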

Configure access grants for a catalog

This section shows how to configure the deployment of access grants for the preceding catalog to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.

To configure the catalog access grants deployment, create a file named catalog-grants.tf with the following configuration code:

variable "catalog_admins_display_name" {}
variable "catalog_privileges" {}

data "databricks_group" "catalog_admins" {
  display_name = var.catalog_admins_display_name
}

resource "databricks_grants" "catalog" {
  depends_on = [ databricks_catalog.catalog ]
  catalog    = databricks_catalog.catalog.name
  grant {
    principal  = data.databricks_group.catalog_admins.display_name
    privileges = var.catalog_privileges
  }
}

Also create a file named catalog-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse catalog-grants.tf in other projects without changing these values in the catalog-grants.tf file itself.

catalog_admins_display_name = "admins"
catalog_privileges          = [ "ALL PRIVILEGES" ]

If you have an existing catalog that you want to use, remove the line that starts with depends_on, and replace databricks_catalog.catalog.name with the existing catalog’s name.

To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.

Configure a schema

This section shows how to configure the deployment of a schema into the preceding catalog. See also databricks_schema.

To configure the schema deployment, create a file named schema.tf with the following configuration code:

variable "schema_name" {}

resource "databricks_schema" "schema" {
  depends_on   = [ databricks_catalog.catalog ]
  catalog_name = databricks_catalog.catalog.name
  name         = var.schema_name
}

Also create a file named schema.auto.tfvars with the following configuration code, and replace the schema’s name as needed. This enables you to reuse schema.tf in other projects without changing this value in the schema.tf file itself.

schema_name = "my_schema"

If you have an existing catalog that you want to use, remove the line that starts with depends_on, and replace databricks_catalog.catalog.name with the existing schema’s catalog name.

Configure access grants for a schema

This section shows how to configure the deployment of access grants for the preceding schema to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.

To configure the schema access grants deployment, create a file named schema-grants.tf with the following configuration code:

variable "schema_admins_display_name" {}
variable "schema_privileges" {}

data "databricks_group" "schema_admins" {
  display_name = var.schema_admins_display_name
}

resource "databricks_grants" "schema" {
  depends_on = [ databricks_schema.schema ]
  schema = "${databricks_catalog.catalog.name}.${databricks_schema.schema.name}"
  grant {
    principal  = data.databricks_group.schema_admins.display_name
    privileges = var.schema_privileges
  }
}

Also create a file named schema-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse schema-grants.tf in other projects without changing these values in the schema-grants.tf file itself.

schema_admins_display_name = "admins"
schema_privileges          = [ "ALL PRIVILEGES" ]

If you have an existing schema that you want to use, remove the line that starts with depends_on, replace ${databricks_catalog.catalog.name} with the existing schema’s catalog name, and replace ${databricks_schema.schema.name} with the existing schema’s name.
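For example, assuming an existing catalog named my_catalog containing an existing schema named my_schema (both placeholder names), the grants resource might be sketched as:

```
resource "databricks_grants" "schema" {
  # "my_catalog" and "my_schema" are placeholders for the names of
  # your existing catalog and schema.
  schema = "my_catalog.my_schema"
  grant {
    principal  = data.databricks_group.schema_admins.display_name
    privileges = var.schema_privileges
  }
}
```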

Configure external storage

This section shows how to configure the deployment of external storage. This external storage consists of an Amazon S3 bucket along with an IAM role that gives Unity Catalog permission to access and manage data in the bucket. See also aws_s3_bucket, aws_s3_bucket_public_access_block, aws_iam_policy, aws_iam_role, databricks_storage_credential, and databricks_external_location.

To configure the external storage deployment, create a file named external-storage.tf with the following configuration code.

variable "external_storage_label" {}
variable "external_storage_location_label" {}

resource "aws_s3_bucket" "external" {
  bucket = "${local.prefix}-${var.external_storage_label}"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(local.tags, {
    Name = "${local.prefix}-${var.external_storage_label}"
  })
}

resource "aws_s3_bucket_public_access_block" "external" {
  bucket             = aws_s3_bucket.external.id
  ignore_public_acls = true
  depends_on         = [aws_s3_bucket.external]
}

resource "aws_iam_policy" "external_data_access" {
  policy = jsonencode({
    Version = "2012-10-17"
    Id      = "${aws_s3_bucket.external.id}-access"
    Statement = [
      {
        "Action" : [
          "s3:GetObject",
          "s3:GetObjectVersion",
          "s3:PutObject",
          "s3:PutObjectAcl",
          "s3:DeleteObject",
          "s3:ListBucket",
          "s3:GetBucketLocation"
        ],
        "Resource" : [
          aws_s3_bucket.external.arn,
          "${aws_s3_bucket.external.arn}/*"
        ],
        "Effect" : "Allow"
      }
    ]
  })
  tags = merge(local.tags, {
    Name = "${local.prefix}-unity-catalog ${var.external_storage_label} access IAM policy"
  })
}

resource "aws_iam_role" "external_data_access" {
  name                = "${local.prefix}-external-access"
  assume_role_policy  = data.aws_iam_policy_document.passrole_for_unity_catalog.json
  managed_policy_arns = [aws_iam_policy.external_data_access.arn]
  tags = merge(local.tags, {
    Name = "${local.prefix}-unity-catalog ${var.external_storage_label} access IAM role"
  })
}

resource "databricks_storage_credential" "external" {
  name     = aws_iam_role.external_data_access.name
  aws_iam_role {
    role_arn = aws_iam_role.external_data_access.arn
  }
}

resource "databricks_external_location" "some" {
  name            = "${var.external_storage_label}"
  url             = "s3://${local.prefix}-${var.external_storage_label}/${var.external_storage_location_label}"
  credential_name = databricks_storage_credential.external.id
}

Also create a file named external-storage.auto.tfvars with the following configuration code, and replace the values as needed. This enables you to reuse external-storage.tf in other projects without changing these values in the external-storage.tf file itself.

external_storage_label          = "external"
external_storage_location_label = "some"

Configure access grants for external storage

This section shows how to configure the deployment of access grants for the preceding external storage to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.

To configure the external access grants deployment, create a file named external-storage-grants.tf with the following configuration code. Change the specified group name, add additional groups, and specify additional grant privileges as needed. For a complete list of available grants, see databricks_grants.

variable "external_storage_admins_display_name" {}
variable "external_storage_privileges" {}

data "databricks_group" "external_storage_admins" {
  display_name = var.external_storage_admins_display_name
}

resource "databricks_grants" "external_storage_credential" {
  storage_credential = databricks_storage_credential.external.id
  grant {
    principal  = var.external_storage_admins_display_name
    privileges = var.external_storage_privileges
  }
}

resource "databricks_grants" "external_storage" {
  external_location = databricks_external_location.some.id
  grant {
    principal  = var.external_storage_admins_display_name
    privileges = var.external_storage_privileges
  }
}

Also create a file named external-storage-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse external-storage-grants.tf in other projects without changing these values in the external-storage-grants.tf file itself.

external_storage_admins_display_name = "admins"
external_storage_privileges          = [ "ALL PRIVILEGES" ]

If you have an existing external storage credential that you want to use, replace databricks_storage_credential.external.id with the existing external storage credential’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/storage-credentials operation in the Unity Catalog API 2.1 or run the databricks unity-catalog storage-credentials list command in the Unity Catalog CLI.

If you have an existing external storage resource that you want to use, replace databricks_external_location.some.id with the existing external storage resource’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/external-locations operation in the Unity Catalog API 2.1 or run the databricks unity-catalog external-locations list command in the Unity Catalog CLI.
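For example, with an existing storage credential and an existing external location, the two grants resources might be sketched as follows. The IDs shown are placeholders for your own values:

```
resource "databricks_grants" "external_storage_credential" {
  # Placeholder: the programmatic ID of your existing storage credential.
  storage_credential = "<existing-storage-credential-id>"
  grant {
    principal  = var.external_storage_admins_display_name
    privileges = var.external_storage_privileges
  }
}

resource "databricks_grants" "external_storage" {
  # Placeholder: the programmatic ID of your existing external location.
  external_location = "<existing-external-location-id>"
  grant {
    principal  = var.external_storage_admins_display_name
    privileges = var.external_storage_privileges
  }
}
```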

Configure a managed table

Configuring managed tables with the Databricks Terraform provider is not supported. To configure managed tables, see Create tables and Create your first table.

Configure an external table

Configuring external tables with the Databricks Terraform provider is not supported. To configure external tables, see Create tables.

Configure access grants for a managed or external table

This section shows how to configure the deployment of access grants for an existing managed table or external table to existing groups. See also databricks_grants. To use this configuration, you must know the table’s name as well as the names of the existing groups that you want to grant access to.

To configure the table access grants deployment, create a file named table-grants.tf with the following configuration code:

variable "table_admins_display_name" {}
variable "table_catalog_name" {}
variable "table_schema_name" {}
variable "table_table_name" {}
variable "table_privileges" {}

data "databricks_group" "table_admins" {
  display_name = var.table_admins_display_name
}

resource "databricks_grants" "table" {
  table      = "${var.table_catalog_name}.${var.table_schema_name}.${var.table_table_name}"
  grant {
    principal  = data.databricks_group.table_admins.display_name
    privileges = var.table_privileges
  }
}

Also create a file named table-grants.auto.tfvars with the following configuration code. Change the specified catalog name, schema name, table name, group name, and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse table-grants.tf in other projects without changing these values in the table-grants.tf file itself.

table_admins_display_name = "admins"
table_catalog_name        = "my_catalog"
table_schema_name         = "my_schema"
table_table_name          = "my_table"
table_privileges          = [ "ALL PRIVILEGES" ]

To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.

Configure a view of a managed table

Configuring views of managed tables with the Databricks Terraform provider is not supported. To configure views, see Create views.

Configure access grants for a view

This section shows how to configure the deployment of access grants for an existing view of a managed table to existing groups. See also databricks_grants. To use this configuration, you must know the managed table’s name as well as the names of the existing groups that you want to grant access to.

To configure the view’s access grants deployment, create a file named view-grants.tf with the following configuration code:

variable "view_admins_display_name" {}
variable "view_catalog_name" {}
variable "view_schema_name" {}
variable "view_table_name" {}
variable "view_privileges" {}

data "databricks_group" "view_admins" {
  display_name = var.view_admins_display_name
}

resource "databricks_grants" "managed_table_view" {
  table = "${var.view_catalog_name}.${var.view_schema_name}.${var.view_table_name}"
  grant {
    principal  = data.databricks_group.view_admins.display_name
    privileges = var.view_privileges
  }
}

Also create a file named view-grants.auto.tfvars with the following configuration code. Change the specified catalog name, schema name, table name, group name, and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse view-grants.tf in other projects without changing these values in the view-grants.tf file itself.

view_admins_display_name = "admins"
view_catalog_name        = "my_catalog"
view_schema_name         = "my_schema"
view_table_name          = "my_table"
view_privileges          = [ "ALL PRIVILEGES" ]

To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.

Configure a cluster that works with Unity Catalog

This section shows how to configure the deployment of a cluster that works with Unity Catalog.

To configure the cluster, create a file named cluster.tf with the following configuration code. This configuration deploys a cluster with the minimum amount of compute resources and the latest Databricks Runtime Long Term Support (LTS) version. See also databricks_cluster.

variable "cluster_name" {}
variable "cluster_autotermination_minutes" {}
variable "cluster_num_workers" {}
variable "cluster_data_security_mode" {}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
  data_security_mode      = var.cluster_data_security_mode
}

output "cluster_url" {
  value = databricks_cluster.this.url
}

Also create a file named cluster.auto.tfvars with the following configuration variables, specifying the cluster’s name, the number of minutes before automatically terminating due to inactivity, the number of workers, and the data security mode for Unity Catalog. This enables you to reuse cluster.tf in other projects without changing these values in the cluster.tf file itself.

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
cluster_data_security_mode      = "SINGLE_USER"
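SINGLE_USER restricts the cluster to a single user. If you instead want a cluster that multiple users can share with Unity Catalog, you might set the mode to USER_ISOLATION; see databricks_cluster for the full list of supported modes:

```
cluster_data_security_mode = "USER_ISOLATION"
```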

Validate, plan, deploy, or destroy the resources

  • To validate the syntax of the Terraform configurations without deploying them, run the terraform validate command.

  • To show what actions Terraform would take to deploy the configurations, run the terraform plan command. This command does not actually deploy the configurations.

  • To deploy the configurations, run the terraform apply command.

  • To delete the deployed resources, run the terraform destroy command.
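Putting it together, a typical run from the directory that contains these .tf files might look like the following. Note that terraform init must be run once first to download the Databricks and AWS providers before the other commands will work:

```shell
terraform init      # download the databricks and aws providers
terraform validate  # check configuration syntax
terraform plan      # preview the changes without applying them
terraform apply     # deploy the resources
terraform destroy   # delete the deployed resources when finished
```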