Automate Unity Catalog setup using Terraform
You can automate Unity Catalog setup by using the Databricks Terraform provider. This article shows one approach to deploying an end-to-end Unity Catalog implementation. If you already have some Unity Catalog infrastructure components in place, you can also use this article to deploy additional Unity Catalog infrastructure components as needed.
For more information, see Deploying pre-requisite resources and enabling Unity Catalog in the Databricks Terraform provider documentation.
Requirements
To automate Unity Catalog setup using Terraform, you must have the following:
Your Databricks account must be on the Premium plan or above.
In AWS, you must have the ability to create Amazon S3 buckets, AWS IAM roles, AWS IAM policies, and cross-account trust relationships.
You must have at least one Databricks workspace that you want to use with Unity Catalog. See Create a workspace using the account console.
To use the Databricks Terraform provider to configure a metastore for Unity Catalog, storage for the metastore, any external storage, and all of their related access credentials, you must have the following:
An AWS account.
A Databricks on AWS account.
A service principal that has the account admin role in your Databricks account.
The Terraform CLI. See Download Terraform on the Terraform website.
The following seven Databricks environment variables:
DATABRICKS_CLIENT_ID, set to the value of the client ID, also known as the application ID, of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_CLIENT_SECRET, set to the value of the client secret of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_ACCOUNT_ID, set to the value of the ID of your Databricks account. You can find this value in the corner of your Databricks account console.
TF_VAR_databricks_account_id, also set to the value of the ID of your Databricks account.
AWS_ACCESS_KEY_ID, set to the value of your AWS user’s access key ID. See Programmatic access in the AWS General Reference.
AWS_SECRET_ACCESS_KEY, set to the value of your AWS user’s secret access key. See Programmatic access in the AWS General Reference.
AWS_REGION, set to the value of the AWS Region code for your Databricks account. See Regional endpoints in the AWS General Reference.
Note
An account admin’s username and password can also be used to authenticate to the Terraform provider. Databricks strongly recommends that you use OAuth for service principals. To use a username and password, you must have the following environment variables:
DATABRICKS_USERNAME, set to the value of your Databricks account-level admin username.
DATABRICKS_PASSWORD, set to the value of the password for your Databricks account-level admin user.
To set these environment variables, see your operating system’s documentation.
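For example, on Linux or macOS you might export the seven variables in your shell before running Terraform. The values shown here are placeholders only:
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-client-secret>"
export DATABRICKS_ACCOUNT_ID="<databricks-account-id>"
export TF_VAR_databricks_account_id="<databricks-account-id>"
export AWS_ACCESS_KEY_ID="<aws-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<aws-secret-access-key>"
export AWS_REGION="<aws-region-code>"
On Windows, set the variables by using the set or setx command instead.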
To use the Databricks Terraform provider to configure all other Unity Catalog infrastructure components, you must have the following:
A Databricks workspace.
On your local development machine, you must have:
The Terraform CLI. See Download Terraform on the Terraform website.
One of the following:
Databricks CLI version 0.205 or above, configured with your Databricks personal access token by running databricks configure --host <workspace-url> --profile <some-unique-profile-name>. See Install or update the Databricks CLI and Databricks personal access token authentication.
Note
As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
The following Databricks environment variables:
DATABRICKS_HOST, set to the value of your Databricks workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com
DATABRICKS_CLIENT_ID, set to the value of the client ID, also known as the application ID, of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_CLIENT_SECRET, set to the value of the client secret of the service principal. See Authentication using OAuth for service principals.
Alternatively, you can use a personal access token instead of a service principal’s client ID and client secret:
DATABRICKS_TOKEN, set to the value of your Databricks personal access token. See also Manage personal access tokens.
To set these environment variables, see your operating system’s documentation.
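For example, on Linux or macOS you might set these variables as follows, replacing the placeholder values with your own:
export DATABRICKS_HOST="https://dbc-1234567890123456.cloud.databricks.com"
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-client-secret>"
Or, to use a personal access token instead:
export DATABRICKS_TOKEN="<personal-access-token>"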
Configure Terraform authentication
This section shows how to configure Terraform authentication to deploy end-to-end Unity Catalog infrastructure. See also Provider initialization.
To configure Terraform authentication to deploy end-to-end Unity Catalog infrastructure, create a file named auth.tf.
The code that you run depends on your authentication method.
To use a Databricks CLI connection profile for workspace authentication, use the following code:
variable "databricks_connection_profile" {}
terraform {
required_providers {
databricks = {
source = "databricks/databricks"
}
aws = {
source = "hashicorp/aws"
}
}
}
provider "aws" {}
# Use Databricks CLI authentication.
provider "databricks" {
profile = var.databricks_connection_profile
}
# Generate a random string as the prefix for AWS resources, to ensure uniqueness.
resource "random_string" "naming" {
special = false
upper = false
length = 6
}
locals {
prefix = "demo${random_string.naming.result}"
tags = {}
}
To use environment variables for workspace authentication instead, use the following code:
terraform {
required_providers {
databricks = {
source = "databricks/databricks"
}
aws = {
source = "hashicorp/aws"
}
}
}
provider "aws" {}
# Use environment variables for authentication.
provider "databricks" {}
# Generate a random string as the prefix for AWS resources, to ensure uniqueness.
resource "random_string" "naming" {
special = false
upper = false
length = 6
}
locals {
prefix = "demo${random_string.naming.result}"
tags = {}
}
To use a Databricks CLI connection profile for workspace authentication, also create a file named auth.auto.tfvars with the following configuration code, and replace the Databricks CLI connection profile name as needed. This enables you to reuse auth.tf in other projects without changing this value in the auth.tf file itself.
databricks_connection_profile = "DEFAULT"
Configure storage for a metastore
This section shows how to configure the deployment of root storage for a metastore. This storage consists of an Amazon S3 bucket along with an IAM role that gives Unity Catalog permissions to access and manage data in the bucket. See also aws_s3_bucket, aws_s3_bucket_public_access_block, aws_iam_policy_document, aws_iam_policy, and aws_iam_role.
To configure the metastore storage deployment, create a file named metastore-storage.tf with the following configuration code:
variable "metastore_storage_label" {}
variable "databricks_account_id" {}
resource "aws_s3_bucket" "metastore" {
bucket = "${local.prefix}-${var.metastore_storage_label}"
acl = "private"
versioning {
enabled = false
}
force_destroy = true
tags = merge(local.tags, {
Name = "${local.prefix}-${var.metastore_storage_label}"
})
}
resource "aws_s3_bucket_public_access_block" "metastore" {
bucket = aws_s3_bucket.metastore.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
depends_on = [aws_s3_bucket.metastore]
}
data "aws_caller_identity" "current" {}
data "aws_iam_policy_document" "passrole_for_unity_catalog" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
identifiers = ["arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"]
type = "AWS"
}
condition {
test = "StringEquals"
variable = "sts:ExternalId"
values = [var.databricks_account_id]
}
}
statement {
sid = "ExplicitSelfRoleAssumption"
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
type = "AWS"
}
condition {
test = "ArnLike"
variable = "aws:PrincipalArn"
values = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.prefix}-uc-access"]
}
}
}
resource "aws_iam_policy" "unity_metastore" {
policy = jsonencode({
Version = "2012-10-17"
Id = "${local.prefix}-databricks-unity-metastore"
Statement = [
{
"Action" : [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource" : [
aws_s3_bucket.metastore.arn,
"${aws_s3_bucket.metastore.arn}/*"
],
"Effect" : "Allow"
}
]
})
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog IAM policy"
})
}
// Required only if access to the Databricks sample datasets (https://docs.databricks.com/data/databricks-datasets.html) is needed.
resource "aws_iam_policy" "sample_data" {
policy = jsonencode({
Version = "2012-10-17"
Id = "${local.prefix}-databricks-sample-data"
Statement = [
{
"Action" : [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource" : [
"arn:aws:s3:::databricks-datasets-oregon/*",
"arn:aws:s3:::databricks-datasets-oregon"
],
"Effect" : "Allow"
}
]
})
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog IAM policy"
})
}
resource "aws_iam_role" "metastore_data_access" {
name = "${local.prefix}-uc-access"
assume_role_policy = data.aws_iam_policy_document.passrole_for_unity_catalog.json
managed_policy_arns = [aws_iam_policy.unity_metastore.arn, aws_iam_policy.sample_data.arn]
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog IAM role"
})
}
Also create a file named metastore-storage.auto.tfvars with the following configuration code, and replace the metastore storage label as needed. This enables you to reuse metastore-storage.tf in other projects without changing this value in the metastore-storage.tf file itself.
metastore_storage_label = "metastore"
Configure a metastore
This section shows how to configure the deployment of a metastore into an account. See also databricks_metastore, databricks_metastore_data_access, and databricks_metastore_assignment.
To configure the metastore deployment, create a file named metastore.tf with the following configuration code:
variable "metastore_name" {}
variable "metastore_label" {}
variable "default_metastore_workspace_id" {}
variable "default_metastore_default_catalog_name" {}
resource "databricks_metastore" "metastore" {
name = var.metastore_name
storage_root = "s3://${aws_s3_bucket.metastore.id}/${var.metastore_label}"
force_destroy = true
}
resource "databricks_metastore_data_access" "metastore_data_access" {
depends_on = [ databricks_metastore.metastore ]
metastore_id = databricks_metastore.metastore.id
name = aws_iam_role.metastore_data_access.name
aws_iam_role { role_arn = aws_iam_role.metastore_data_access.arn }
is_default = true
}
resource "databricks_metastore_assignment" "default_metastore" {
depends_on = [ databricks_metastore_data_access.metastore_data_access ]
workspace_id = var.default_metastore_workspace_id
metastore_id = databricks_metastore.metastore.id
default_catalog_name = var.default_metastore_default_catalog_name
}
Also create a file named metastore.auto.tfvars with the following configuration code, and replace the values as needed. This enables you to reuse metastore.tf in other projects without changing these values in the metastore.tf file itself.
metastore_name = "my_metastore"
metastore_label = "metastore"
default_metastore_workspace_id = "<workspace-id>"
default_metastore_default_catalog_name = "my_catalog"
Configure a catalog
This section shows how to configure the deployment of a catalog into an existing metastore. See also databricks_catalog.
To configure the catalog deployment, create a file named catalog.tf with the following configuration code:
variable "catalog_name" {}
resource "databricks_catalog" "catalog" {
depends_on = [ databricks_metastore_assignment.default_metastore ]
metastore_id = databricks_metastore.metastore.id
name = var.catalog_name
}
Also create a file named catalog.auto.tfvars with the following configuration code, and replace the catalog’s name as needed. This enables you to reuse catalog.tf in other projects without changing this value in the catalog.tf file itself.
catalog_name = "my_catalog"
If you have an existing metastore that you want to use, replace databricks_metastore.metastore.id with the existing metastore’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/metastores operation in the Unity Catalog API 2.1 or run the databricks unity-catalog metastores list command in the Unity Catalog CLI.
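For example, the following is a sketch of both approaches; the workspace URL and token are placeholders, and the exact response format depends on your CLI and API versions:
databricks unity-catalog metastores list
curl --request GET "https://<workspace-url>/api/2.1/unity-catalog/metastores" \
  --header "Authorization: Bearer <personal-access-token>"
In either case, copy the metastore’s ID from the response and use it in place of databricks_metastore.metastore.id.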
Configure access grants for a catalog
This section shows how to configure the deployment of access grants for the preceding catalog to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.
To configure the catalog access grants deployment, create a file named catalog-grants.tf with the following configuration code:
variable "catalog_admins_display_name" {}
variable "catalog_privileges" {}
data "databricks_group" "catalog_admins" {
display_name = var.catalog_admins_display_name
}
resource "databricks_grants" "catalog" {
depends_on = [ databricks_catalog.catalog ]
catalog = databricks_catalog.catalog.name
grant {
principal = data.databricks_group.catalog_admins.display_name
privileges = var.catalog_privileges
}
}
Also create a file named catalog-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse catalog-grants.tf in other projects without changing these values in the catalog-grants.tf file itself.
catalog_admins_display_name = "admins"
catalog_privileges = [ "ALL PRIVILEGES" ]
If you have an existing catalog that you want to use, remove the line that starts with depends_on, and replace databricks_catalog.catalog.name with the existing catalog’s name.
To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.
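As a minimal sketch, a group can also be declared directly with the databricks_group resource and then referenced in place of the data source above; the display name here is only an example:
resource "databricks_group" "catalog_admins" {
  display_name = "catalog-admins"
}
You would then set principal = databricks_group.catalog_admins.display_name in the databricks_grants resource.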
Configure a schema
This section shows how to configure the deployment of a schema into the preceding catalog. See also databricks_schema.
To configure the schema deployment, create a file named schema.tf with the following configuration code:
variable "schema_name" {}
resource "databricks_schema" "schema" {
depends_on = [ databricks_catalog.catalog ]
catalog_name = databricks_catalog.catalog.name
name = var.schema_name
}
Also create a file named schema.auto.tfvars with the following configuration code, and replace the schema’s name as needed. This enables you to reuse schema.tf in other projects without changing this value in the schema.tf file itself.
schema_name = "my_schema"
If you have an existing catalog that you want to use, remove the line that starts with depends_on, and replace databricks_catalog.catalog.name with the existing catalog’s name.
Configure access grants for a schema
This section shows how to configure the deployment of access grants for the preceding schema to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.
To configure the schema access grants deployment, create a file named schema-grants.tf with the following configuration code:
variable "schema_admins_display_name" {}
variable "schema_privileges" {}
data "databricks_group" "schema_admins" {
display_name = var.schema_admins_display_name
}
resource "databricks_grants" "schema" {
depends_on = [ databricks_schema.schema ]
schema = "${databricks_catalog.catalog.name}.${databricks_schema.schema.name}"
grant {
principal = data.databricks_group.schema_admins.display_name
privileges = var.schema_privileges
}
}
Also create a file named schema-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse schema-grants.tf in other projects without changing these values in the schema-grants.tf file itself.
schema_admins_display_name = "admins"
schema_privileges = [ "ALL PRIVILEGES" ]
If you have an existing schema that you want to use, remove the line that starts with depends_on, replace ${databricks_catalog.catalog.name} with the existing schema’s catalog name, and replace ${databricks_schema.schema.name} with the existing schema’s name.
Configure external storage
This section shows how to configure deployment of external storage. This external storage consists of an Amazon S3 bucket along with an IAM role that allows Unity Catalog permissions to access and manage data in the bucket. See also aws_s3_bucket, aws_s3_bucket_public_access_block, aws_iam_policy, aws_iam_role, databricks_storage_credential, and databricks_external_location.
To configure the external storage deployment, create a file named external-storage.tf with the following configuration code.
variable "external_storage_label" {}
variable "external_storage_location_label" {}
resource "aws_s3_bucket" "external" {
bucket = "${local.prefix}-${var.external_storage_label}"
acl = "private"
versioning {
enabled = false
}
force_destroy = true
tags = merge(local.tags, {
Name = "${local.prefix}-${var.external_storage_label}"
})
}
resource "aws_s3_bucket_public_access_block" "external" {
bucket = aws_s3_bucket.external.id
ignore_public_acls = true
depends_on = [aws_s3_bucket.external]
}
resource "aws_iam_policy" "external_data_access" {
policy = jsonencode({
Version = "2012-10-17"
Id = "${aws_s3_bucket.external.id}-access"
Statement = [
{
"Action" : [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource" : [
aws_s3_bucket.external.arn,
"${aws_s3_bucket.external.arn}/*"
],
"Effect" : "Allow"
}
]
})
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog ${var.external_storage_label} access IAM policy"
})
}
resource "aws_iam_role" "external_data_access" {
name = "${local.prefix}-external-access"
assume_role_policy = data.aws_iam_policy_document.passrole_for_unity_catalog.json
managed_policy_arns = [aws_iam_policy.external_data_access.arn]
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog ${var.external_storage_label} access IAM role"
})
}
resource "databricks_storage_credential" "external" {
name = aws_iam_role.external_data_access.name
aws_iam_role {
role_arn = aws_iam_role.external_data_access.arn
}
}
resource "databricks_external_location" "some" {
name = "${var.external_storage_label}"
url = "s3://${local.prefix}-${var.external_storage_label}/${var.external_storage_location_label}"
credential_name = databricks_storage_credential.external.id
}
Also create a file named external-storage.auto.tfvars with the following configuration code, and replace the values as needed. This enables you to reuse external-storage.tf in other projects without changing these values in the external-storage.tf file itself.
external_storage_label = "external"
external_storage_location_label = "some"
Configure access grants for external storage
This section shows how to configure the deployment of access grants for the preceding external storage to existing groups. See also databricks_grants. To use this configuration, you must know the names of the existing groups that you want to grant access to.
To configure the external access grants deployment, create a file named external-storage-grants.tf with the following configuration code. Change the specified group name, add additional groups, and specify additional grant privileges as needed. For a complete list of available grants, see databricks_grants.
variable "external_storage_admins_display_name" {}
variable "external_storage_privileges" {}
data "databricks_group" "external_storage_admins" {
display_name = var.external_storage_admins_display_name
}
resource "databricks_grants" "external_storage_credential" {
storage_credential = databricks_storage_credential.external.id
grant {
principal = var.external_storage_admins_display_name
privileges = var.external_storage_privileges
}
}
resource "databricks_grants" "external_storage" {
external_location = databricks_external_location.some.id
grant {
principal = var.external_storage_admins_display_name
privileges = var.external_storage_privileges
}
}
Also create a file named external-storage-grants.auto.tfvars with the following configuration code. Change the specified group name and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse external-storage-grants.tf in other projects without changing these values in the external-storage-grants.tf file itself.
external_storage_admins_display_name = "admins"
external_storage_privileges = [ "ALL PRIVILEGES" ]
If you have an existing external storage credential that you want to use, replace databricks_storage_credential.external.id with the existing external storage credential’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/storage-credentials operation in the Unity Catalog API 2.1 or run the databricks unity-catalog storage-credentials list command in the Unity Catalog CLI.
If you have an existing external storage resource that you want to use, replace databricks_external_location.some.id with the existing external storage resource’s programmatic ID. To get this ID, you can call the GET /api/2.1/unity-catalog/external-locations operation in the Unity Catalog API 2.1 or run the databricks unity-catalog external-locations list command in the Unity Catalog CLI.
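For example, sketches of the two CLI commands mentioned above, which list the existing objects so that you can copy the appropriate ID:
databricks unity-catalog storage-credentials list
databricks unity-catalog external-locations list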
Configure a managed table
Configuring managed tables with the Databricks Terraform provider is not supported. To configure managed tables, see Create tables and Create your first table and manage permissions.
Configure an external table
Configuring external tables with the Databricks Terraform provider is not supported. To configure external tables, see Create tables.
Configure access grants for a managed or external table
This section shows how to configure the deployment of access grants for an existing managed table or external table to existing groups. See also databricks_grants. To use this configuration, you must know the table’s name as well as the names of the existing groups that you want to grant access to.
To configure the table access grants deployment, create a file named table-grants.tf with the following configuration code:
variable "table_admins_display_name" {}
variable "table_catalog_name" {}
variable "table_schema_name" {}
variable "table_table_name" {}
variable "table_privileges" {}
data "databricks_group" "table_admins" {
display_name = var.table_admins_display_name
}
resource "databricks_grants" "table" {
table = "${var.table_catalog_name}.${var.table_schema_name}.${var.table_table_name}"
grant {
principal = data.databricks_group.table_admins.display_name
privileges = var.table_privileges
}
}
Also create a file named table-grants.auto.tfvars with the following configuration code. Change the specified catalog name, schema name, table name, group name, and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse table-grants.tf in other projects without changing these values in the table-grants.tf file itself.
table_admins_display_name = "admins"
table_catalog_name = "my_catalog"
table_schema_name = "my_schema"
table_table_name = "my_table"
table_privileges = [ "ALL PRIVILEGES" ]
To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.
Configure a view of a managed table
Configuring views of managed tables with the Databricks Terraform provider is not supported. To configure views, see Create views.
Configure access grants for a view
This section shows how to configure the deployment of access grants for an existing view of a managed table to existing groups. See also databricks_grants. To use this configuration, you must know the view’s name as well as the names of the existing groups that you want to grant access to.
To configure the view’s access grants deployment, create a file named view-grants.tf with the following configuration code:
variable "view_admins_display_name" {}
variable "view_catalog_name" {}
variable "view_schema_name" {}
variable "view_table_name" {}
variable "view_privileges" {}
data "databricks_group" "view_admins" {
display_name = var.view_admins_display_name
}
resource "databricks_grants" "managed_table_view" {
table = "${var.view_catalog_name}.${var.view_schema_name}.${var.view_table_name}"
grant {
principal = data.databricks_group.view_admins.display_name
privileges = var.view_privileges
}
}
Also create a file named view-grants.auto.tfvars with the following configuration code. Change the specified catalog name, schema name, table name, group name, and grant privileges as needed. For a complete list of available grants, see databricks_grants. This enables you to reuse view-grants.tf in other projects without changing these values in the view-grants.tf file itself.
view_admins_display_name = "admins"
view_catalog_name = "my_catalog"
view_schema_name = "my_schema"
view_table_name = "my_table"
view_privileges = [ "ALL PRIVILEGES" ]
To create new groups with the Databricks Terraform provider instead of using existing groups, see Create users and groups.
Configure a cluster that works with Unity Catalog
This section shows how to configure the deployment of a cluster that works with Unity Catalog.
To configure the cluster, create a file named cluster.tf with the following configuration code. This configuration deploys a cluster with the minimum amount of compute resources and the latest Databricks Runtime Long Term Support (LTS) version. See also databricks_cluster.
variable "cluster_name" {}
variable "cluster_autotermination_minutes" {}
variable "cluster_num_workers" {}
variable "cluster_data_security_mode" {}
# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
local_disk = true
}
# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
long_term_support = true
}
resource "databricks_cluster" "this" {
cluster_name = var.cluster_name
node_type_id = data.databricks_node_type.smallest.id
spark_version = data.databricks_spark_version.latest_lts.id
autotermination_minutes = var.cluster_autotermination_minutes
num_workers = var.cluster_num_workers
data_security_mode = var.cluster_data_security_mode
}
output "cluster_url" {
value = databricks_cluster.this.url
}
Also create a file named cluster.auto.tfvars with the following configuration variables, specifying the cluster’s name, the number of minutes of inactivity before the cluster automatically terminates, the number of workers, and the data security mode for Unity Catalog. This enables you to reuse cluster.tf in other projects without changing these values in the cluster.tf file itself.
cluster_name = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers = 1
cluster_data_security_mode = "SINGLE_USER"
Validate, plan, deploy, or destroy the resources
If this is the first time that you are deploying these configurations from your working directory, run the terraform init command first to initialize the directory and download the required providers.
To validate the syntax of the Terraform configurations without deploying them, run the terraform validate command.
To show what actions Terraform would take to deploy the configurations, run the terraform plan command. This command does not actually deploy the configurations.
To deploy the configurations, run the terraform apply command.
To delete the deployed resources, run the terraform destroy command.
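For example, a typical sequence of these commands, run from the directory that contains your .tf files, might look like the following; terraform output prints the cluster_url value defined in cluster.tf:
# Initialize the working directory (first run only).
terraform init
# Check the configuration syntax.
terraform validate
# Preview the changes without deploying anything.
terraform plan
# Deploy the resources.
terraform apply
# Print the URL of the deployed cluster.
terraform output cluster_url
# Remove all of the deployed resources when they are no longer needed.
terraform destroy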