Create Databricks workspaces using Terraform
This guide shows you how to use the Databricks Terraform provider to create Databricks workspaces, along with all required AWS infrastructure. The instructions provided here apply only to Databricks accounts on the E2 version of the platform. All new Databricks accounts and most existing accounts are now E2. If you are unsure which account type you have, contact your Databricks representative.
Provider initialization for E2 workspaces
This guide assumes you have a client ID and client secret for a service principal that has the account admin role. See Authentication using OAuth tokens for service principals. You must also know your account ID (databricks_account_id). To get your account ID, see Locate your account ID.
Note
An account admin's username and password can also be used to authenticate. However, Databricks strongly recommends that you use OAuth for service principals. To use a username and password, use the following variables:
variable "databricks_account_username" {}
variable "databricks_account_password" {}
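In that case, the account-level provider block shown later in this guide would reference these variables instead of the OAuth ones. A minimal sketch, assuming the same "mws" alias used in the rest of this guide:

// legacy username/password variant of the account-level provider (OAuth is preferred)
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}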
This guide is provided as-is and is intended as a basis for your own configuration.
variable "databricks_client_id" {}
variable "databricks_client_secret" {}
variable "databricks_account_id" {}
variable "tags" {
default = {}
}
variable "cidr_block" {
default = "10.4.0.0/16"
}
variable "region" {
default = "eu-west-1"
}
resource "random_string" "naming" {
special = false
upper = false
length = 6
}
locals {
prefix = "demo${random_string.naming.result}"
}
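Terraform can read values for the variables above from a terraform.tfvars file or from TF_VAR_-prefixed environment variables (for example, TF_VAR_databricks_client_id). A minimal terraform.tfvars sketch; all values shown are placeholders, and real secrets should stay out of version control:

// terraform.tfvars (placeholder values; do not commit secrets)
databricks_client_id     = "<service-principal-client-id>"
databricks_client_secret = "<service-principal-oauth-secret>"
databricks_account_id    = "<databricks-account-id>"
tags = {
  Owner = "you@example.com"
}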
Before you manage a workspace, you must create a VPC, root bucket, cross-account role, Databricks E2 workspace, and host and token outputs. You must also initialize the provider with alias = "mws" and use provider = databricks.mws for all databricks_mws_* resources. The provider requires all databricks_mws_* resources to be created within a dedicated Terraform module of your environment. Usually this module also creates the VPC and IAM roles.
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "aws" {
  region = var.region
}

// initialize provider in "MWS" mode to provision new workspace
provider "databricks" {
  alias         = "mws"
  host          = "https://accounts.cloud.databricks.com"
  client_id     = var.databricks_client_id
  client_secret = var.databricks_client_secret
}
Step 1: Create a VPC
Create an AWS VPC with all necessary firewall rules. See Customer-managed VPC for complete and up-to-date details on networking. The AWS VPC is registered as the databricks_mws_networks resource.
data "aws_availability_zones" "available" {}
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "2.70.0"
name = local.prefix
cidr = var.cidr_block
azs = data.aws_availability_zones.available.names
tags = var.tags
enable_dns_hostnames = true
enable_nat_gateway = true
create_igw = true
public_subnets = [cidrsubnet(var.cidr_block, 3, 0)]
private_subnets = [cidrsubnet(var.cidr_block, 3, 1),
cidrsubnet(var.cidr_block, 3, 2)]
default_security_group_egress = [{
cidr_blocks = "0.0.0.0/0"
}]
default_security_group_ingress = [{
description = "Allow all internal TCP and UDP"
self = true
}]
}
resource "databricks_mws_networks" "this" {
provider = databricks.mws
account_id = var.databricks_account_id
network_name = "${local.prefix}-network"
security_group_ids = [module.vpc.default_security_group_id]
subnet_ids = module.vpc.private_subnets
vpc_id = module.vpc.vpc_id
}
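To make the cidrsubnet arithmetic concrete: each call carves a /19 out of the VPC CIDR (the /16 plus 3 new bits), selected by the last argument. With the default 10.4.0.0/16 the calls above evaluate to:

// cidrsubnet("10.4.0.0/16", 3, 0) => "10.4.0.0/19"   (public subnet)
// cidrsubnet("10.4.0.0/16", 3, 1) => "10.4.32.0/19"  (private subnet)
// cidrsubnet("10.4.0.0/16", 3, 2) => "10.4.64.0/19"  (private subnet)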
Step 2: Create a root bucket
Create an AWS S3 bucket for DBFS workspace storage, which is commonly referred to as the root bucket. The provider includes the databricks_aws_bucket_policy data source, which generates the necessary IAM policy template. Your AWS S3 bucket must be registered using the databricks_mws_storage_configurations resource.
resource "aws_s3_bucket" "root_storage_bucket" {
bucket = "${local.prefix}-rootbucket"
acl = "private"
versioning {
enabled = false
}
force_destroy = true
tags = merge(var.tags, {
Name = "${local.prefix}-rootbucket"
})
}
resource "aws_s3_bucket_public_access_block" "root_storage_bucket" {
bucket = aws_s3_bucket.root_storage_bucket.id
ignore_public_acls = true
depends_on = [aws_s3_bucket.root_storage_bucket]
}
data "databricks_aws_bucket_policy" "this" {
bucket = aws_s3_bucket.root_storage_bucket.bucket
}
resource "aws_s3_bucket_policy" "root_bucket_policy" {
bucket = aws_s3_bucket.root_storage_bucket.id
policy = data.databricks_aws_bucket_policy.this.json
}
resource "databricks_mws_storage_configurations" "this" {
provider = databricks.mws
account_id = var.databricks_account_id
bucket_name = aws_s3_bucket.root_storage_bucket.bucket
storage_configuration_name = "${local.prefix}-storage"
}
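If you want to review the generated bucket policy before it is applied, you can surface it as an output. This helper is not part of the original guide, just an optional inspection aid:

// optional: expose the rendered root-bucket policy document for review
output "root_bucket_policy_json" {
  value = data.databricks_aws_bucket_policy.this.json
}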
Step 3: Create a cross-account IAM role
A cross-account IAM role is registered with the databricks_mws_credentials resource.
data "databricks_aws_assume_role_policy" "this" {
external_id = var.databricks_account_id
}
resource "aws_iam_role" "cross_account_role" {
name = "${local.prefix}-crossaccount"
assume_role_policy = data.databricks_aws_assume_role_policy.this.json
tags = var.tags
}
data "databricks_aws_crossaccount_policy" "this" {
}
resource "aws_iam_role_policy" "this" {
name = "${local.prefix}-policy"
role = aws_iam_role.cross_account_role.id
policy = data.databricks_aws_crossaccount_policy.this.json
}
resource "databricks_mws_credentials" "this" {
provider = databricks.mws
account_id = var.databricks_account_id
role_arn = aws_iam_role.cross_account_role.arn
credentials_name = "${local.prefix}-creds"
depends_on = [aws_iam_role_policy.this]
}
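The role ARN registered with databricks_mws_credentials can also be useful on its own, for example when auditing the cross-account trust relationship. A hypothetical convenience output, not part of the original guide:

// optional: expose the cross-account role ARN for auditing
output "cross_account_role_arn" {
  value = aws_iam_role.cross_account_role.arn
}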
Step 4: Create a Databricks E2 workspace
Create a Databricks E2 workspace using the databricks_mws_workspaces resource. Code that creates workspaces and code that manages workspaces must be in separate Terraform modules to avoid common confusion between provider = databricks.mws and provider = databricks.created_workspace. This is why you must specify the databricks_host and databricks_token outputs in the following modules.
resource "databricks_mws_workspaces" "this" {
provider = databricks.mws
account_id = var.databricks_account_id
aws_region = var.region
workspace_name = local.prefix
deployment_name = local.prefix
credentials_id = databricks_mws_credentials.this.credentials_id
storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
network_id = databricks_mws_networks.this.network_id
}
// export host to be used by other modules
output "databricks_host" {
value = databricks_mws_workspaces.this.workspace_url
}
// initialize provider in normal mode
provider "databricks" {
// in normal scenario you won't have to give providers aliases
alias = "created_workspace"
host = databricks_mws_workspaces.this.workspace_url
}
// create PAT token to provision entities within workspace
resource "databricks_token" "pat" {
provider = databricks.created_workspace
comment = "Terraform Provisioning"
lifetime_seconds = 86400
}
// export token for integration tests to run on
output "databricks_token" {
value = databricks_token.pat.token_value
sensitive = true
}
Provider configuration
In Manage Databricks workspaces using Terraform, use the following configuration for the provider:
provider "databricks" {
host = module.ai.databricks_host
token = module.ai.databricks_token
}
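For these references to resolve, the workspace-creation code from this guide must live in a child module named ai. A minimal sketch of the calling configuration, assuming the module's code sits in a local modules/ai directory (the path and wiring are placeholders, not part of the original guide):

// hypothetical wiring of the workspace-creation module
module "ai" {
  source                   = "./modules/ai"
  databricks_client_id     = var.databricks_client_id
  databricks_client_secret = var.databricks_client_secret
  databricks_account_id    = var.databricks_account_id
}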