Create Databricks workspaces using Terraform

This guide shows you how to use the Databricks Terraform provider to create Databricks workspaces, along with all required AWS infrastructure. The instructions here apply only to Databricks accounts on the E2 version of the platform. All new Databricks accounts and most existing accounts are now E2. If you are unsure which account type you have, contact your Databricks representative.

Provider initialization for E2 workspaces

This guide assumes you have a client ID and client secret for a service principal that has the account admin role. See Authentication using OAuth tokens for service principals. You must also know your account ID (databricks_account_id). To get your account ID, see Locate your account ID.

Note

An account admin's username and password can also be used to authenticate. However, Databricks strongly recommends that you use OAuth for service principals. To use a username and password, use the following variables, as shown in the sketch after this list:

  • variable "databricks_account_username" {}

  • variable "databricks_account_password" {}
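
A minimal sketch of the account-level provider block using these variables instead of OAuth (shown only for completeness; Databricks recommends OAuth):

provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}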

This guide is provided as-is and is intended to serve as a basis for your configuration.

variable "databricks_client_id" {}
variable "databricks_client_secret" {}
variable "databricks_account_id" {}

variable "tags" {
  default = {}
}

variable "cidr_block" {
  default = "10.4.0.0/16"
}

variable "region" {
  default = "eu-west-1"
}

resource "random_string" "naming" {
  special = false
  upper   = false
  length  = 6
}

locals {
  prefix = "demo${random_string.naming.result}"
}

Before you can manage a workspace, you must create a VPC, a root bucket, a cross-account IAM role, the Databricks E2 workspace itself, and host and token outputs. You must initialize the provider with alias = "mws" and use provider = databricks.mws for all databricks_mws_* resources. Databricks requires all databricks_mws_* resources to be created within a dedicated Terraform module of your environment; usually this module also creates the VPC and IAM roles.

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
    }
  }
}

provider "aws" {
  region = var.region
}

// initialize provider in "MWS" mode to provision new workspace
provider "databricks" {
  alias         = "mws"
  host          = "https://accounts.cloud.databricks.com"
  client_id     = var.databricks_client_id
  client_secret = var.databricks_client_secret
}

Step 1: Create a VPC

Create an AWS VPC with all necessary firewall rules. See Customer-managed VPC for complete and up-to-date details on networking. The AWS VPC is registered with Databricks using the databricks_mws_networks resource.

data "aws_availability_zones" "available" {}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.70.0"

  name = local.prefix
  cidr = var.cidr_block
  azs  = data.aws_availability_zones.available.names
  tags = var.tags

  enable_dns_hostnames = true
  enable_nat_gateway   = true
  create_igw           = true

  public_subnets  = [cidrsubnet(var.cidr_block, 3, 0)]
  private_subnets = [cidrsubnet(var.cidr_block, 3, 1),
                     cidrsubnet(var.cidr_block, 3, 2)]

  default_security_group_egress = [{
    cidr_blocks = "0.0.0.0/0"
  }]

  default_security_group_ingress = [{
    description = "Allow all internal TCP and UDP"
    self        = true
  }]
}

resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = var.databricks_account_id
  network_name       = "${local.prefix}-network"
  security_group_ids = [module.vpc.default_security_group_id]
  subnet_ids         = module.vpc.private_subnets
  vpc_id             = module.vpc.vpc_id
}

Step 2: Create a root bucket

Create an AWS S3 bucket for DBFS workspace storage, which is commonly referred to as the root bucket. The provider includes the databricks_aws_bucket_policy data source, which generates the required bucket policy. Your AWS S3 bucket must be registered using the databricks_mws_storage_configurations resource.

resource "aws_s3_bucket" "root_storage_bucket" {
  bucket = "${local.prefix}-rootbucket"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(var.tags, {
    Name = "${local.prefix}-rootbucket"
  })
}

resource "aws_s3_bucket_public_access_block" "root_storage_bucket" {
  bucket             = aws_s3_bucket.root_storage_bucket.id
  ignore_public_acls = true
  depends_on         = [aws_s3_bucket.root_storage_bucket]
}

data "databricks_aws_bucket_policy" "this" {
  bucket = aws_s3_bucket.root_storage_bucket.bucket
}

resource "aws_s3_bucket_policy" "root_bucket_policy" {
  bucket = aws_s3_bucket.root_storage_bucket.id
  policy = data.databricks_aws_bucket_policy.this.json
}

resource "databricks_mws_storage_configurations" "this" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  bucket_name                = aws_s3_bucket.root_storage_bucket.bucket
  storage_configuration_name = "${local.prefix}-storage"
}

Step 3: Create a cross-account IAM role

Create a cross-account IAM role that gives Databricks access to your AWS account, and register it with the databricks_mws_credentials resource. The provider includes the databricks_aws_assume_role_policy and databricks_aws_crossaccount_policy data sources, which generate the required trust and access policies.

data "databricks_aws_assume_role_policy" "this" {
  external_id = var.databricks_account_id
}

resource "aws_iam_role" "cross_account_role" {
  name               = "${local.prefix}-crossaccount"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
  tags               = var.tags
}

data "databricks_aws_crossaccount_policy" "this" {
}

resource "aws_iam_role_policy" "this" {
  name   = "${local.prefix}-policy"
  role   = aws_iam_role.cross_account_role.id
  policy = data.databricks_aws_crossaccount_policy.this.json
}

resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  role_arn         = aws_iam_role.cross_account_role.arn
  credentials_name = "${local.prefix}-creds"
  depends_on       = [aws_iam_role_policy.this]
}

Step 4: Create a Databricks E2 workspace

Create a Databricks E2 workspace using the databricks_mws_workspaces resource. Code that creates workspaces and code that manages workspaces must be kept in separate Terraform modules to avoid confusion between provider = databricks.mws and provider = databricks.created_workspace. This is why you must expose the databricks_host and databricks_token outputs shown below, which the workspace-management modules consume.

resource "databricks_mws_workspaces" "this" {
  provider        = databricks.mws
  account_id      = var.databricks_account_id
  aws_region      = var.region
  workspace_name  = local.prefix
  deployment_name = local.prefix

  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id               = databricks_mws_networks.this.network_id
}

// export host to be used by other modules
output "databricks_host" {
  value = databricks_mws_workspaces.this.workspace_url
}

// initialize provider in normal mode
provider "databricks" {
  // in a normal scenario, you don't have to give providers aliases
  alias = "created_workspace"
  host  = databricks_mws_workspaces.this.workspace_url
}

// create PAT token to provision entities within workspace
resource "databricks_token" "pat" {
  provider         = databricks.created_workspace
  comment          = "Terraform Provisioning"
  lifetime_seconds = 86400
}

// export token for integration tests to run on
output "databricks_token" {
  value     = databricks_token.pat.token_value
  sensitive = true
}
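
If you apply this configuration directly rather than as a child module, you can read these values after terraform apply with terraform output -raw databricks_host and terraform output -raw databricks_token.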

Provider configuration

In Manage Databricks workspaces using Terraform, use the following configuration for the provider. Here, module.ai refers to the module that contains the workspace-creation code above; replace ai with whatever name you gave that module in your configuration:

provider "databricks" {
  host  = module.ai.databricks_host
  token = module.ai.databricks_token
}
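
A minimal sketch of how the root configuration might declare that module (the module name ai and the source path are assumptions; adjust them to your layout):

module "ai" {
  // hypothetical path to the module containing the workspace-creation code above
  source = "./modules/e2-workspace"

  databricks_client_id     = var.databricks_client_id
  databricks_client_secret = var.databricks_client_secret
  databricks_account_id    = var.databricks_account_id
}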