Tutorial: Create a workspace with Terraform

Creating a Databricks workspace requires many steps, especially when you use the Databricks and AWS account consoles. In this tutorial, you use the Databricks Terraform provider and the AWS Terraform provider to programmatically create a Databricks workspace along with the required AWS resources. Both providers are plugins for HashiCorp Terraform, a popular open source infrastructure as code (IaC) tool for managing the operational lifecycle of cloud resources.

Requirements

  • An existing or new Databricks on AWS account. To create a new Databricks Platform Free Trial account, follow the instructions in Get started: Account and workspace setup.

    Note

    For a new Databricks account, you must set up an initial workspace, which the preceding instructions guide you through. This tutorial enables you to use the Databricks Terraform provider to create an additional workspace beyond the initial one.

  • A client ID and client secret for a service principal that has the account admin role in your Databricks account. See Authentication using OAuth tokens for service principals.

    Note

    An account admin’s username and password can also be used to authenticate. However, Databricks strongly recommends that you use OAuth for service principals. To use a username and password instead, declare the following variables:

    • variable "databricks_account_username" {}

    • variable "databricks_account_password" {}

  • Your Databricks account ID. To get this value, follow Locate your account ID.

  • For the AWS account associated with your Databricks account, permissions for your AWS Identity and Access Management (IAM) user in the AWS account to create:

    • An IAM cross-account policy and role.

    • A virtual private cloud (VPC) and associated resources in Amazon VPC.

    • An Amazon S3 bucket.

    See Changing permissions for an IAM user on the AWS website.

  • For your IAM user in the AWS account, an AWS access key, which consists of an AWS access key ID and an AWS secret access key. See Managing access keys (console) on the AWS website.

  • A development machine with the Terraform CLI and Git installed. See Download Terraform on the Terraform website and Install Git on the GitHub website.

  • An existing or new GitHub account. To create one, see Signing up for a new GitHub account on the GitHub website.

Step 1: Create a GitHub repository

In this step, you create a new repository in GitHub to store your Terraform files. It is a best practice to store, track, and control changes to IaC files in a system such as GitHub.

  1. Sign in to your GitHub account.

  2. Create a new repository in your GitHub account. Name the repository databricks-aws-terraform.

  3. Run the following commands, one command at a time, from your development machine’s terminal. In the git remote command, replace <your-GitHub-username> with your GitHub username. These commands create an empty directory, fill it with starter content, initialize it as a local Git repository, and then push this local repository to the new repository in your GitHub account.

    mkdir databricks-aws-terraform
    cd databricks-aws-terraform
    echo "# Databricks Terraform provider sample for AWS" >> README.md
    git init
    git add README.md
    git commit -m "First commit"
    git branch -M main
    git remote add origin git@github.com:<your-GitHub-username>/databricks-aws-terraform.git
    git push -u origin main
    

    Tip

    If you get a “permission denied” error after you run the git push command, see Connecting to GitHub with SSH on the GitHub website.

  4. In the root of your databricks-aws-terraform directory, use your favorite code editor to create a file named .gitignore with the following content. This file instructs Git not to track the specified files in your repository, because you generate these files later in this tutorial and they should not be committed.

    *.terraform
    *.tfvars
    

Step 2: Declare and initialize Terraform variables

In this step, you produce all of the code that Terraform needs to create the required Databricks and AWS resources.

Create the following seven files in the root of your databricks-aws-terraform directory. These files define your Databricks workspace and its dependent resources in your AWS account, in code. Links to related Databricks and AWS documentation on the Terraform website are included as comments within the code for future reference, and also in the accompanying text.

Important

You must provide Terraform with your AWS account credentials. These can be specified through sources such as environment variables or shared configuration and credentials files. See Authentication and Configuration on the Terraform website.

Warning

While Terraform supports hard-coding your AWS account credentials in Terraform files, this approach is not recommended, as it risks secret leakage should such files ever be committed to a public version control system.
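
As one option, you can export your AWS access key as environment variables in the shell session where you run Terraform. A minimal sketch, with placeholder values:

export AWS_ACCESS_KEY_ID="<your-aws-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-aws-secret-access-key>"

Alternatively, the AWS provider can read a named profile from the shared credentials file at ~/.aws/credentials, for example one configured with the AWS CLI.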

  • vars.tf: This file defines Terraform input variables that are used in later files for:

    • Your Databricks account ID.

    • The client ID, also known as an application ID, and client secret for an account admin service principal in your Databricks account.

    • The Classless Inter-Domain Routing (CIDR) block for the dependent virtual private cloud (VPC) in Amazon Virtual Private Cloud (Amazon VPC). See VPC basics on the AWS website.

    • The AWS Region where the dependent AWS resources are created. Change this Region as needed. See Regions and Availability Zones and AWS Regional Services on the AWS website.

    This file also includes a Terraform local value and related logic for assigning randomly generated identifiers to the resources that Terraform creates throughout these files.

    For related Terraform documentation, see random_string (Resource) on the Terraform website.

    variable "databricks_client_id" {}
    variable "databricks_client_secret" {}
    variable "databricks_account_id" {}
    
    variable "tags" {
      default = {}
    }
    
    variable "cidr_block" {
      default = "10.4.0.0/16"
    }
    
    variable "region" {
      default = "eu-west-1"
    }
    
    // See https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/string
    resource "random_string" "naming" {
      special = false
      upper   = false
      length  = 6
    }
    
    locals {
      prefix = "demo-${random_string.naming.result}"
    }
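
    The two variables with defaults (cidr_block and region) can be overridden without editing this file, for example by passing -var on the command line when you run terraform apply in step 3 (shown here with a hypothetical alternate Region):

    terraform apply -var-file="tutorial.tfvars" -var="region=us-east-1"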
    
  • init.tf: This file initializes Terraform with the required Databricks Provider and the AWS Provider. This file also establishes your Databricks account credentials and instructs Terraform to use the E2 version of the Databricks on AWS platform.

    For related Terraform documentation, see Authentication on the Terraform website.

    terraform {
      required_providers {
        databricks = {
          source  = "databricks/databricks"
        }
        aws = {
          source = "hashicorp/aws"
          version = "3.49.0"
        }
      }
    }
    
    provider "aws" {
      region = var.region
    }
    
    // Initialize provider in "MWS" mode to provision the new workspace.
    // alias = "mws" instructs Databricks to connect to https://accounts.cloud.databricks.com, to create
    // a Databricks workspace that uses the E2 version of the Databricks on AWS platform.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication
    provider "databricks" {
      alias    = "mws"
      host     = "https://accounts.cloud.databricks.com"
      client_id = var.databricks_client_id
      client_secret = var.databricks_client_secret
    }
    
  • cross-account-role.tf: This file instructs Terraform to create the required IAM cross-account role and related policies within your AWS account. This role enables Databricks to take the necessary actions within your AWS account. See Create an IAM role for workspace deployment.

    For related Terraform documentation, see the links in the code comments below.

    // Create the required AWS STS assume role policy in your AWS account.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_assume_role_policy
    data "databricks_aws_assume_role_policy" "this" {
      external_id = var.databricks_account_id
    }
    
    // Create the required IAM role in your AWS account.
    // See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role
    resource "aws_iam_role" "cross_account_role" {
      name               = "${local.prefix}-crossaccount"
      assume_role_policy = data.databricks_aws_assume_role_policy.this.json
      tags               = var.tags
    }
    
    // Create the required AWS cross-account policy in your AWS account.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_crossaccount_policy
    data "databricks_aws_crossaccount_policy" "this" {}
    
    // Create the required IAM role inline policy in your AWS account.
    // See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy
    resource "aws_iam_role_policy" "this" {
      name   = "${local.prefix}-policy"
      role   = aws_iam_role.cross_account_role.id
      policy = data.databricks_aws_crossaccount_policy.this.json
    }
    
    // Properly configure the cross-account role for the creation of new workspaces within your AWS account.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_credentials
    resource "databricks_mws_credentials" "this" {
      provider         = databricks.mws
      account_id       = var.databricks_account_id
      role_arn         = aws_iam_role.cross_account_role.arn
      credentials_name = "${local.prefix}-creds"
      depends_on       = [aws_iam_role_policy.this]
    }
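
    After you run terraform apply in step 3, you can optionally confirm that the cross-account role was created, for example with the AWS CLI. The role name below is a placeholder; substitute the prefix that random_string generates for you:

    aws iam get-role --role-name <your-prefix>-crossaccount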
    
  • vpc.tf: This file instructs Terraform to create the required VPC in your AWS account. See Customer-managed VPC.

    For related Terraform documentation, see the links in the code comments below.

    // Allow access to the list of AWS Availability Zones within the AWS Region that is configured in vars.tf and init.tf.
    // See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/availability_zones
    data "aws_availability_zones" "available" {}
    
    // Create the required VPC resources in your AWS account.
    // See https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest
    module "vpc" {
      source  = "terraform-aws-modules/vpc/aws"
      version = "3.2.0"
    
      name = local.prefix
      cidr = var.cidr_block
      azs  = data.aws_availability_zones.available.names
      tags = var.tags
    
      enable_dns_hostnames = true
      enable_nat_gateway   = true
      single_nat_gateway   = true
      create_igw           = true
    
      public_subnets  = [cidrsubnet(var.cidr_block, 3, 0)]
      private_subnets = [cidrsubnet(var.cidr_block, 3, 1),
                         cidrsubnet(var.cidr_block, 3, 2)]
    
      manage_default_security_group = true
      default_security_group_name = "${local.prefix}-sg"
    
      default_security_group_egress = [{
        cidr_blocks = "0.0.0.0/0"
      }]
    
      default_security_group_ingress = [{
        description = "Allow all internal TCP and UDP"
        self        = true
      }]
    }
    
    // Create the required VPC endpoints within your AWS account.
    // See https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest/submodules/vpc-endpoints
    module "vpc_endpoints" {
      source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
      version = "3.2.0"
    
      vpc_id             = module.vpc.vpc_id
      security_group_ids = [module.vpc.default_security_group_id]
    
      endpoints = {
        s3 = {
          service         = "s3"
          service_type    = "Gateway"
          route_table_ids = flatten([
            module.vpc.private_route_table_ids,
            module.vpc.public_route_table_ids])
          tags            = {
            Name = "${local.prefix}-s3-vpc-endpoint"
          }
        },
        sts = {
          service             = "sts"
          private_dns_enabled = true
          subnet_ids          = module.vpc.private_subnets
          tags                = {
            Name = "${local.prefix}-sts-vpc-endpoint"
          }
        },
        kinesis-streams = {
          service             = "kinesis-streams"
          private_dns_enabled = true
          subnet_ids          = module.vpc.private_subnets
          tags                = {
            Name = "${local.prefix}-kinesis-vpc-endpoint"
          }
        }
      }
    
      tags = var.tags
    }
    
    // Properly configure the VPC and subnets for Databricks within your AWS account.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_networks
    resource "databricks_mws_networks" "this" {
      provider           = databricks.mws
      account_id         = var.databricks_account_id
      network_name       = "${local.prefix}-network"
      security_group_ids = [module.vpc.default_security_group_id]
      subnet_ids         = module.vpc.private_subnets
      vpc_id             = module.vpc.vpc_id
    }
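
    The cidrsubnet calls above split the VPC’s /16 CIDR block into /19 subnets. If you want to preview the resulting ranges, you can evaluate the same expressions in terraform console (after terraform init in step 3), shown here with the default 10.4.0.0/16 block:

    > cidrsubnet("10.4.0.0/16", 3, 0)
    "10.4.0.0/19"
    > cidrsubnet("10.4.0.0/16", 3, 1)
    "10.4.32.0/19"
    > cidrsubnet("10.4.0.0/16", 3, 2)
    "10.4.64.0/19"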
    
  • root-bucket.tf: This file instructs Terraform to create the required Amazon S3 root bucket within your AWS account. Databricks stores artifacts such as cluster logs, notebook revisions, and job results in an S3 bucket, which is commonly referred to as the root bucket.

    For related Terraform documentation, see the links in the code comments below.

    // Create the S3 root bucket.
    // See https://registry.terraform.io/modules/terraform-aws-modules/s3-bucket/aws/latest
    resource "aws_s3_bucket" "root_storage_bucket" {
      bucket = "${local.prefix}-rootbucket"
      acl    = "private"
      versioning {
        enabled = false
      }
      force_destroy = true
      tags = merge(var.tags, {
        Name = "${local.prefix}-rootbucket"
      })
    }
    
    // Ignore public access control lists (ACLs) on the S3 root bucket and on any objects that this bucket contains.
    // See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_public_access_block
    resource "aws_s3_bucket_public_access_block" "root_storage_bucket" {
      bucket             = aws_s3_bucket.root_storage_bucket.id
      ignore_public_acls = true
      depends_on         = [aws_s3_bucket.root_storage_bucket]
    }
    
    // Configure a simple access policy for the S3 root bucket within your AWS account, so that Databricks can access data in it.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_bucket_policy
    data "databricks_aws_bucket_policy" "this" {
      bucket = aws_s3_bucket.root_storage_bucket.bucket
    }
    
    // Attach the access policy to the S3 root bucket within your AWS account.
    // See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_policy
    resource "aws_s3_bucket_policy" "root_bucket_policy" {
      bucket     = aws_s3_bucket.root_storage_bucket.id
      policy     = data.databricks_aws_bucket_policy.this.json
      depends_on = [aws_s3_bucket_public_access_block.root_storage_bucket]
    }
    
    // Configure the S3 root bucket within your AWS account for new Databricks workspaces.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_storage_configurations
    resource "databricks_mws_storage_configurations" "this" {
      provider                   = databricks.mws
      account_id                 = var.databricks_account_id
      bucket_name                = aws_s3_bucket.root_storage_bucket.bucket
      storage_configuration_name = "${local.prefix}-storage"
    }
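
    After the apply in step 3 completes, you can optionally confirm the attached bucket policy with the AWS CLI; the bucket name below is a placeholder that uses your generated prefix:

    aws s3api get-bucket-policy --bucket <your-prefix>-rootbucket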
    
  • workspace.tf: This file instructs Terraform to create the workspace within your Databricks account. This file also includes Terraform output values that represent the workspace’s URL and the Databricks personal access token for your Databricks user within your new workspace.

    For related Terraform documentation, see the links in the code comments below.

    // Set up the Databricks workspace to use the E2 version of the Databricks on AWS platform.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_workspaces
    resource "databricks_mws_workspaces" "this" {
      provider        = databricks.mws
      account_id      = var.databricks_account_id
      aws_region      = var.region
      workspace_name  = local.prefix
      deployment_name = local.prefix
    
      credentials_id           = databricks_mws_credentials.this.credentials_id
      storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
      network_id               = databricks_mws_networks.this.network_id
    }
    
    // Capture the Databricks workspace's URL.
    output "databricks_host" {
      value = databricks_mws_workspaces.this.workspace_url
    }
    
    // Initialize the Databricks provider in "normal" (workspace) mode.
    // See https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication
    provider "databricks" {
      // In workspace mode, you don't have to give providers aliases. Doing it here, however,
      // makes it easier to reference, for example when creating a Databricks personal access token
      // later in this file.
      alias = "created_workspace"
      host = databricks_mws_workspaces.this.workspace_url
    }
    
    // Create a Databricks personal access token, to provision entities within the workspace.
    resource "databricks_token" "pat" {
      provider = databricks.created_workspace
      comment  = "Terraform Provisioning"
      lifetime_seconds = 86400
    }
    
    // Export the Databricks personal access token's value, for integration tests to run on.
    output "databricks_token" {
      value     = databricks_token.pat.token_value
      sensitive = true
    }
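
    After the apply in step 3, you can read both outputs from the Terraform state. Because databricks_token is marked sensitive, it is redacted in the apply summary, but you can print it explicitly:

    terraform output databricks_host
    terraform output -raw databricks_token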
    
  • tutorial.tfvars: This file contains your Databricks account ID and your Databricks service principal’s client ID and client secret. The *.tfvars entry in the .gitignore file helps prevent these sensitive values from being accidentally checked in to your remote GitHub repository.

    databricks_client_id = "<your-Databricks-client-id>"
    databricks_client_secret = "<your-Databricks-client-secret>"
    databricks_account_id = "<your-Databricks-account-ID>"
    

Note

If you are using an account admin’s username and password for authentication, use the following:

databricks_account_username = "<your-Databricks-account-username>"
databricks_account_password = "<your-Databricks-account-password>"
databricks_account_id = "<your-Databricks-account-ID>"
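
If you prefer not to store these values in a file at all, Terraform also reads input variables from environment variables that use the TF_VAR_ prefix. A sketch, using the OAuth variables declared in vars.tf:

export TF_VAR_databricks_client_id="<your-Databricks-client-id>"
export TF_VAR_databricks_client_secret="<your-Databricks-client-secret>"
export TF_VAR_databricks_account_id="<your-Databricks-account-ID>"

If you set the variables this way, you can omit the -var-file option in the next step.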

Step 3: Create the required Databricks and AWS resources

In this step, you instruct Terraform to create all of the required Databricks and AWS resources that are needed for your new workspace.

  1. Run the following commands, one command at a time, from the root of your databricks-aws-terraform directory. These commands instruct Terraform to download all of the required dependencies to your development machine, inspect the instructions in your Terraform files, determine what resources need to be added or deleted, and finally, create all of the specified resources.

    terraform init
    terraform apply -var-file="tutorial.tfvars"
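
    Tip

    To preview the planned changes without creating any resources, you can optionally run the plan command first:

    terraform plan -var-file="tutorial.tfvars"
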
    
  2. Within a few minutes, your Databricks workspace is ready. Use the workspace URL, displayed in the output of the terraform apply command, to sign in to your workspace. Be sure to sign in with your Databricks workspace administrator credentials.

Step 4: Commit your changes to your GitHub repository

In this step, you add your IaC source to your repository in GitHub.

Run the following commands, one command at a time, from the root of your databricks-aws-terraform directory. These commands create a new branch in your repository, add your IaC source files to that branch, and then push the local branch to your remote repository.

git checkout -b databricks_workspace
git add .gitignore
git add cross-account-role.tf
git add init.tf
git add root-bucket.tf
git add vars.tf
git add vpc.tf
git add workspace.tf
git commit -m "Create Databricks E2 workspace"
git push origin HEAD

Note that this tutorial uses local Terraform state. This is fine if you are the sole developer, but if you collaborate in a team, Databricks strongly recommends that you use Terraform remote state instead, which can be shared among all members of the team. Terraform supports storing state in Terraform Cloud, HashiCorp Consul, Amazon S3, Azure Blob Storage, Google Cloud Storage, and other backends. Pushing local state to GitHub, for example, could unexpectedly expose sensitive data such as Databricks account usernames, passwords, client secrets, client IDs, or personal access tokens, and could also trigger GitGuardian warnings.
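
A minimal sketch of what an Amazon S3 remote state backend could look like, assuming a state bucket that you create and secure separately (the bucket and key names here are placeholders); you would add this block to init.tf and rerun terraform init:

terraform {
  backend "s3" {
    bucket = "<your-terraform-state-bucket>"
    key    = "databricks-aws-terraform/terraform.tfstate"
    region = "eu-west-1"
  }
}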

Step 5: Clean up

In this step, you can clean up the resources that you used in this tutorial, if you no longer want them in your Databricks or AWS accounts.

To clean up, run the following command from the root of your databricks-aws-terraform directory. This command deletes the workspace, as well as the other related resources that were previously created.

terraform destroy
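
If Terraform prompts you for variable values when you run this command, pass the same variables file that you used for the apply:

terraform destroy -var-file="tutorial.tfvars"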

Additional resources