Databricks Terraform provider
HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. You can use the Databricks Terraform provider to manage your Databricks workspaces and the associated cloud infrastructure with the same flexible, powerful tooling. The goal of the Databricks Terraform provider is to support all Databricks REST APIs, automating the most complicated aspects of deploying and managing your data platforms. Databricks customers use the Databricks Terraform provider to deploy and manage clusters and jobs and to configure data access. You use the Databricks Terraform provider to provision Databricks workspaces, and the AWS provider to provision the AWS resources that these workspaces require.
Getting started
In this section, you install and configure requirements to use Terraform and the Databricks Terraform provider. You then configure Terraform authentication. Following this section, this article provides a sample configuration that you can experiment with to provision a Databricks notebook, cluster, and a job to run the notebook on the cluster, in an existing Databricks workspace.
Requirements
Note
This section describes the requirements for creating resources at the Databricks on AWS account level by using Databricks OAuth, and for creating resources at the Databricks on AWS workspace level by using Databricks personal access tokens. To use other supported Databricks authentication types, see Databricks client unified authentication.
To use Terraform to create resources at the AWS account level, and to use the Databricks Terraform provider to create resources at the Databricks on AWS account level, you must have the following:
An AWS account.
A Databricks on AWS account.
A service principal that has the account admin role in your Databricks account.
The Terraform CLI. See Download Terraform on the Terraform website.
The following seven environment variables:
DATABRICKS_CLIENT_ID, set to the value of the client ID, also known as the application ID, of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_CLIENT_SECRET, set to the value of the client secret of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_ACCOUNT_ID, set to the value of the ID of your Databricks account. You can find this value in the corner of your Databricks account console.
TF_VAR_databricks_account_id, also set to the value of the ID of your Databricks account.
AWS_ACCESS_KEY_ID, set to the value of your AWS user's access key ID. See Programmatic access in the AWS General Reference.
AWS_SECRET_ACCESS_KEY, set to the value of your AWS user's secret access key. See Programmatic access in the AWS General Reference.
AWS_REGION, set to the value of the AWS Region code for your Databricks account. See Regional endpoints in the AWS General Reference.
Note
An account admin's username and password can also be used to authenticate to the Terraform provider, but Databricks strongly recommends that you use OAuth for service principals. To use a username and password, you must have the following environment variables:
DATABRICKS_USERNAME, set to the value of your Databricks account-level admin username.
DATABRICKS_PASSWORD, set to the value of the password for your Databricks account-level admin user.
To set these environment variables, see your operating system’s documentation.
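For example, on Linux or macOS with a bash-compatible shell, you might export them for the current session as follows. This is only a sketch with placeholder values; adapt it to your own shell and operating system:

# Placeholder values; replace with your own.
export DATABRICKS_CLIENT_ID="<service-principal-client-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-client-secret>"
export DATABRICKS_ACCOUNT_ID="<databricks-account-id>"
export TF_VAR_databricks_account_id="<databricks-account-id>"
export AWS_ACCESS_KEY_ID="<aws-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<aws-secret-access-key>"
export AWS_REGION="<aws-region-code>"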
To use the Databricks Terraform provider to also create resources at the Databricks workspace level, you must have the following:
A Databricks workspace.
On your local development machine, you must have:
The Terraform CLI. See Download Terraform on the Terraform website.
One of the following:
Databricks CLI version 0.205 or above, configured with your Databricks personal access token by running:

databricks configure --host <workspace-url> --profile <some-unique-profile-name>

See Install or update the Databricks CLI and Databricks personal access token authentication.

Note

As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
The following Databricks environment variables:
DATABRICKS_HOST, set to the value of your Databricks workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com.
DATABRICKS_CLIENT_ID, set to the value of the client ID, also known as the application ID, of the service principal. See Authentication using OAuth for service principals.
DATABRICKS_CLIENT_SECRET, set to the value of the client secret of the service principal. See Authentication using OAuth for service principals.

Alternatively, you can use a personal access token instead of a service principal's client ID and client secret:

DATABRICKS_TOKEN, set to the value of your Databricks personal access token. See also Manage personal access tokens.
To set these environment variables, see your operating system’s documentation.
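For example, on Linux or macOS with a bash-compatible shell, you might export the workspace-level variables for the current session as follows. This is only a sketch with placeholder values; the commented DATABRICKS_TOKEN line shows the personal access token alternative:

# Placeholder values; replace with your own.
export DATABRICKS_HOST="https://dbc-1234567890123456.cloud.databricks.com"
export DATABRICKS_CLIENT_ID="<service-principal-client-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-client-secret>"
# Or, to authenticate with a personal access token instead:
# export DATABRICKS_TOKEN="<personal-access-token>"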
Configure Terraform authentication
Note
This section describes how to create resources at the Databricks on AWS account level by using Databricks OAuth, and how to create resources at the Databricks on AWS workspace level by using Databricks personal access tokens. To use other supported Databricks authentication types, see Databricks client unified authentication.
In your Terraform project, you must create a configuration to authenticate Terraform with your AWS account, and to authenticate the Databricks Terraform provider with your Databricks on AWS account and your Databricks workspace, as follows:
In your terminal, create an empty directory and then switch to it. (Each separate set of Terraform configuration files must be in its own directory, which is called a Terraform project.) For example:

mkdir terraform_demo && cd terraform_demo
In this empty directory, create a file named auth.tf. Add the following content to this file, depending on your authentication method, and then save the file.

Tip

If you use Visual Studio Code, the HashiCorp Terraform extension for Visual Studio Code adds editing features for Terraform files such as syntax highlighting, IntelliSense, code navigation, code formatting, a module explorer, and much more.
To use environment variables to authenticate at the AWS account level and at the Databricks on AWS account level, and to use a Databricks CLI configuration profile to authenticate at the Databricks workspace level, add the following content.
variable "databricks_connection_profile" {} terraform { required_providers { databricks = { source = "databricks/databricks" } aws = { source = "hashicorp/aws" } } } provider "aws" {} # Use Databricks CLI authentication. provider "databricks" { profile = var.databricks_connection_profile }
To use environment variables to authenticate at the AWS account level, the Databricks on AWS account level, and the Databricks workspace level, add the following content instead:
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {}

# Use environment variables for authentication.
provider "databricks" {}
Tip

If you want to create resources only at the Databricks workspace level, you can remove the aws block from any of the preceding required_providers declarations, along with the provider "aws" declaration.

If you use a Databricks CLI configuration profile to authenticate at the Databricks workspace level, create another file named auth.auto.tfvars, add the following content to the file, and change the name of the profile that you want to use as needed:

databricks_connection_profile = "DEFAULT"
Tip

*.auto.tfvars files enable you to specify variable values separately from your code. This makes your .tf files more modular and reusable across different usage scenarios.

Initialize the working directory containing the auth.tf file by running the terraform init command. For more information, see Command: init on the Terraform website.

terraform init
Terraform downloads the specified providers and installs them in a hidden subdirectory of your current working directory, named .terraform. The terraform init command prints out which versions of the providers were installed. Terraform also creates a lock file named .terraform.lock.hcl, which specifies the exact provider versions used, so that you can control when you want to update the providers used for your project.

Check whether your project was configured correctly by running the terraform plan command. If there are any errors, fix them, and run the command again. For more information, see Command: plan on the Terraform website.

terraform plan
Apply the changes required to reach the desired state of the configuration by running the terraform apply command. For more information, see Command: apply on the Terraform website.

terraform apply
Because no resources have yet been specified in the auth.tf file, the output is Apply complete! Resources: 0 added, 0 changed, 0 destroyed. Also, Terraform writes data into a file called terraform.tfstate. To create resources, continue with Sample configuration, Next steps, or both to specify the desired resources to create, and then run the terraform apply command again. Terraform stores the IDs and properties of the resources it manages in this terraform.tfstate file, so that it can update or destroy those resources going forward.
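The provider versions recorded in .terraform.lock.hcl are whatever terraform init selected. If you want to constrain which versions terraform init is allowed to select, you can optionally add version constraints to the required_providers block. The following is only a sketch; the version numbers shown are illustrative placeholders, not recommendations:

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.0"  # Illustrative constraint; choose the version range you want.
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Illustrative constraint.
    }
  }
}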
Sample configuration
This section provides a sample configuration that you can experiment with to provision a Databricks notebook, a cluster, and a job to run the notebook on the cluster, in an existing Databricks workspace. It assumes that you have already set up the requirements, as well as created a Terraform project and configured the project with Terraform authentication as described in the previous section.
Create another file named me.tf in the same directory that you created in Configure Terraform authentication, and add the following code. This file gets information about the current user (you):

# Retrieve information about the current user.
data "databricks_current_user" "me" {}
Create another file named notebook.tf, and add the following code. This file represents the notebook.

variable "notebook_subdirectory" {
  description = "A name for the subdirectory to store the notebook."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "The notebook's filename."
  type        = string
}

variable "notebook_language" {
  description = "The language of the notebook."
  type        = string
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
  language = var.notebook_language
  source   = "./${var.notebook_filename}"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}
Create another file named notebook.auto.tfvars, and add the following code. This file specifies the notebook's properties.

notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started.py"
notebook_language     = "PYTHON"
Create another file named notebook-getting-started.py, and add the following code. This file represents the notebook's contents.

display(spark.range(10))
Create another file named cluster.tf, and add the following code. This file represents the cluster.

variable "cluster_name" {
  description = "A name for the cluster."
  type        = string
  default     = "My Cluster"
}

variable "cluster_autotermination_minutes" {
  description = "How many minutes before automatically terminating due to inactivity."
  type        = number
  default     = 60
}

variable "cluster_num_workers" {
  description = "The number of workers."
  type        = number
  default     = 1
}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
}

output "cluster_url" {
  value = databricks_cluster.this.url
}
Create another file named cluster.auto.tfvars, and add the following code. This file specifies the cluster's properties.

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
Create another file named job.tf, and add the following code. This file represents the job that runs the notebook on the cluster.

variable "job_name" {
  description = "A name for the job."
  type        = string
  default     = "My Job"
}

resource "databricks_job" "this" {
  name                = var.job_name
  existing_cluster_id = databricks_cluster.this.cluster_id
  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
  email_notifications {
    on_success = [ data.databricks_current_user.me.user_name ]
    on_failure = [ data.databricks_current_user.me.user_name ]
  }
}

output "job_url" {
  value = databricks_job.this.url
}
Create another file named job.auto.tfvars, and add the following code. This file specifies the job's properties.

job_name = "My Job"
Run terraform plan. If there are any errors, fix them, and then run the command again.

Run terraform apply.

Verify that the notebook, cluster, and job were created: in the output of the terraform apply command, find the URLs for notebook_url, cluster_url, and job_url, and go to them.

Run the job: on the Jobs page, click Run Now. After the job finishes, check your email inbox.
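If you need these URLs again later, you can read them from the Terraform state at any time by running the terraform output command, for example:

terraform output notebook_url
terraform output cluster_url
terraform output job_url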
When you are done with this sample, delete the notebook, cluster, and job from the Databricks workspace by running terraform destroy.

Verify that the notebook, cluster, and job were deleted: refresh the notebook, cluster, and Jobs pages; each should display a message that the resource cannot be found.
Next steps
Manage workspace resources for a Databricks workspace.
Troubleshooting
Note
For Terraform-specific support, see the Latest Terraform topics on the HashiCorp Discuss website. For issues specific to the Databricks Terraform Provider, see Issues in the databricks/terraform-provider-databricks GitHub repository.
Error: Failed to install provider
Issue: If you did not check in a .terraform.lock.hcl file to your version control system, and you run the terraform init command, the following message appears: Failed to install provider. Additional output may include a message similar to the following:
Error while installing databrickslabs/databricks: v1.0.0: checksum list has no SHA-256 hash for "https://github.com/databricks/terraform-provider-databricks/releases/download/v1.0.0/terraform-provider-databricks_1.0.0_darwin_amd64.zip"
Cause: Your Terraform configurations reference outdated Databricks Terraform providers.
Solution:
Replace databrickslabs/databricks with databricks/databricks in all of your .tf files.

To automate these replacements, run the following Python command from the parent folder that contains the .tf files to update:

python3 -c "$(curl -Ls https://dbricks.co/updtfns)"
Run the following Terraform command and then approve the changes when prompted:
terraform state replace-provider databrickslabs/databricks databricks/databricks
For information about this command, see Command: state replace-provider in the Terraform documentation.
Verify the changes by running the following Terraform command:
terraform init
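For reference, after these changes the required_providers block in each of your .tf files should point to the new source address, similar to the following sketch:

terraform {
  required_providers {
    databricks = {
      # Previously: source = "databrickslabs/databricks"
      source = "databricks/databricks"
    }
  }
}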
Error: Failed to query available provider packages
Issue: If you did not check in a .terraform.lock.hcl file to your version control system, and you run the terraform init command, the following message appears: Failed to query available provider packages.
Cause: Your Terraform configurations reference outdated Databricks Terraform providers.
Solution: Follow the solution instructions in Error: Failed to install provider.
Additional examples
Control access to clusters: see Enable cluster access control for your workspace and Cluster access control
Control access to jobs: see Jobs access control
Control access to pools: see Pool access control
Implement CI/CD pipelines to deploy Databricks resources using the Databricks Terraform provider
Additional resources
Databricks Provider Documentation on the Terraform Registry website
Terraform Documentation on the Terraform website
The terraform-databricks-examples repository in GitHub