Create a cluster, a notebook, and a job with the Databricks Terraform provider

This article shows how to use the Databricks Terraform provider to create a cluster, a notebook, and a job in an existing Databricks workspace.

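This article is a companion to the following Databricks getting started articles:

  • Run your first ETL workload on Databricks

  • Get started with Databricks as a data scientist

  • Run your first end-to-end analytics pipeline in the Databricks Lakehouse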

You can also adapt the Terraform configurations in this article to create custom clusters, notebooks, and jobs in your workspaces.

Requirements

  • A Databricks workspace.

  • On your local development machine, you must have:

    • The Terraform CLI. See Download Terraform on the Terraform website.

    • One of the following:

      • The Databricks command-line interface (Databricks CLI), configured with your Databricks workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com, and your Databricks personal access token, by running databricks configure --token. See Set up the CLI and Set up authentication.

        Note

        As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more information, see Service principals for Databricks automation.

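      • The following two environment variables:

        • DATABRICKS_HOST, set to your Databricks workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com.

        • DATABRICKS_TOKEN, set to your Databricks personal access token.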

        To set these environment variables, see your operating system’s documentation.

        Note

        As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more information, see Service principals for Databricks automation.
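If you use environment variables, a quick pre-flight check can catch a missing value before Terraform reports an authentication error. The following is a minimal sketch, assuming the two variables are DATABRICKS_HOST and DATABRICKS_TOKEN; the function name missing_databricks_vars is hypothetical and not part of any Databricks tool.

```python
import os

# Assumed names of the two variables read for token-based authentication.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_databricks_vars()
    if missing:
        print("Set these before running Terraform:", ", ".join(missing))
    else:
        print("Both variables are set.")
```

The check only confirms that the values are present. It does not verify that the token is valid for the workspace.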

Step 1: Set up the Terraform project

In this step, you set up a Terraform project to define the settings for Terraform to authenticate with your workspace. You also define the settings for the resources that Terraform deploys to your workspace.

  1. Create an empty directory and then switch to it. This directory contains your Terraform project files. (Each separate set of Terraform project files must be in its own parent directory.) To do this, in your terminal or PowerShell, run a command like the following:

    mkdir terraform_cluster_notebook_job && cd terraform_cluster_notebook_job
    
  2. In this empty directory, create a file named auth.tf, and add the following content to the file. This configuration initializes the Databricks Terraform provider and authenticates Terraform with your workspace.

    To authenticate with a Databricks CLI configuration profile, add the following content:

    variable "databricks_connection_profile" {
      description = "The name of the Databricks connection profile to use."
      type        = string
    }
    
    # Initialize the Databricks Terraform provider.
    terraform {
      required_providers {
        databricks = {
          source = "databricks/databricks"
        }
      }
    }
    
    # Use Databricks CLI authentication.
    provider "databricks" {
      profile = var.databricks_connection_profile
    }
    
    # Retrieve information about the current user.
    data "databricks_current_user" "me" {}
    

    To authenticate with environment variables, add the following content instead:

    # Initialize the Databricks Terraform provider.
    terraform {
      required_providers {
        databricks = {
          source = "databricks/databricks"
        }
      }
    }
    
    # Use environment variables for authentication.
    provider "databricks" {}
    
    # Retrieve information about the current user.
    data "databricks_current_user" "me" {}
    
  3. Create another file named auth.auto.tfvars, and add the following content to the file. This file contains variable values for authenticating Terraform with your workspace. Replace the placeholder values with your own values.

    To authenticate with a Databricks CLI configuration profile, add the following content:

    databricks_connection_profile = "DEFAULT"
    

    To authenticate with environment variables, you do not need an auth.auto.tfvars file.

  4. Run the terraform init command. This command initializes your Terraform project by creating additional helper files and downloading the necessary Terraform providers.

    terraform init
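One way to confirm that initialization finished is to look for the helper files it leaves behind. The sketch below assumes that terraform init creates a .terraform directory and a .terraform.lock.hcl file in the project directory; the function name looks_initialized is hypothetical.

```python
from pathlib import Path


def looks_initialized(project_dir="."):
    """Return True if `terraform init` appears to have run in project_dir."""
    root = Path(project_dir)
    # .terraform holds the downloaded providers; .terraform.lock.hcl records
    # the provider versions that were selected.
    return (root / ".terraform").is_dir() and (root / ".terraform.lock.hcl").is_file()
```

For example, looks_initialized("terraform_cluster_notebook_job") returns False for a freshly created project directory and True after a successful terraform init in that directory.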
    
  5. If you are creating a cluster, create another file named cluster.tf, and add the following content to the file. This content creates a cluster with the smallest amount of resources allowed. This cluster uses the latest Databricks Runtime Long Term Support (LTS) version.

    For a cluster that works with Unity Catalog:

    variable "cluster_name" {}
    variable "cluster_autotermination_minutes" {}
    variable "cluster_num_workers" {}
    variable "cluster_data_security_mode" {}
    
    # Create the cluster with the "smallest" amount
    # of resources allowed.
    data "databricks_node_type" "smallest" {
      local_disk = true
    }
    
    # Use the latest Databricks Runtime
    # Long Term Support (LTS) version.
    data "databricks_spark_version" "latest_lts" {
      long_term_support = true
    }
    
    resource "databricks_cluster" "this" {
      cluster_name            = var.cluster_name
      node_type_id            = data.databricks_node_type.smallest.id
      spark_version           = data.databricks_spark_version.latest_lts.id
      autotermination_minutes = var.cluster_autotermination_minutes
      num_workers             = var.cluster_num_workers
      data_security_mode      = var.cluster_data_security_mode
    }
    
    output "cluster_url" {
      value = databricks_cluster.this.url
    }
    

    For an all-purpose cluster:

    variable "cluster_name" {
      description = "A name for the cluster."
      type        = string
      default     = "My Cluster"
    }
    
    variable "cluster_autotermination_minutes" {
      description = "How many minutes before automatically terminating due to inactivity."
      type        = number
      default     = 60
    }
    
    variable "cluster_num_workers" {
      description = "The number of workers."
      type        = number
      default     = 1
    }
    
    # Create the cluster with the "smallest" amount
    # of resources allowed.
    data "databricks_node_type" "smallest" {
      local_disk = true
    }
    
    # Use the latest Databricks Runtime
    # Long Term Support (LTS) version.
    data "databricks_spark_version" "latest_lts" {
      long_term_support = true
    }
    
    resource "databricks_cluster" "this" {
      cluster_name            = var.cluster_name
      node_type_id            = data.databricks_node_type.smallest.id
      spark_version           = data.databricks_spark_version.latest_lts.id
      autotermination_minutes = var.cluster_autotermination_minutes
      num_workers             = var.cluster_num_workers
    }
    
    output "cluster_url" {
      value = databricks_cluster.this.url
    }
    
  6. If you are creating the cluster, create another file named cluster.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.

    For a cluster that works with Unity Catalog:

    cluster_name                    = "My Cluster"
    cluster_autotermination_minutes = 60
    cluster_num_workers             = 1
    cluster_data_security_mode      = "SINGLE_USER"
    

    For an all-purpose cluster:

    cluster_name                    = "My Cluster"
    cluster_autotermination_minutes = 60
    cluster_num_workers             = 1
    
  7. If you are creating a notebook, create another file named notebook.tf, and add the following content to the file:

    variable "notebook_subdirectory" {
      description = "A name for the subdirectory to store the notebook."
      type        = string
      default     = "Terraform"
    }
    
    variable "notebook_filename" {
      description = "The notebook's filename."
      type        = string
    }
    
    variable "notebook_language" {
      description = "The language of the notebook."
      type        = string
    }
    
    resource "databricks_notebook" "this" {
      path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
      language = var.notebook_language
      source   = "./${var.notebook_filename}"
    }
    
    output "notebook_url" {
      value = databricks_notebook.this.url
    }
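To see which workspace path the interpolated path argument produces, the same string composition can be reproduced outside Terraform. This is a minimal sketch; the user home value /Users/someone@example.com is a hypothetical example of what databricks_current_user returns, and notebook_workspace_path is an illustrative helper name.

```python
def notebook_workspace_path(user_home, subdirectory, filename):
    """Compose the path exactly as the interpolated string in notebook.tf does."""
    return f"{user_home}/{subdirectory}/{filename}"


# Hypothetical user home; the other two values mirror the variables above.
print(notebook_workspace_path(
    "/Users/someone@example.com",
    "Terraform",
    "notebook-getting-started-etl-quick-start.py",
))
# prints /Users/someone@example.com/Terraform/notebook-getting-started-etl-quick-start.py
```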
    
  8. Save the following notebook code to a file in the same directory as the notebook.tf file:

    For the Python notebook for Run your first ETL workload on Databricks, a file named notebook-getting-started-etl-quick-start.py with the following contents:

    # Databricks notebook source
    # Import functions
    from pyspark.sql.functions import input_file_name, current_timestamp
    
    # Define variables used in code below
    file_path = "/databricks-datasets/structured-streaming/events"
    username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
    table_name = f"{username}_etl_quickstart"
    checkpoint_path = f"/tmp/{username}/_checkpoint/etl_quickstart"
    
    # Clear out data from previous demo execution
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    dbutils.fs.rm(checkpoint_path, True)
    
    # Configure Auto Loader to ingest JSON data to a Delta table
    (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .load(file_path)
      .select("*", input_file_name().alias("source_file"), current_timestamp().alias("processing_time"))
      .writeStream
      .option("checkpointLocation", checkpoint_path)
      .trigger(availableNow=True)
      .toTable(table_name))
    
    # COMMAND ----------
    
    df = spark.read.table(table_name)
    
    # COMMAND ----------
    
    display(df)
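The notebook derives its table name and checkpoint path from the current user's name after replacing every character that is not a letter or digit. The effect of that regexp_replace call can be sketched in plain Python with the same pattern; the user name below is hypothetical and sanitize_username is an illustrative helper.

```python
import re


def sanitize_username(user_name):
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^a-zA-Z0-9]", "_", user_name)


user = sanitize_username("someone@example.com")  # hypothetical user name
print(f"{user}_etl_quickstart")
# prints someone_example_com_etl_quickstart
print(f"/tmp/{user}/_checkpoint/etl_quickstart")
# prints /tmp/someone_example_com/_checkpoint/etl_quickstart
```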
    

    For the SQL notebook for Get started with Databricks as a data scientist, a file named notebook-getting-started-quick-start.sql with the following contents:

    -- Databricks notebook source
    -- MAGIC %python
    -- MAGIC diamonds = (spark.read
    -- MAGIC   .format("csv")
    -- MAGIC   .option("header", "true")
    -- MAGIC   .option("inferSchema", "true")
    -- MAGIC   .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
    -- MAGIC )
    -- MAGIC 
    -- MAGIC diamonds.write.format("delta").save("/mnt/delta/diamonds")
    
    -- COMMAND ----------
    
    DROP TABLE IF EXISTS diamonds;
    
    CREATE TABLE diamonds USING DELTA LOCATION '/mnt/delta/diamonds/'
    
    -- COMMAND ----------
    
    SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY color
    

    For the Python notebook for Run your first end-to-end analytics pipeline in the Databricks Lakehouse, a file named notebook-getting-started-lakehouse-e2e.py with the following contents:

    # Databricks notebook source
    external_location = "<your_external_location>"
    catalog = "<your_catalog>"
    
    dbutils.fs.put(f"{external_location}/foobar.txt", "Hello world!", True)
    display(dbutils.fs.head(f"{external_location}/foobar.txt"))
    dbutils.fs.rm(f"{external_location}/foobar.txt")
    
    display(spark.sql(f"SHOW SCHEMAS IN {catalog}"))
    
    # COMMAND ----------
    
    from pyspark.sql.functions import col
    
    # Set parameters for isolation in workspace and reset demo
    username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
    database = f"{catalog}.e2e_lakehouse_{username}_db"
    source = f"{external_location}/e2e-lakehouse-source"
    table = f"{database}.target_table"
    checkpoint_path = f"{external_location}/_checkpoint/e2e-lakehouse-demo"
    
    spark.sql(f"SET c.username='{username}'")
    spark.sql(f"SET c.database={database}")
    spark.sql(f"SET c.source='{source}'")
    
    spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")
    spark.sql("CREATE DATABASE ${c.database}")
    spark.sql("USE ${c.database}")
    
    # Clear out data from previous demo execution
    dbutils.fs.rm(source, True)
    dbutils.fs.rm(checkpoint_path, True)
    
    
    # Define a class to load batches of data to source
    class LoadData:
    
      def __init__(self, source):
        self.source = source
    
      def get_date(self):
        try:
          df = spark.read.format("json").load(self.source)
        except Exception:
          # Nothing has been landed yet, so start from the first day.
          return "2016-01-01"
        batch_date = df.selectExpr("max(distinct(date(tpep_pickup_datetime))) + 1 day").first()[0]
        if batch_date.month == 3:
          raise Exception("Source data exhausted")
        return batch_date
    
      def get_batch(self, batch_date):
        return (
          spark.table("samples.nyctaxi.trips")
            .filter(col("tpep_pickup_datetime").cast("date") == batch_date)
        )
    
      def write_batch(self, batch):
        batch.write.format("json").mode("append").save(self.source)
    
      def land_batch(self):
        batch_date = self.get_date()
        batch = self.get_batch(batch_date)
        self.write_batch(batch)
    
    RawData = LoadData(source)
    
    # COMMAND ----------
    
    RawData.land_batch()
    
    # COMMAND ----------
    
    # Import functions
    from pyspark.sql.functions import input_file_name, current_timestamp
    
    # Configure Auto Loader to ingest JSON data to a Delta table
    (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .load(source)
      .select("*", input_file_name().alias("source_file"), current_timestamp().alias("processing_time"))
      .writeStream
      .option("checkpointLocation", checkpoint_path)
      .trigger(availableNow=True)
      .option("mergeSchema", "true")
      .toTable(table))
    
    # COMMAND ----------
    
    df = spark.read.table(table)
    
    # COMMAND ----------
    
    display(df)
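The date selection that LoadData.get_date is meant to perform can be illustrated without Spark. The sketch below is an approximation under stated assumptions: it takes the already-landed pickup dates as a plain list instead of reading JSON files, and next_batch_date is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta


def next_batch_date(landed_dates):
    """Return the next pickup date to land, given the dates already landed."""
    if not landed_dates:
        # Nothing has been landed yet, so start from the first day.
        return date(2016, 1, 1)
    batch_date = max(landed_dates) + timedelta(days=1)
    if batch_date.month == 3:
        # Stop once the next batch would fall in March.
        raise Exception("Source data exhausted")
    return batch_date


print(next_batch_date([]))
# prints 2016-01-01
print(next_batch_date([date(2016, 1, 1), date(2016, 1, 2)]))
# prints 2016-01-03
```

Each call to land_batch therefore advances the source by one day of trips until the end of February is reached.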
    
  9. If you are creating the notebook, create another file named notebook.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the notebook configuration.

    For the Python notebook for Run your first ETL workload on Databricks:

    notebook_subdirectory = "Terraform"
    notebook_filename     = "notebook-getting-started-etl-quick-start.py"
    notebook_language     = "PYTHON"
    

    For the SQL notebook for Get started with Databricks as a data scientist:

    notebook_subdirectory = "Terraform"
    notebook_filename     = "notebook-getting-started-quick-start.sql"
    notebook_language     = "SQL"
    

    For the Python notebook for Run your first end-to-end analytics pipeline in the Databricks Lakehouse:

    notebook_subdirectory = "Terraform"
    notebook_filename     = "notebook-getting-started-lakehouse-e2e.py"
    notebook_language     = "PYTHON"
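Because notebook.tf reads the notebook from a local file named by notebook_filename, a mistyped file name or a mismatched language only surfaces when Terraform runs. The following sketch checks both up front. The extension-to-language mapping is an assumption for illustration, and notebook_problems is a hypothetical helper.

```python
from pathlib import Path

# Assumed mapping from file extension to the notebook_language value.
LANGUAGE_BY_EXTENSION = {".py": "PYTHON", ".sql": "SQL", ".scala": "SCALA", ".r": "R"}


def notebook_problems(project_dir, filename, language):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    source = Path(project_dir) / filename
    if not source.is_file():
        problems.append(f"source file not found: {source}")
    expected = LANGUAGE_BY_EXTENSION.get(source.suffix.lower())
    if expected != language:
        problems.append(
            f"language {language!r} does not match extension {source.suffix!r}"
        )
    return problems
```

For example, notebook_problems(".", "notebook-getting-started-etl-quick-start.py", "PYTHON") returns an empty list when the file exists in the current directory.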
    
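  10. If you are creating a notebook, in your Databricks workspace, be sure to set up any requirements for the notebook to run successfully, by referring to the instructions in the getting started article that corresponds to the notebook:

    • Run your first ETL workload on Databricks

    • Get started with Databricks as a data scientist

    • Run your first end-to-end analytics pipeline in the Databricks Lakehouse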

  11. If you are creating a job, create another file named job.tf, and add the following content to the file. This content creates a job to run the notebook.

    variable "job_name" {
      description = "A name for the job."
      type        = string
      default     = "My Job"
    }
    
    resource "databricks_job" "this" {
      name = var.job_name
      existing_cluster_id = databricks_cluster.this.cluster_id
      notebook_task {
        notebook_path = databricks_notebook.this.path
      }
      email_notifications {
        on_success = [ data.databricks_current_user.me.user_name ]
        on_failure = [ data.databricks_current_user.me.user_name ]
      }
    }
    
    output "job_url" {
      value = databricks_job.this.url
    }
    
  12. If you are creating the job, create another file named job.auto.tfvars, and add the following content to the file. This file contains a variable value for customizing the job configuration.

    job_name = "My Job"
    

Step 2: Run the configurations

In this step, you run the Terraform configurations to deploy the cluster, the notebook, and the job into your Databricks workspace.

  1. Check to see whether your Terraform configurations are valid by running the terraform validate command. If any errors are reported, fix them, and run the command again.

    terraform validate
    
  2. Check to see what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.

    terraform plan
    
  3. Deploy the cluster, the notebook, and the job into your workspace by running the terraform apply command. When prompted to deploy, type yes and press Enter.

    terraform apply
    

    Terraform deploys the resources that are specified in your project. Deploying these resources (especially a cluster) can take several minutes.
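After the deployment finishes, the output values can also be read programmatically instead of being copied from the terminal. The sketch below assumes the JSON shape printed by terraform output -json, in which each output name maps to an object with a value field. The helper names and the sample URL are hypothetical.

```python
import json
import subprocess


def parse_outputs(output_json):
    """Map each output name to its value from `terraform output -json` text."""
    return {name: entry["value"] for name, entry in json.loads(output_json).items()}


def read_outputs(project_dir="."):
    """Run `terraform output -json` in project_dir and return the parsed values."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=project_dir, check=True, capture_output=True, text=True,
    )
    return parse_outputs(result.stdout)


# Hypothetical sample of the JSON text, used here instead of a live call.
sample = (
    '{"job_url": {"sensitive": false, "type": "string", '
    '"value": "https://dbc-1234567890123456.cloud.databricks.com/#job/123"}}'
)
print(parse_outputs(sample)["job_url"])
# prints https://dbc-1234567890123456.cloud.databricks.com/#job/123
```

Calling read_outputs() in the project directory would return only the outputs for the resources you actually defined, such as cluster_url, notebook_url, and job_url.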

Step 3: Explore the results

  1. If you created a cluster, in the output of the terraform apply command, copy the link next to cluster_url, and paste it into your web browser’s address bar.

  2. If you created a notebook, in the output of the terraform apply command, copy the link next to notebook_url, and paste it into your web browser’s address bar.

    Note

    Before you use the notebook, you might need to customize its contents. See the links at the beginning of this article for related documentation about how to customize the notebook.

  3. If you created a job, in the output of the terraform apply command, copy the link next to job_url, and paste it into your web browser’s address bar.

    Note

    Before you run the notebook, you might need to customize its contents. See the links at the beginning of this article for related documentation about how to customize the notebook.

  4. If you created a job, run the job as follows:

    1. Click Run now on the job page.

    2. After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code.

Step 4: Clean up

In this step, you delete the preceding resources from your workspace.

  1. Check to see what Terraform will delete from your workspace, before Terraform actually does it, by running the terraform plan command with the -destroy option.

    terraform plan -destroy
    
  2. Delete the cluster, the notebook, and the job from your workspace by running the terraform destroy command. When prompted to delete, type yes and press Enter.

    terraform destroy
    

    Terraform deletes the resources that are specified in your project.
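To confirm that nothing is left under Terraform's management after the cleanup, you can inspect the project state. The sketch below assumes that terraform state list prints one resource address per line and prints nothing when the state is empty. The helper names are hypothetical.

```python
import subprocess


def remaining_resources(state_list_output):
    """Return the resource addresses found in `terraform state list` output."""
    return [line.strip() for line in state_list_output.splitlines() if line.strip()]


def list_state(project_dir="."):
    """Run `terraform state list` in project_dir and return the addresses."""
    result = subprocess.run(
        ["terraform", "state", "list"],
        cwd=project_dir, check=True, capture_output=True, text=True,
    )
    return remaining_resources(result.stdout)
```

An empty list from list_state() after terraform destroy suggests that the cluster, the notebook, and the job are no longer tracked by the project.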