Terraform CDK Databricks Provider

Note

This article covers the Cloud Development Kit for Terraform (CDKTF), which is neither provided nor supported by Databricks. To contact the provider, see the Terraform Community.

This article shows you how to use Python or TypeScript along with the Terraform CDK Databricks Provider and the Cloud Development Kit for Terraform (CDKTF). The CDKTF is a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python and TypeScript, the CDKTF supports additional languages such as Java, C#, and Go.

The Terraform CDK Databricks provider is based on the Databricks Terraform provider. For more information, see Terraform Cloud. The CDKTF is based on the AWS Cloud Development Kit (AWS CDK).

Requirements

You must have a Databricks workspace, as this article deploys resources into an existing workspace.

On your local development machine, you must have the following installed:

  • Terraform, version 1.1 or higher. To check whether you have Terraform installed, and to check the installed version, run the command terraform -v from your terminal or with PowerShell. Install Terraform, if you do not have it already installed.

    terraform -v
    
  • Node.js, version 16.13 or higher, and npm. To check whether you have Node.js and npm installed, and to check the installed versions, run the commands node -v and npm -v. The latest versions of Node.js already include npm. Install Node.js and npm by using Node Version Manager (nvm), if you do not have Node.js and npm already installed.

    node -v
    npm -v
    
  • The CDKTF CLI. To check whether you have the CDKTF CLI installed, and to check the installed version, run the command cdktf --version. Install the CDKTF CLI by using npm, if you do not have it already installed.

    cdktf --version
    

    Tip

    You can also install the CDKTF CLI on macOS with Homebrew. See Install CDKTF.

  • The appropriate language runtime tools, as follows:

    Python version 3.7 or higher and pipenv version 2021.5.29 or higher. To check whether you have Python and pipenv installed, and to check the installed versions, run the commands python --version and pipenv --version. Install Python and install pipenv, if you do not have them already installed.

    python --version
    pipenv --version
    

    TypeScript version 4.4 or higher. To check whether you have TypeScript installed, and to check the installed version, run the command tsc -v. Install TypeScript, if you do not have it already installed.

    tsc -v
    

    Install the language Prerequisites.

  • One of the following:

    • The Databricks command-line interface (Databricks CLI), configured with your Databricks personal access token by running databricks configure --token. See Set up the CLI and Set up authentication.

      Note

      As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more information, see Service principals for Databricks automation.

    • The following two environment variables:

      • DATABRICKS_HOST, set to the value of your workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com

      • DATABRICKS_TOKEN, set to the value of your Databricks personal access token.

      To set these environment variables, see your operating system’s documentation.

      Note

      As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more information, see Service principals for Databricks automation.
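
If you choose the environment-variable option, a quick check before running the CDKTF can catch a missing value early. The following is a minimal sketch, not part of the CDKTF or the Databricks CLI; the helper names missing_databricks_vars and host_looks_valid are hypothetical, and the https:// check is only a loose sanity test.

```python
import os
from typing import List, Mapping

# The two variables named in the requirements above.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def host_looks_valid(host: str) -> bool:
    """Loose sanity check: a workspace instance URL is expected to start with https://."""
    return host.startswith("https://")


if __name__ == "__main__":
    missing = missing_databricks_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    elif not host_looks_valid(os.environ["DATABRICKS_HOST"]):
        print("DATABRICKS_HOST should start with https://")
    else:
        print("Environment variables for Databricks authentication are set.")
```

The check reads only variable names and never prints the token value.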

Step 1: Create a CDKTF project

In this step, on your local development machine you set up the necessary directory structure for a CDKTF project. You then create your CDKTF project within this directory structure.

  1. Create an empty directory for your CDKTF project, and then switch to it. Run the following commands in your terminal or with PowerShell.

    In your terminal:

    mkdir cdktf-demo
    cd cdktf-demo
    
    With PowerShell:

    md cdktf-demo
    cd cdktf-demo
    
  2. Create a CDKTF project by running one of the following commands, depending on your language.

    For Python:

    cdktf init --template=python --local
    
    For TypeScript:

    cdktf init --template=typescript --local
    
  3. When prompted for a Project Name, accept the default project name of cdktf-demo by pressing Enter.

  4. When prompted for a Project Description, accept the default project description by pressing Enter.

  5. If prompted Do you want to start from an existing Terraform project, enter N and press Enter.

  6. If prompted Do you want to send crash reports to the CDKTF team, enter n and press Enter.

The CDKTF creates the following files and subdirectories in your cdktf-demo directory.

For a Python project:

  • .gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.

  • cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings.

  • help, which contains information about some next steps you can take to work with your CDKTF project.

  • main-test.py, which contains supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.

  • main.py, which contains the Python code that you write for your CDKTF project.

  • Pipfile and Pipfile.lock, which manage code dependencies for your CDKTF project.

For a TypeScript project:

  • jest.config.js and a __tests__ subdirectory, which manage supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.

  • A node_modules subdirectory, which contains code dependencies for your CDKTF project.

  • .gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.

  • cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings.

  • help, which contains information about some next steps you can take to work with your CDKTF project.

  • main.ts, which contains the TypeScript code that you write for your CDKTF project.

  • .npmrc, package.json, package-lock.json, setup.js, and tsconfig.json, which manage code dependencies and other settings for your CDKTF project.
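
To confirm that cdktf init produced the files described above, you could run a small check such as the following. This is an illustrative sketch only: the helper missing_project_files is hypothetical, and the file names are a subset of the ones listed in this step.

```python
from pathlib import Path
from typing import Dict, List

# A subset of the files listed above, keyed by the cdktf init template name.
EXPECTED_FILES: Dict[str, List[str]] = {
    "python": ["cdktf.json", "main.py", "Pipfile"],
    "typescript": ["cdktf.json", "main.ts", "package.json", "tsconfig.json"],
}


def missing_project_files(project_dir: Path, template: str) -> List[str]:
    """Return the expected files that do not exist in project_dir (hypothetical helper)."""
    return [name for name in EXPECTED_FILES[template] if not (project_dir / name).exists()]


if __name__ == "__main__":
    # Run from inside the cdktf-demo directory; change "python" to "typescript" as needed.
    missing = missing_project_files(Path.cwd(), "python")
    print("Missing files: " + ", ".join(missing) if missing else "All expected files found.")
```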

Step 2: Define resources

In this step, you use the Terraform CDK Databricks provider to define a notebook and a job to run that notebook.

  1. Install the project dependencies as follows:

    Using pipenv (for Python), install into your CDKTF project the Terraform CDK Databricks Provider to generate Databricks resources. To do this, run the following:

    pipenv install cdktf-cdktf-provider-databricks
    

    Using npm (for TypeScript), install into your CDKTF project the Terraform CDK Databricks Provider to generate Databricks resources. Also install the TypeScript definitions for Node.js package to use the Buffer class to write code into the notebook. To do this, run the following:

    npm install @cdktf/provider-databricks --force
    npm install --save-dev @types/node
    
  2. Replace the contents of the main.py file (for Python) or the main.ts file (for TypeScript) with the following code, shown first for Python and then for TypeScript. This code authenticates the CDKTF with your Databricks workspace and then generates a notebook along with a job to run the notebook. To view syntax documentation for this code, see the Terraform CDK Databricks provider construct reference for Python or TypeScript.

    #!/usr/bin/env python
    from constructs import Construct
    from cdktf import App, TerraformStack, TerraformOutput
    from cdktf_cdktf_provider_databricks import *
    import vars
    from base64 import b64encode
    
    class MyStack(TerraformStack):
      def __init__(self, scope: Construct, ns: str):
        super().__init__(scope, ns)
    
        DatabricksProvider(
          scope = self,
          id    = "databricksAuth"
        )
    
        current_user = DataDatabricksCurrentUser(
          scope     = self,
          id_       = "currentUser"
        )
    
        # Define the notebook.
        notebook = Notebook(
          scope          = self,
          id_            = "notebook",
          path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
          language       = "PYTHON",
          content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
        )
    
        # Define the job to run the notebook.
        job = Job(
          scope = self,
          id_   = "job",
          name  = f"{vars.resource_prefix}-job",
          new_cluster = JobNewCluster(
            num_workers   = vars.num_workers,
            spark_version = vars.spark_version,
            node_type_id  = vars.node_type_id
          ),
          notebook_task = JobNotebookTask(
            notebook_path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py"
          ),
          email_notifications = JobEmailNotifications(
            on_success = [ current_user.user_name ],
            on_failure = [ current_user.user_name ]
          )
        )
    
        # Output the notebook and job URLs.
        TerraformOutput(
          scope = self,
          id    = "Notebook URL",
          value = notebook.url
        )
    
        TerraformOutput(
          scope = self,
          id    = "Job URL",
          value = job.url
        )
    
    app = App()
    MyStack(app, "cdktf-python")
    
    app.synth()
    
    import { Construct } from "constructs";
    import { App, TerraformOutput, TerraformStack } from "cdktf";
    import { DatabricksProvider, DataDatabricksCurrentUser, Notebook, Job } from "@cdktf/provider-databricks";
    import * as vars from "./vars";
    
    class MyStack extends TerraformStack {
      constructor(scope: Construct, name: string) {
        super(scope, name);
    
        new DatabricksProvider(this, "databricksAuth", {});
    
        const currentUser = new DataDatabricksCurrentUser(this, "currentUser", {});
    
        // Define the notebook.
        const notebook = new Notebook(this, "notebook", {
          path: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`,
          language: "PYTHON",
          contentBase64: Buffer.from("display(spark.range(10))", "utf8").toString("base64")
        });
    
        // Define the job to run the notebook.
        const job = new Job(this, "job", {
          name: `${vars.resourcePrefix}-job`,
          newCluster: {
            numWorkers: vars.numWorkers,
            sparkVersion: vars.sparkVersion,
            nodeTypeId: vars.nodeTypeId
          },
          notebookTask: {
            notebookPath: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`
          },
          emailNotifications: {
            onSuccess: [ currentUser.userName ],
            onFailure: [ currentUser.userName ]
          }
        });
    
        // Output the notebook and job URLs.
        new TerraformOutput(this, "Notebook URL", {
          value: notebook.url
        });
    
        new TerraformOutput(this, "Job URL", {
          value: job.url
        });
      }
    }
    
    const app = new App();
    new MyStack(app, "cdktf-demo");
    app.synth();
    
  3. Create a file named vars.py (for Python) or vars.ts (for TypeScript) in the same directory as main.py (for Python) or main.ts (for TypeScript), and add the following code, shown first for Python and then for TypeScript. Replace the values with your own to specify a resource prefix and cluster settings such as the number of workers, Spark runtime version string, and node type.

    #!/usr/bin/env python
    resource_prefix = "cdktf-demo"
    num_workers     = 1
    spark_version   = "10.4.x-scala2.12"
    node_type_id    = "i3.xlarge"
    
    export const resourcePrefix = "cdktf-demo"
    export const numWorkers     = 1
    export const sparkVersion   = "10.4.x-scala2.12"
    export const nodeTypeId     = "i3.xlarge"
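
The notebook definition in this step passes the notebook source as base64-encoded text, and the same notebook path string appears in both the notebook and the job. The following minimal sketch isolates those two pieces so you can see what they produce. The helper names encode_notebook_source and notebook_path are hypothetical, and the home directory value is a made-up example; in the real stack, current_user.home is resolved by Terraform.

```python
from base64 import b64decode, b64encode


def encode_notebook_source(source: str) -> str:
    """Base64-encode notebook source text, as done for content_base64 in main.py."""
    return b64encode(source.encode("utf-8")).decode("utf-8")


def notebook_path(home: str, resource_prefix: str) -> str:
    """Build the workspace path that both the notebook and the job refer to."""
    return f"{home}/CDKTF/{resource_prefix}-notebook.py"


if __name__ == "__main__":
    encoded = encode_notebook_source("display(spark.range(10))")
    # Decoding returns the original source, which is what ends up in the notebook.
    print(b64decode(encoded).decode("utf-8"))
    # Hypothetical home directory, for illustration only.
    print(notebook_path("/Users/someone@example.com", "cdktf-demo"))
```

Building the path in one place is one way to keep the notebook and the job from drifting apart if you later rename the notebook.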
    

Step 3: Deploy the resources

In this step, you use the CDKTF CLI to deploy, into your existing Databricks workspace, the defined notebook and the job to run that notebook.

  1. Generate the Terraform code equivalent for your CDKTF project. To do this, run the cdktf synth command.

    cdktf synth
    
  2. Before making changes, you can review the pending resource changes. Run the following:

    cdktf diff
    
  3. Deploy the notebook and job by running the cdktf deploy command.

    cdktf deploy
    
  4. When prompted to Approve, press Enter. Terraform creates and deploys the notebook and job into your workspace.
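
If you want to see what cdktf synth generated before you deploy, you can inspect the synthesized Terraform JSON. The sketch below assumes that the file is written to cdktf.out/stacks/<stack name>/cdk.tf.json, which may differ between CDKTF versions, so treat the path as an assumption. The helper names and the sample data are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def resource_names(synth: dict) -> Dict[str, List[str]]:
    """Map each Terraform resource type to the sorted resource names declared under it."""
    resources = synth.get("resource", {})
    return {rtype: sorted(instances) for rtype, instances in resources.items()}


def load_synth(stack_name: str, out_dir: Path = Path("cdktf.out")) -> dict:
    """Load the synthesized JSON for a stack. The location is an assumption (see above)."""
    path = out_dir / "stacks" / stack_name / "cdk.tf.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Made-up sample in Terraform JSON syntax, used here instead of a real synth output.
    sample = {"resource": {"databricks_notebook": {"notebook": {}}, "databricks_job": {"job": {}}}}
    for rtype, names in resource_names(sample).items():
        print(rtype, names)
```

With a real project, you would call load_synth with the stack name passed to MyStack in your main file and pass the result to resource_names.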

Step 4: Interact with the resources

In this step, you run the job in your Databricks workspace, which runs the specified notebook.

  1. To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  2. To view the job that runs the notebook in your workspace, copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  3. To run the job, click the Run now button on the job page.
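
As an alternative to the Run now button, a job can also be triggered through the Databricks Jobs REST API. The following sketch only builds the request for the run-now endpoint of Jobs API 2.1 and does not send it. The helper name build_run_now_request and the host, token, and job ID values are hypothetical placeholders; the job ID is the number at the end of your Job URL.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks Databricks to run a job now."""
    url = host.rstrip("/") + "/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values for illustration only; nothing is sent over the network here.
    request = build_run_now_request("https://example.cloud.databricks.com", "placeholder-token", 123)
    print(request.get_method(), request.full_url)
    # To actually send it, you would pass the request to urllib.request.urlopen.
```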

(Optional) Step 5: Make changes to a resource

In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.

If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.

  1. In the main.py file (for Python) or the main.ts file (for TypeScript), change the notebook variable declaration from the following (shown first for Python and then for TypeScript):

        notebook = Notebook(
          scope          = self,
          id_            = "notebook",
          path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
          language       = "PYTHON",
          content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
        )
    
        const notebook = new Notebook(this, "notebook", {
          path: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`,
          language: "PYTHON",
          contentBase64: Buffer.from("display(spark.range(10))", "utf8").toString("base64")
        });
    

    To the following:

        notebook = Notebook(
          scope          = self,
          id_            = "notebook",
          path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
          language       = "PYTHON",
          content_base64 = b64encode(b'''
    data = [
      { "Category": 'A', "ID": 1, "Value": 121.44 },
      { "Category": 'B', "ID": 2, "Value": 300.01 },
      { "Category": 'C', "ID": 3, "Value": 10.99 },
      { "Category": 'E', "ID": 4, "Value": 33.87}
    ]
    
    df = spark.createDataFrame(data)
    
    display(df)
    ''').decode("UTF-8")
        )
    
        const notebook = new Notebook(this, "notebook", {
          path: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`,
          language: "PYTHON",
          contentBase64: Buffer.from(`
    data = [
    { "Category": 'A', "ID": 1, "Value": 121.44 },
    { "Category": 'B', "ID": 2, "Value": 300.01 },
    { "Category": 'C', "ID": 3, "Value": 10.99 },
    { "Category": 'E', "ID": 4, "Value": 33.87}
    ]
    
    df = spark.createDataFrame(data)
    
    display(df)`, "utf8").toString("base64")
        });
    

    Note

    Make sure that the lines of notebook code between the triple quotes (''') in Python, or between the backticks (`) in TypeScript, are flush with the left edge of your code editor. Otherwise, the extra leading whitespace becomes part of the notebook's contents and may cause the new Python code to fail to run.

  2. Regenerate the Terraform code equivalent for your CDKTF project. To do this, run the following:

    cdktf synth
    
  3. Before making changes, you can review the pending resource changes. Run the following:

    cdktf diff
    
  4. Deploy the notebook changes by running the cdktf deploy command.

    cdktf deploy
    
  5. When prompted to Approve, press Enter. Terraform changes the notebook’s contents.

  6. To view the changed notebook that the job will run in your workspace, refresh the notebook that you opened earlier, or copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  7. To view the job that runs the changed notebook in your workspace, refresh the job that you opened earlier, or copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  8. To run the job, click the Run now button on the job page.
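
The note in this step about leading whitespace matters because Python is indentation-sensitive, so any stray indentation in the embedded source ends up in the notebook. One possible way to keep the embedded source indented with the surrounding code is to strip the common indentation before encoding. The following is a sketch using the standard library's textwrap.dedent; the helper name encode_indented_source is hypothetical and is not part of the provider.

```python
from base64 import b64decode, b64encode
from textwrap import dedent


def encode_indented_source(source: str) -> str:
    """Remove common leading indentation, trim blank edges, then base64-encode."""
    cleaned = dedent(source).strip() + "\n"
    return b64encode(cleaned.encode("utf-8")).decode("utf-8")


if __name__ == "__main__":
    encoded = encode_indented_source('''
        data = [
          { "Category": 'A', "ID": 1, "Value": 121.44 },
          { "Category": 'B', "ID": 2, "Value": 300.01 }
        ]

        df = spark.createDataFrame(data)

        display(df)
    ''')
    # Decoding shows the source as the notebook would receive it, flush with the left margin.
    print(b64decode(encoded).decode("utf-8"))
```

In main.py, the result of such a helper could be passed as the content_base64 value in place of the inline b64encode call.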

Step 6: Clean up

In this step, you use the CDKTF CLI to remove the notebook and job from your Databricks workspace.

  1. Remove the resources from your workspace by running the cdktf destroy command:

    cdktf destroy
    
  2. When prompted to Approve, press Enter. Terraform removes the resources from your workspace.
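
The CDKTF CLI commands used in Steps 3 through 6 can also be run from a script. The following minimal sketch runs a list of commands in order and stops at the first one that fails. The names LIFECYCLE and run_commands are hypothetical; the commands themselves are the ones shown in this article, and the deploy and destroy commands still prompt for approval when run this way.

```python
import subprocess
from typing import Callable, List, Sequence

# The command sequences used in this article.
LIFECYCLE = {
    "deploy": [["cdktf", "synth"], ["cdktf", "diff"], ["cdktf", "deploy"]],
    "cleanup": [["cdktf", "destroy"]],
}


def run_commands(
    commands: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> List[str]:
    """Run each command in order, stopping at the first nonzero exit code.

    Returns the commands that completed successfully, as display strings.
    """
    completed: List[str] = []
    for command in commands:
        if runner(command) != 0:
            break
        completed.append(" ".join(command))
    return completed


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    def dry_run(command: Sequence[str]) -> int:
        print("would run:", " ".join(command))
        return 0

    run_commands(LIFECYCLE["deploy"], runner=dry_run)
```

To run the real commands from your project directory, call run_commands(LIFECYCLE["deploy"]) or run_commands(LIFECYCLE["cleanup"]) without the runner argument.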