Terraform CDK Databricks Provider
Note
This article covers the Cloud Development Kit for Terraform (CDKTF), which is neither provided nor supported by Databricks. To get help with the CDKTF, see the Terraform Community.
This article shows you how to use Python or TypeScript along with the Terraform CDK Databricks Provider and the Cloud Development Kit for Terraform (CDKTF). The CDKTF is a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python and TypeScript, the CDKTF supports additional languages such as Java, C#, and Go.
The Terraform CDK Databricks provider is based on the Databricks Terraform provider. For more information, see Databricks Terraform provider. The CDKTF is based on the AWS Cloud Development Kit (AWS CDK).
Requirements
You must have a Databricks workspace, as this article deploys resources into an existing workspace.
On your local development machine, you must have the following installed:
Terraform, version 1.1 or higher. To check whether you have Terraform installed, and to check the installed version, run the following command from your terminal or with PowerShell. Install Terraform, if you do not have it already installed.

terraform -v
Node.js, version 16.13 or higher, and npm. To check whether you have Node.js and npm installed, and to check the installed versions, run the following commands. The latest versions of Node.js already include npm. Install Node.js and npm by using Node Version Manager (nvm), if you do not have Node.js and npm already installed.

node -v
npm -v
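For example, with nvm already installed, the following commands install and switch to the latest long-term support (LTS) release of Node.js, which includes npm. (This is a convenience example; see the nvm documentation for other installation options.)

nvm install --lts
nvm use --lts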
The CDKTF CLI. To check whether you have the CDKTF CLI installed, and to check the installed version, run the following command. Install the CDKTF CLI by using npm, if you do not have it already installed.

cdktf --version
Tip
You can also install the CDKTF CLI on macOS with Homebrew. See Install CDKTF.
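For example, the following npm command installs the CDKTF CLI globally. (The @latest tag fetches the newest published release; pin a specific version instead if your project requires one.)

npm install --global cdktf-cli@latest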
The appropriate language runtime tools, as follows:
Python version 3.7 or higher and pipenv version 2021.5.29 or higher. To check whether you have Python and pipenv installed, and to check the installed versions, run the following commands. Install Python and install pipenv, if you do not have them already installed.

python --version
pipenv --version
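For example, with Python already installed, the following command installs pipenv for the current user:

pip install --user pipenv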
TypeScript version 4.4 or higher. To check whether you have TypeScript installed, and to check the installed version, run the following command. Install TypeScript, if you do not have it already installed.

tsc -v
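For example, the following npm command installs the TypeScript compiler globally:

npm install --global typescript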
Install the remaining prerequisites for your chosen language. See Prerequisites in the CDKTF documentation.
One of the following:
The Databricks command-line interface (Databricks CLI), configured with your Databricks personal access token by running the following command. See Set up the CLI and Set up authentication.

databricks configure --token

Note

As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage personal access tokens for a service principal.
The following Databricks environment variables:

DATABRICKS_HOST, set to the value of your workspace instance URL, for example https://dbc-1234567890123456.cloud.databricks.com
DATABRICKS_CLIENT_ID, set to the value of the client ID, also known as the application ID, of the service principal. See Authentication using OAuth tokens for service principals.
DATABRICKS_CLIENT_SECRET, set to the value of the client secret of the service principal. See Authentication using OAuth tokens for service principals.

Alternatively, you can use a personal access token instead of a service principal's client ID and client secret:

DATABRICKS_TOKEN, set to the value of your Databricks personal access token. See also Manage personal access tokens.

To set these environment variables, see your operating system's documentation.
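For example, in a bash shell you might set the service principal variables as follows. (The values shown are placeholders; substitute your own workspace URL and credentials. In PowerShell, use $Env:DATABRICKS_HOST = "..." and so on.)

export DATABRICKS_HOST="https://dbc-1234567890123456.cloud.databricks.com"
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-client-secret>"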
Step 1: Create a CDKTF project
In this step, you set up the necessary directory structure for a CDKTF project on your local development machine. You then create your CDKTF project within this directory structure.
Create an empty directory for your CDKTF project, and then switch to it. Run the following commands in your terminal (for Unix, Linux, and macOS):

mkdir cdktf-demo
cd cdktf-demo

Or with PowerShell (for Windows):

md cdktf-demo
cd cdktf-demo
Create a CDKTF project by running one of the following commands.

For Python:

cdktf init --template=python --local

For TypeScript:

cdktf init --template=typescript --local
When prompted for a Project Name, accept the default project name of cdktf-demo by pressing Enter.

When prompted for a Project Description, accept the default project description by pressing Enter.

If prompted Do you want to start from an existing Terraform project, enter N and press Enter.

If prompted Do you want to send crash reports to the CDKTF team, enter n and press Enter.
The CDKTF creates the following files and subdirectories in your cdktf-demo directory.

For Python:

.gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.
cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings, and see the sketch after this list for an example.
help, which contains information about some next steps you can take to work with your CDKTF project.
main-test.py, which contains supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.
main.py, which contains the Python code that you write for your CDKTF project.
Pipfile and Pipfile.lock, which manage code dependencies for your CDKTF project.

For TypeScript:

jest.config.js and a __tests__ subdirectory, which manage supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.
A node_modules subdirectory, which contains code dependencies for your CDKTF project.
.gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.
cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings.
help, which contains information about some next steps you can take to work with your CDKTF project.
main.ts, which contains the TypeScript code that you write for your CDKTF project.
.npmrc, package.json, package-lock.json, setup.js, and tsconfig.json, which manage code dependencies and other settings for your CDKTF project.
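For reference, a cdktf.json generated from the Python template looks roughly like the following. (This is an illustrative sketch; the exact keys and values vary by CDKTF version, and projectId is generated per project.)

{
  "language": "python",
  "app": "pipenv run python main.py",
  "projectId": "<generated-project-id>",
  "sendCrashReports": "false",
  "terraformProviders": [],
  "terraformModules": [],
  "context": {}
}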
Step 2: Define resources
In this step, you use the Terraform CDK Databricks provider to define a notebook and a job to run that notebook.
Install the project dependencies as follows:
For Python: using pipenv, install into your CDKTF project the Terraform CDK Databricks Provider to generate Databricks resources. To do this, run the following:

pipenv install cdktf-cdktf-provider-databricks
For TypeScript: using npm, install into your CDKTF project the Terraform CDK Databricks Provider to generate Databricks resources. Also install the TypeScript definitions for Node.js package to use the Buffer class to write code into the notebook. To do this, run the following:

npm install @cdktf/provider-databricks --force
npm install --save-dev @types/node
Replace the contents of the main.py file (for Python) or the main.ts file (for TypeScript) with the following code. This code authenticates the CDKTF with your Databricks workspace and then generates a notebook along with a job to run the notebook. To view syntax documentation for this code, see the Terraform CDK Databricks provider construct reference for Python or TypeScript.

For Python:

#!/usr/bin/env python
from constructs import Construct
from cdktf import App, TerraformStack, TerraformOutput
from cdktf_cdktf_provider_databricks import *
import vars
from base64 import b64encode

class MyStack(TerraformStack):
  def __init__(self, scope: Construct, ns: str):
    super().__init__(scope, ns)

    DatabricksProvider(
      scope = self,
      id    = "databricksAuth"
    )

    current_user = DataDatabricksCurrentUser(
      scope = self,
      id_   = "currentUser"
    )

    # Define the notebook.
    notebook = Notebook(
      scope          = self,
      id_            = "notebook",
      path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
      language       = "PYTHON",
      content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    )

    # Define the job to run the notebook.
    job = Job(
      scope = self,
      id_   = "job",
      name  = f"{vars.resource_prefix}-job",
      new_cluster = JobNewCluster(
        num_workers   = vars.num_workers,
        spark_version = vars.spark_version,
        node_type_id  = vars.node_type_id
      ),
      notebook_task = JobNotebookTask(
        notebook_path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py"
      ),
      email_notifications = JobEmailNotifications(
        on_success = [ current_user.user_name ],
        on_failure = [ current_user.user_name ]
      )
    )

    # Output the notebook and job URLs.
    TerraformOutput(
      scope = self,
      id    = "Notebook URL",
      value = notebook.url
    )
    TerraformOutput(
      scope = self,
      id    = "Job URL",
      value = job.url
    )

app = App()
MyStack(app, "cdktf-python")
app.synth()
For TypeScript:

import { Construct } from "constructs";
import { App, TerraformOutput, TerraformStack } from "cdktf";
import {
  DatabricksProvider,
  DataDatabricksCurrentUser,
  Notebook,
  Job
} from "@cdktf/provider-databricks";
import * as vars from "./vars";

class MyStack extends TerraformStack {
  constructor(scope: Construct, name: string) {
    super(scope, name);

    new DatabricksProvider(this, "databricksAuth", {});

    const currentUser = new DataDatabricksCurrentUser(this, "currentUser", {});

    // Define the notebook.
    const notebook = new Notebook(this, "notebook", {
      path: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`,
      language: "PYTHON",
      contentBase64: Buffer.from("display(spark.range(10))", "utf8").toString("base64")
    });

    // Define the job to run the notebook.
    const job = new Job(this, "job", {
      name: `${vars.resourcePrefix}-job`,
      newCluster: {
        numWorkers: vars.numWorkers,
        sparkVersion: vars.sparkVersion,
        nodeTypeId: vars.nodeTypeId
      },
      notebookTask: {
        notebookPath: `${currentUser.home}/CDKTF/${vars.resourcePrefix}-notebook.py`
      },
      emailNotifications: {
        onSuccess: [ currentUser.userName ],
        onFailure: [ currentUser.userName ]
      }
    });

    // Output the notebook and job URLs.
    new TerraformOutput(this, "Notebook URL", {
      value: notebook.url
    });
    new TerraformOutput(this, "Job URL", {
      value: job.url
    });
  }
}

const app = new App();
new MyStack(app, "cdktf-demo");
app.synth();
Create a file named vars.py (for Python) or vars.ts (for TypeScript) in the same directory as main.py (for Python) or main.ts (for TypeScript). Replace the following values with your own values to specify a resource prefix and cluster settings such as the number of workers, the Spark runtime version string, and the node type.

For Python (vars.py):

#!/usr/bin/env python
resource_prefix = "cdktf-demo"
num_workers = 1
spark_version = "10.4.x-scala2.12"
node_type_id = "i3.xlarge"

For TypeScript (vars.ts):

export const resourcePrefix = "cdktf-demo"
export const numWorkers = 1
export const sparkVersion = "10.4.x-scala2.12"
export const nodeTypeId = "i3.xlarge"
Step 3: Deploy the resources
In this step, you use the CDKTF CLI to deploy the defined notebook, along with the job to run that notebook, into your existing Databricks workspace.
Generate the Terraform code equivalent for your CDKTF project. To do this, run the following command:

cdktf synth
Before you deploy, you can review the pending resource changes. Run the following:
cdktf diff
Deploy the notebook and job into your workspace by running the following command:

cdktf deploy
When prompted to Approve, press Enter. Terraform creates and deploys the notebook and job into your workspace.
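Tip

If you need the notebook and job URLs again later, for example after closing your terminal, you can reprint the stack outputs by running the following command:

cdktf output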
Step 4: Interact with the resources
In this step, you run the job in your Databricks workspace, which runs the specified notebook.
To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To view the job that runs the notebook in your workspace, copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To run the job, click the Run now button on the job page.
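Alternatively, you can trigger the job from the command line with the Databricks CLI. (This is an optional sketch; replace the placeholder 123 with the job ID shown on the job page.)

databricks jobs run-now --job-id 123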
(Optional) Step 5: Make changes to a resource
In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.
If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.
In the main.py file (for Python) or the main.ts file (for TypeScript), change the notebook variable declaration from the following.

For Python:

notebook = Notebook(
  scope          = self,
  id_            = "notebook",
  path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
  language       = "PYTHON",
  content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
)
For TypeScript:

const notebook = new Notebook(this, "notebook", {
  path: currentUser.home + "/CDKTF/" + vars.resourcePrefix + "-notebook.py",
  language: "PYTHON",
  contentBase64: Buffer.from("display(spark.range(10))", "utf8").toString("base64")
});
To the following:
For Python:

notebook = Notebook(
  scope          = self,
  id_            = "notebook",
  path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
  language       = "PYTHON",
  content_base64 = b64encode(b'''
data = [
  { "Category": 'A', "ID": 1, "Value": 121.44 },
  { "Category": 'B', "ID": 2, "Value": 300.01 },
  { "Category": 'C', "ID": 3, "Value": 10.99 },
  { "Category": 'E', "ID": 4, "Value": 33.87}
]

df = spark.createDataFrame(data)

display(df)
''').decode("UTF-8")
)
For TypeScript:

const notebook = new Notebook(this, "notebook", {
  path: currentUser.home + "/CDKTF/" + vars.resourcePrefix + "-notebook.py",
  language: "PYTHON",
  contentBase64: Buffer.from(`
data = [
  { "Category": 'A', "ID": 1, "Value": 121.44 },
  { "Category": 'B', "ID": 2, "Value": 300.01 },
  { "Category": 'C', "ID": 3, "Value": 10.99 },
  { "Category": 'E', "ID": 4, "Value": 33.87}
]

df = spark.createDataFrame(data)

display(df)`, "utf8").toString("base64")
});
Note
Make sure that the lines of code between the backticks (`) (for TypeScript) or the triple quotes (''') (for Python) are flush with the edge of your code editor. Otherwise, Terraform will insert whitespace into the notebook that may cause the new Python code to fail to run.
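For the Python version, if you prefer to keep the embedded notebook source indented with the surrounding code, one option is to strip the shared leading whitespace with the standard library's textwrap.dedent before encoding. This is an editorial sketch, not part of the generated project:

from base64 import b64encode
from textwrap import dedent

# dedent removes the whitespace that the surrounding indentation adds,
# so the notebook source arrives in the workspace flush-left.
notebook_source = dedent('''\
    data = [
      { "Category": 'A', "ID": 1, "Value": 121.44 },
      { "Category": 'B', "ID": 2, "Value": 300.01 },
      { "Category": 'C', "ID": 3, "Value": 10.99 },
      { "Category": 'E', "ID": 4, "Value": 33.87}
    ]

    df = spark.createDataFrame(data)

    display(df)
''')

content_base64 = b64encode(notebook_source.encode("UTF-8")).decode("UTF-8")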
Regenerate the Terraform code equivalent for your CDKTF project. To do this, run the following:
cdktf synth
Before you deploy, you can review the pending resource changes. Run the following:
cdktf diff
Deploy the notebook changes by running the following command:

cdktf deploy
When prompted to Approve, press Enter. Terraform changes the notebook’s contents.
To view the changed notebook that the job will run in your workspace, refresh the notebook that you opened earlier, or copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To view the job that runs the changed notebook in your workspace, refresh the job that you opened earlier, or copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To run the job, click the Run now button on the job page.
Step 6: Clean up
In this step, you use the CDKTF CLI to remove the notebook and job from your Databricks workspace.
Remove the resources from your workspace by running the following command:

cdktf destroy
When prompted to Approve, press Enter. Terraform removes the resources from your workspace.