Terraform CDK Databricks Provider
Note
This article covers the Cloud Development Kit for Terraform (CDKTF), which is neither provided nor supported by Databricks. To contact the provider's maintainers, see the Terraform Community.
This article shows you how to use Python along with the Terraform CDK Databricks Provider and the Cloud Development Kit for Terraform (CDKTF). The CDKTF is a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python, the CDKTF supports additional languages such as TypeScript, Java, C#, and Go.
The Terraform CDK Databricks provider is based on the Databricks Terraform provider. For more information, see the Databricks Terraform provider documentation. The CDKTF is based on the AWS Cloud Development Kit (AWS CDK).
Requirements
You must have a Databricks workspace, as this article deploys resources into an existing workspace.
On your local development machine, you must have the following installed:
Terraform, version 1.1 or higher. To check whether you have Terraform installed, and to check the installed version, run the following command from your terminal or with PowerShell:

terraform -v

Install Terraform if you do not have it already installed.
Node.js, version 16.13 or higher, and npm. To check whether you have Node.js and npm installed, and to check the installed versions, run the following commands:

node -v
npm -v

The latest versions of Node.js already include npm. Install Node.js and npm by using Node Version Manager (nvm) if you do not have Node.js and npm already installed.
The CDKTF CLI. To check whether you have the CDKTF CLI installed, and to check the installed version, run the following command:

cdktf --version

Install the CDKTF CLI by using npm if you do not have it already installed.
Tip
You can also install the CDKTF CLI on macOS with Homebrew. See Install CDKTF.
Python, version 3.7 or higher, and pipenv, version 2021.5.29 or higher. To check whether you have Python and pipenv installed, and to check the installed versions, run the following commands:

python --version
pipenv --version

Install Python and install pipenv if they are not already installed.
Databricks authentication configured for the supported authentication type that you want to use. See Authentication in the Databricks Terraform provider documentation.
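For example, one commonly used option is token-based authentication through environment variables, which the Databricks Terraform provider reads at run time. The following is a sketch only; the workspace URL and token value are placeholders, not real credentials:

```shell
# Point the provider at your workspace (placeholder URL).
export DATABRICKS_HOST="https://dbc-1234567890123456.cloud.databricks.com"

# Personal access token (placeholder value). Generate a real one in your
# workspace's user settings, and keep it out of version control.
export DATABRICKS_TOKEN="dapi-example-token"
```

With these variables set, the DatabricksProvider block in your CDKTF code needs no hard-coded credentials. A Databricks configuration profile file is another supported option; see the provider's authentication documentation for the full list.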
Step 1: Create a CDKTF project
In this step, on your local development machine you set up the necessary directory structure for a CDKTF project. You then create your CDKTF project within this directory structure.
Create an empty directory for your CDKTF project, and then switch to it. Run the following commands in your terminal or with PowerShell.

For Unix, Linux, and macOS:

mkdir cdktf-demo
cd cdktf-demo

For Windows:

md cdktf-demo
cd cdktf-demo
Create a CDKTF project by running the following command:
cdktf init --template=python --local
When prompted for a Project Name, accept the default project name of cdktf-demo by pressing Enter.

When prompted for a Project Description, accept the default project description by pressing Enter.

If prompted Do you want to start from an existing Terraform project, enter N and press Enter.

If prompted Do you want to send crash reports to the CDKTF team, enter n and press Enter.
The CDKTF creates the following files and subdirectories in your cdktf-demo directory:

.gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.

cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings.

help, which contains information about some next steps you can take to work with your CDKTF project.

main-test.py, which contains supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.

main.py, which contains the Python code that you write for your CDKTF project.

Pipfile and Pipfile.lock, which manage code dependencies for your CDKTF project.
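For reference, a freshly generated cdktf.json for the Python local template typically looks something like the following sketch. Your generated file may differ, and the projectId value shown here is a placeholder for the unique identifier that cdktf init generates:

```json
{
  "language": "python",
  "app": "pipenv run python main.py",
  "projectId": "your-generated-project-id",
  "sendCrashReports": "false",
  "terraformProviders": [],
  "terraformModules": [],
  "context": {}
}
```

See Configuration File in the CDKTF documentation for what each key controls.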
Step 2: Define resources
In this step, you use the Terraform CDK Databricks provider to define a notebook and a job to run that notebook.
Install the project dependencies: using pipenv, install into your CDKTF project the Terraform CDK Databricks provider to generate Databricks resources. To do this, run the following:

pipenv install cdktf-cdktf-provider-databricks
Replace the contents of the main.py file with the following code. This code authenticates the CDKTF with your Databricks workspace, then generates a notebook along with a job to run the notebook. To view syntax documentation for this code, see the Terraform CDK Databricks provider construct reference for Python.

#!/usr/bin/env python
from constructs import Construct
from cdktf import (
  App, TerraformStack, TerraformOutput
)
from cdktf_cdktf_provider_databricks import (
  data_databricks_current_user,
  job,
  notebook,
  provider
)
import vars
from base64 import b64encode

class MyStack(TerraformStack):
  def __init__(self, scope: Construct, ns: str):
    super().__init__(scope, ns)

    provider.DatabricksProvider(
      scope = self,
      id = "databricksAuth"
    )

    current_user = data_databricks_current_user.DataDatabricksCurrentUser(
      scope = self,
      id_ = "currentUser"
    )

    # Define the notebook.
    my_notebook = notebook.Notebook(
      scope = self,
      id_ = "notebook",
      path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
      language = "PYTHON",
      content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    )

    # Define the job to run the notebook.
    my_job = job.Job(
      scope = self,
      id_ = "job",
      name = f"{vars.resource_prefix}-job",
      task = [
        job.JobTask(
          task_key = f"{vars.resource_prefix}-task",
          new_cluster = job.JobTaskNewCluster(
            num_workers = vars.num_workers,
            spark_version = vars.spark_version,
            node_type_id = vars.node_type_id
          ),
          notebook_task = job.JobTaskNotebookTask(
            notebook_path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py"
          ),
          email_notifications = job.JobTaskEmailNotifications(
            on_success = [ current_user.user_name ],
            on_failure = [ current_user.user_name ]
          )
        )
      ]
    )

    # Output the notebook and job URLs.
    TerraformOutput(
      scope = self,
      id = "Notebook URL",
      value = my_notebook.url
    )

    TerraformOutput(
      scope = self,
      id = "Job URL",
      value = my_job.url
    )

app = App()
MyStack(app, "cdktf-demo")
app.synth()
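The content_base64 argument carries the notebook source as a Base64 string, because the Notebook resource accepts encoded content rather than raw text. You can preview exactly what Terraform will store by encoding the same bytes with Python's standard base64 module:

```python
from base64 import b64decode, b64encode

# The notebook source that the stack embeds in the Notebook resource.
source = b"display(spark.range(10))"

# Encode it the same way the stack code does.
encoded = b64encode(source).decode("UTF-8")
print(encoded)  # → ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp

# Decoding round-trips back to the original notebook source.
assert b64decode(encoded) == source
```

This encoded string is the same value that the unit tests later in this article assert against.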
Create a file named vars.py in the same directory as main.py. Replace the following values with your own values to specify a resource prefix and cluster settings such as the number of workers, Spark runtime version string, and node type.

#!/usr/bin/env python
resource_prefix = "cdktf-demo"
num_workers = 1
spark_version = "14.3.x-scala2.12"
node_type_id = "i3.xlarge"
Step 3: Deploy the resources
In this step, you use the CDKTF CLI to deploy, into your existing Databricks workspace, the defined notebook and the job to run that notebook.
Generate the Terraform code equivalent for your CDKTF project by running the following command:

cdktf synth
Before making changes, you can review the pending resource changes. Run the following:
cdktf diff
Deploy the notebook and job by running the following command:

cdktf deploy
When prompted to Approve, press Enter. Terraform creates and deploys the notebook and job into your workspace.
Step 4: Interact with the resources
In this step, you run the job in your Databricks workspace, which runs the specified notebook.
To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To view the job that runs the notebook in your workspace, copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To run the job, click the Run now button on the job page.
(Optional) Step 5: Make changes to a resource
In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.
If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.
In the main.py file, change the notebook variable declaration from the following:

my_notebook = notebook.Notebook(
  scope = self,
  id_ = "notebook",
  path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
  language = "PYTHON",
  content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
)

To the following:

my_notebook = notebook.Notebook(
  scope = self,
  id_ = "notebook",
  path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
  language = "PYTHON",
  content_base64 = b64encode(b'''
data = [
  { "Category": 'A', "ID": 1, "Value": 121.44 },
  { "Category": 'B', "ID": 2, "Value": 300.01 },
  { "Category": 'C', "ID": 3, "Value": 10.99 },
  { "Category": 'E', "ID": 4, "Value": 33.87 }
]

df = spark.createDataFrame(data)

display(df)
''').decode("UTF-8")
)
Note
Make sure that the lines of code between the triple quotes (''') are aligned with the left edge of your code editor, as shown. Otherwise, Terraform inserts additional whitespace into the notebook that may cause the new Python code to fail to run.

Regenerate the Terraform code equivalent for your CDKTF project. To do this, run the following:
cdktf synth
Before making changes, you can review the pending resource changes. Run the following:
cdktf diff
Deploy the notebook changes by running the following command:

cdktf deploy
When prompted to Approve, press Enter. Terraform changes the notebook’s contents.
To view the changed notebook that the job will run in your workspace, refresh the notebook that you opened earlier, or copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To view the job that runs the changed notebook in your workspace, refresh the job that you opened earlier, or copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser's address bar.

To run the job, click the Run now button on the job page.
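If keeping the notebook source flush against the left margin makes your stack code harder to read, one workaround (a sketch, not part of the original example) is to indent the literal normally and strip the shared indentation with Python's standard textwrap.dedent before encoding:

```python
from base64 import b64encode
from textwrap import dedent

# An indented multiline literal; dedent removes the common leading
# whitespace, so Terraform does not write extra spaces into the notebook.
notebook_source = dedent("""\
    data = [{"Category": "A", "ID": 1, "Value": 121.44}]
    df = spark.createDataFrame(data)
    display(df)
""")

# After dedent, every line starts at column 0.
assert notebook_source.startswith("data = ")

# Encode the cleaned source for the content_base64 argument.
content_base64 = b64encode(notebook_source.encode("UTF-8")).decode("UTF-8")
```

The encoded result is identical to what you would get from a left-aligned literal, so the deployed notebook runs the same either way.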
Step 6: Clean up
In this step, you use the CDKTF CLI to remove the notebook and job from your Databricks workspace.
Remove the resources from your workspace by running the following command:

cdktf destroy
When prompted to Approve, press Enter. Terraform removes the resources from your workspace.
Testing
You can test your CDKTF project before you deploy it. See Unit Tests in the CDKTF documentation.
For Python-based CDKTF projects, you can write and run tests by using the Python test framework pytest along with the cdktf package's Testing class. The following example file named test_main.py tests the CDKTF code in this article's preceding main.py file. The first test checks whether the project's notebook will contain the expected Base64-encoded representation of the notebook's content. The second test checks whether the project's job will contain the expected job name. To run these tests, run the pytest command from the project's root directory.
from cdktf import App, Testing
from cdktf_cdktf_provider_databricks import job, notebook
from main import MyStack

class TestMain:
  app = App()
  stack = MyStack(app, "cdktf-demo")
  synthesized = Testing.synth(stack)

  def test_notebook_should_have_expected_base64_content(self):
    assert Testing.to_have_resource_with_properties(
      received = self.synthesized,
      resource_type = notebook.Notebook.TF_RESOURCE_TYPE,
      properties = {
        "content_base64": "ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp"
      }
    )

  def test_job_should_have_expected_job_name(self):
    assert Testing.to_have_resource_with_properties(
      received = self.synthesized,
      resource_type = job.Job.TF_RESOURCE_TYPE,
      properties = {
        "name": "cdktf-demo-job"
      }
    )
More resources
Terraform CDK Databricks provider construct reference for TypeScript, Python, Java, C#, and Go
Enable logging for CDKTF applications