dbx by Databricks Labs
Important
This documentation has been retired and might not be updated.
Databricks recommends that you use Databricks Asset Bundles instead of dbx
by Databricks Labs. See What are Databricks Asset Bundles? and Migrate from dbx to bundles.
Note
This article covers dbx
by Databricks Labs, which is provided as-is and is not supported by Databricks through customer technical support channels. Questions and feature requests can be communicated through the Issues page of the databrickslabs/dbx repo on GitHub.
dbx by Databricks Labs is an open source tool designed to extend the legacy Databricks command-line interface (Databricks CLI) and to provide functionality for a rapid development lifecycle and continuous integration and continuous delivery/deployment (CI/CD) on the Databricks platform.
dbx simplifies job launch and deployment across multiple environments. It also helps to package your project and deliver it to your Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be actively used both inside CI/CD pipelines and as part of local tooling (such as local IDEs, including Visual Studio Code and PyCharm).
The typical development workflow with dbx is:
Create a remote repository with a Git provider Databricks supports, if you do not have a remote repo available already.
Clone your remote repo into your Databricks workspace.
Create or move a Databricks notebook into the cloned repo in your Databricks workspace. Use this notebook to begin prototyping the code that you want your Databricks clusters to run.
To enhance and modularize your notebook code by adding separate helper classes and functions, configuration files, and tests, switch over to using a local development machine with dbx, your preferred IDE, and Git installed.
Clone your remote repo to your local development machine.
Move your code out of your notebook into one or more local code files.
As you code locally, push your work from your local repo to your remote repo. Also, sync your remote repo with your Databricks workspace.
Tip
Alternatively, you can use dbx sync to automatically synchronize local file changes with corresponding files in your workspace, in real time.
Keep using the notebook in your Databricks workspace for rapid prototyping, and keep moving validated code from your notebook to your local machine. Keep using your local IDE for tasks such as code modularization, code completion, linting, unit testing, and step-through debugging of code and objects that do not require a live connection to Databricks.
Use dbx to batch run your local code on your target clusters, as desired. (This is similar to running the spark-submit script in Spark's bin directory to launch applications on a Spark cluster.) See the example command sequence after this list.
When you are ready for production, use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab to automate running your remote repo's code on your clusters.
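For example, after dbx is configured (as described later in this article), the local loop in this workflow typically uses commands like the following. The cluster ID and the dbx-demo-job workflow name are placeholders taken from the examples later in this article:

# Batch run a local Python file on an existing all-purpose cluster.
dbx execute --cluster-id=<existing-cluster-id> dbx-demo-job --no-package

# Deploy versioned artifacts to your workspace, and then launch the deployed job.
dbx deploy --no-package
dbx launch dbx-demo-job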
Requirements
To use dbx, you must have the following installed on your local development machine, regardless of whether your code uses Python, Scala, or Java:
Python version 3.8 or above.
If your code uses Python, you should use a version of Python that matches the one that is installed on your target clusters. To get the version of Python that is installed on an existing cluster, you can use the cluster's web terminal to run the python --version command. See also the "System environment" section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.
pip.
If your code uses Python, a method to create Python virtual environments to ensure you are using the correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.
dbx version 0.8.0 or above. You can install this package from the Python Package Index (PyPI) by running pip install dbx.
To confirm that dbx is installed, run the following command:
dbx --version
If the version number is returned, dbx is installed.
If the version number is below 0.8.0, upgrade dbx by running the following command, and then check the version number again:
pip install dbx --upgrade
dbx --version

# Or ...

python -m pip install dbx --upgrade
dbx --version
The Databricks CLI version 0.18 or below, set up with authentication. The legacy Databricks CLI (Databricks CLI version 0.17) is automatically installed when you install dbx. This authentication can be set up on your local development machine in one or both of the following locations:
In the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with legacy Databricks CLI version 0.8.0).
In a Databricks configuration profile within your .databrickscfg file.
dbx searches these two locations in that order and uses only the first set of matching credentials that it finds. (A sketch of both approaches appears after this list.)
Note
dbx does not support the use of a .netrc file for authentication, beginning with legacy Databricks CLI version 0.17.2. To check your installed legacy Databricks CLI version, run the command databricks --version.
git, for pushing and syncing local and remote code changes.
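For example, either of the following setups satisfies the authentication requirement. The workspace URL and personal access token values shown here are placeholders; substitute your own.

Environment variables (Linux and macOS shown):

export DATABRICKS_HOST="https://<your-workspace-url>"
export DATABRICKS_TOKEN="<your-personal-access-token>"

Or a profile in your .databrickscfg file:

[DEFAULT]
host = https://<your-workspace-url>
token = <your-personal-access-token>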
Continue with the instructions for one of the following IDEs:
Note
Databricks has validated usage of the following IDEs with dbx; however, dbx should work with any IDE. You can also use No IDE (terminal only).
dbx is optimized to work with single-file Python code files and compiled Scala and Java JAR files. dbx does not work with single-file R code files or compiled R code packages. This is because dbx works with the Jobs API 2.0 and 2.1, and these APIs cannot run single-file R code files or compiled R code packages as jobs.
Visual Studio Code
Complete the following instructions to begin using Visual Studio Code and Python with dbx.
On your local development machine, you must have the following installed in addition to the general requirements:
The Python extension for Visual Studio Code. For more information, see Extension Marketplace on the Visual Studio Code website.
For more information, see Getting Started with Python in VS Code in the Visual Studio Code documentation.
Follow these steps to begin setting up your dbx
project structure:
From your terminal, create a blank folder. These instructions use a folder named dbx-demo. You can give your dbx project's root folder any name you want. If you use a different name, replace the name throughout these steps. After you create the folder, switch to it, and then start Visual Studio Code from that folder.
For Linux and macOS:
mkdir dbx-demo
cd dbx-demo
code .
Tip
If command not found: code displays after you run code ., see Launching from the command line on the Microsoft website.
For Windows:
md dbx-demo
cd dbx-demo
code .
In Visual Studio Code, create a Python virtual environment for this project:
On the menu bar, click View > Terminal.
From the root of the dbx-demo folder, run the pipenv command with the following option, where <version> is the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters' version of Python), for example 3.8.14.
pipenv --python <version>
Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it in the next step.
Select the target Python interpreter, and then activate the Python virtual environment:
On the menu bar, click View > Command Palette, type Python: Select, and then click Python: Select Interpreter.
Select the Python interpreter within the path to the Python virtual environment that you just created. (This path is listed as the Virtualenv location value in the output of the pipenv command.)
On the menu bar, click View > Command Palette, type Terminal: Create, and then click Terminal: Create New Terminal.
For more information, see Using Python environments in VS Code in the Visual Studio Code documentation.
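The new terminal uses the Python virtual environment that you selected, so you can install dbx into it and confirm the installed version there, using the same commands as in Requirements:

pip install dbx
dbx --version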
Continue with Create a dbx project.
PyCharm
Complete the following instructions to begin using PyCharm and Python with dbx.
On your local development machine, you must have PyCharm installed in addition to the general requirements.
Follow these steps to begin setting up your dbx
project structure:
In PyCharm, on the menu bar, click File > New Project.
In the Create Project dialog, choose a location for your new project.
Expand Python interpreter: New Pipenv environment.
Select New environment using, if it is not already selected, and then select Pipenv from the drop-down list.
For Base interpreter, select the location that contains the Python interpreter for the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters’ version of Python).
For Pipenv executable, select the location that contains your local installation of pipenv, if it is not already auto-detected.
If you want to create a minimal dbx project, and you want to use the main.py file with that minimal dbx project, then select the Create a main.py welcome script box. Otherwise, clear this box.
Click Create.
In the Project tool window, right-click the project’s root folder, and then click Open in > Terminal.
Continue with Create a dbx project.
IntelliJ IDEA
Complete the following instructions to begin using IntelliJ IDEA and Scala with dbx. These instructions create a minimal sbt-based Scala project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general requirements:
The Scala plugin for IntelliJ IDEA. For more information, see Discover IntelliJ IDEA for Scala in the IntelliJ IDEA documentation.
Java Runtime Environment (JRE) 8. While any edition of JRE 8 should work, Databricks has so far only validated usage of dbx and IntelliJ IDEA with the OpenJDK 8 JRE. Databricks has not yet validated usage of dbx with IntelliJ IDEA and Java 11. For more information, see Java Development Kit (JDK) in the IntelliJ IDEA documentation.
Follow these steps to begin setting up your dbx
project structure:
Step 1: Create an sbt-based Scala project
In IntelliJ IDEA, depending on your view, click Projects > New Project or File > New > Project.
In the New Project dialog, click Scala, click sbt, and then click Next.
Enter a project name and a location for the project.
For JDK, select your installation of the OpenJDK 8 JRE.
For sbt, choose the highest available version of sbt that is listed.
For Scala, ideally, choose the version of Scala that matches your target clusters' version of Scala. See the "System environment" section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.
Next to Scala, select the Sources box if it is not already selected.
Add a package prefix to Package Prefix. These steps use the package prefix com.example.demo. If you specify a different package prefix, replace the package prefix throughout these steps.
Click Finish.
Step 2: Add an object to the package
You can add any required objects to your package. This package contains a single object named SampleApp.
In the Project tool window (View > Tool Windows > Project), right-click the project-name > src > main > scala folder, and then click New > Scala Class.
Choose Object, and type the object's name and then press Enter. For example, type SampleApp. If you enter a different object name here, be sure to replace the name throughout these steps.
Replace the contents of the SampleApp.scala file with the following code:

package com.example.demo

object SampleApp {
  def main(args: Array[String]) {
  }
}
Step 3: Build the project
Add any required project build settings and dependencies to your project. This step assumes that you are building a project that was set up in the previous steps and that depends only on the following libraries.
Replace the contents of the project's build.sbt file with the following content:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.12.14"

val sparkVersion = "3.2.1"

lazy val root = (project in file("."))
  .settings(
    name := "dbx-demo",
    idePackagePrefix := Some("com.example.demo"),
    libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion withSources(),
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
    libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion withSources()
  )
In the preceding file, replace:
2.12.14 with the version of Scala that you chose earlier for this project.
3.2.1 with the version of Spark that you chose earlier for this project.
dbx-demo with the name of your project.
com.example.demo with the name of your package prefix.
On the menu bar, click View > Tool Windows > sbt.
In the sbt tool window, right-click the name of your project, and click Reload sbt Project. Wait until sbt finishes downloading the project's dependencies from an Internet artifact store (such as Coursier or Ivy by default, depending on your version of sbt). You can watch the download progress in the status bar. If you add or change any more dependencies in this project, you must repeat this project reloading step for each set of dependencies that you add or change.
On the menu bar, click IntelliJ IDEA > Preferences.
In the Preferences dialog, click Build, Execution, Deployment > Build Tools > sbt.
In JVM, for JRE, select your installation of the OpenJDK 8 JRE.
In sbt projects, select the name of your project.
In sbt shell, select builds.
Click OK.
On the menu bar, click Build > Build Project. The build’s results appear in the sbt shell tool window (View > Tool Windows > sbt shell).
Step 4: Add code to the project
Add any required code to your project. This step assumes that you only want to add code to the SampleApp.scala file in the com.example.demo package.
In the project's src > main > scala > SampleApp.scala file, add the code that you want dbx to batch run on your target clusters. For basic testing, use the example Scala code in the section Code example.
Step 5: Run the project
On the menu bar, click Run > Edit Configurations.
In the Run/Debug Configurations dialog, click the + (Add New Configuration) icon, or Add new, or Add new run configuration.
In the drop-down, click sbt Task.
For Name, enter a name for the configuration, for example, Run the program.
For Tasks, enter ~run.
Select Use sbt shell.
Click OK.
On the menu bar, click Run > Run ‘Run the program’. The run’s results appear in the sbt shell tool window.
Step 6: Build the project as a JAR
You can add any JAR build settings to your project that you want. This step assumes that you only want to build a JAR that is based on the project that was set up in the previous steps.
On the menu bar, click File > Project Structure.
In the Project Structure dialog, click Project Settings > Artifacts.
Click the + (Add) icon.
In the drop-down list, select JAR > From modules with dependencies.
In the Create JAR from Modules dialog, for Module, select the name of your project.
For Main Class, click the folder icon.
In the Select Main Class dialog, on the Search by Name tab, select SampleApp, and then click OK.
For JAR files from libraries, select copy to the output directory and link via manifest.
Click OK to close the Create JAR from Modules dialog.
Click OK to close the Project Structure dialog.
On the menu bar, click Build > Build Artifacts.
In the context menu that appears, select project-name:jar > Build. Wait while
sbt
builds your JAR. The build’s results appear in the Build Output tool window (View > Tool Windows > Build).
The JAR is built to the project's out > artifacts > <project-name>_jar folder. The JAR's name is <project-name>.jar.
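If you later reference this JAR from a dbx deployment file (as shown in Create a minimal dbx project for Scala or Java), the jar path follows this output layout; the folder and file names here are examples based on the dbx-demo project name used in these steps:

libraries:
  - jar: "file://out/artifacts/dbx_demo_jar/dbx-demo.jar"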
Step 7: Display the terminal in the IDE
With your dbx
project structure now in place, you are ready to create your dbx
project.
Display the IntelliJ IDEA terminal by clicking View > Tool Windows > Terminal on the menu bar, and then continue with Create a dbx project.
Eclipse
Complete the following instructions to begin using Eclipse and Java with dbx. These instructions create a minimal Maven-based Java project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general requirements:
A version of Eclipse. These instructions use the Eclipse IDE for Java Developers edition of the Eclipse IDE.
An edition of the Java Runtime Environment (JRE) or Java Development Kit (JDK) 11, depending on your local machine’s operating system. While any edition of JRE or JDK 11 should work, Databricks has so far only validated usage of
dbx
and the Eclipse IDE for Java Developers with Eclipse 2022-03 R, which includes AdoptOpenJDK 11.
Follow these steps to begin setting up your dbx
project structure:
Step 1: Create a Maven-based Java project
In Eclipse, click File > New > Project.
In the New Project dialog, expand Maven, select Maven Project, and click Next.
In the New Maven Project dialog, select Create a simple project (skip archetype selection), and click Next.
For Group Id, enter a group ID that conforms to Java's package name rules. These steps use the package name of com.example.demo. If you enter a different group ID, substitute it throughout these steps.
For Artifact Id, enter a name for the JAR file without the version number. These steps use the JAR name of dbx-demo. If you enter a different name for the JAR file, substitute it throughout these steps.
Click Finish.
Step 2: Add a class to the package
You can add any classes to your package that you want. This package will contain a single class named SampleApp.
In the Project Explorer view (Window > Show View > Project Explorer), select the project-name project icon, and then click File > New > Class.
In the New Java Class dialog, for Package, enter com.example.demo.
For Name, enter SampleApp.
For Modifiers, select public.
Leave Superclass blank.
For Which method stubs would you like to create, select public static void main(String[] args).
Click Finish.
Step 3: Add dependencies to the project
In the Project Explorer view, double-click project-name > pom.xml.
Add the following dependencies as a child element of the <project> element, and then save the file:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
Replace:
2.12 with your target clusters' version of Scala.
3.2.1 with your target clusters' version of Spark.
See the “System environment” section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.
Step 4: Compile the project
In the project’s
pom.xml
file, add the following Maven compiler properties as a child element of the<project>
element, and then save the file:<properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <maven.compiler.source>1.6</maven.compiler.source> <maven.compiler.target>1.6</maven.compiler.target> </properties>
In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.
In the Run Configurations dialog, click Maven Build.
Click the New launch configuration icon.
Enter a name for this launch configuration, for example clean compile.
For Base directory, click Workspace, choose your project’s directory, and click OK.
For Goals, enter clean compile.
Click Run. The run's output appears in the Console view (Window > Show View > Console).
Step 5: Add code to the project
You can add any code to your project that you want. This step assumes that you only want to add code to a file named SampleApp.java for a package named com.example.demo.
In the project's src/main/java > com.example.demo > SampleApp.java file, add the code that you want dbx to batch run on your target clusters. (If you do not have any code handy, you can use the Java code in the Code example, listed toward the end of this article.)
Step 6: Run the project
In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.
In the Run Configurations dialog, expand Java Application, and then click App.
Click Run. The run’s output appears in the Console view.
Step 7: Build the project as a JAR
In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.
In the Run Configurations dialog, click Maven Build.
Click the New launch configuration icon.
Enter a name for this launch configuration, for example clean package.
For Base directory, click Workspace, choose your project’s directory, and click OK.
For Goals, enter clean package.
Click Run. The run's output appears in the Console view.
The JAR is built to the <project-name> > target folder. The JAR's name is <project-name>-0.0.1-SNAPSHOT.jar.
Note
If the JAR does not appear in the target folder in the Project Explorer window at first, you can try to display it by right-clicking the project-name project icon and then clicking Refresh.
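If you later reference this JAR from a dbx deployment file (as shown in Create a minimal dbx project for Scala or Java), the jar path follows Maven's target output layout; the file name here is an example based on the dbx-demo artifact ID and version used in these steps:

libraries:
  - jar: "file://target/dbx-demo-0.0.1-SNAPSHOT.jar"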
Step 8: Display the terminal in the IDE
With your dbx project structure now in place, you are ready to create your dbx project. To start, set the Project Explorer view to show the hidden files (files starting with a dot (.)) that dbx generates, as follows:
In the Project Explorer view, click the ellipses (View Menu) filter icon, and then click Filters and Customization.
In the Filters and Customization dialog, on the Pre-set filters tab, clear the .* resources box.
Click OK.
Next, display the Eclipse terminal as follows:
Click Window > Show View > Terminal on the menu bar.
In the terminal’s command prompt does not appear, in the Terminal view, click the Open a Terminal icon.
Use the
cd
command to switch to your project’s root directory.Continue with Create a dbx project.
No IDE (terminal only)
Complete the following instructions to begin using a terminal and Python with dbx.
Follow these steps to use a terminal to begin setting up your dbx
project structure:
From your terminal, create a blank folder. These instructions use a folder named dbx-demo (but you can give your dbx project's root folder any name you want). After you create the folder, switch to it.
For Linux and macOS:
mkdir dbx-demo
cd dbx-demo
For Windows:
md dbx-demo
cd dbx-demo
Create a Python virtual environment for this project by running the pipenv command, with the following option, from the root of the dbx-demo folder, where <version> is the target version of Python that you already have installed locally, for example 3.8.14.
pipenv --python <version>
Activate your Python virtual environment by running pipenv shell.
pipenv shell
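With the virtual environment activated, you can also install dbx into it and confirm the installed version, using the same commands as in Requirements:

pip install dbx
dbx --version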
Continue with Create a dbx project.
Create a dbx project
With your dbx
project structure in place from one of the previous sections, you are now ready to create one of the following types of projects:
Create a minimal dbx project for Python
The following minimal dbx project is the simplest and fastest approach to getting started with Python and dbx. It demonstrates batch running of a single Python code file on an existing Databricks all-purpose cluster in your Databricks workspace.
Note
To create a dbx
templated project for Python that demonstrates batch running of code on all-purpose clusters and jobs clusters, remote code artifact deployments, and CI/CD platform setup, skip ahead to Create a dbx templated project for Python with CI/CD support.
To complete this procedure, you must have an existing all-purpose cluster in your workspace. (See View compute or Compute configuration reference.) Ideally (but not required), the version of Python in your Python virtual environment should match the version that is installed on this cluster. To identify the version of Python on the cluster, use the cluster's web terminal to run the following command:
python --version
From your terminal, from your dbx project's root folder, run the dbx configure command with the following options. This command creates a hidden .dbx folder within your dbx project's root folder. This .dbx folder contains lock.json and project.json files.
dbx configure --profile DEFAULT --environment default
Note
The project.json file defines an environment named default along with a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace --profile DEFAULT with --profile followed by your target profile's name, in the dbx configure command.
For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your project.json file might look like this instead, in which case you would also replace --environment default with --environment dev in the dbx configure command:

{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Workspace/Shared/dbx/projects/<current-folder-name>",
        "artifact_location": "dbfs:/dbx/<current-folder-name>"
      }
    },
    "dev": {
      "profile": "DEV",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Workspace/Shared/dbx/projects/<some-other-folder-name>",
        "artifact_location": "dbfs:/dbx/<some-other-folder-name>"
      }
    }
  }
}
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave out the --profile option altogether from the dbx configure command.
Create a folder named conf within your dbx project's root folder.
For Linux and macOS:
mkdir conf
For Windows:
md conf
Add a file named deployment.yaml to the conf directory, with the following file contents:

build:
  no_build: true
environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        spark_python_task:
          python_file: "file://dbx-demo-job.py"
Note
The deployment.yaml file contains the lower-cased word default, which is a reference to the upper-cased DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace default with your target profile's name.
For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your deployment.yaml file might look like this instead:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        spark_python_task:
          python_file: "file://dbx-demo-job.py"
  dev:
    workflows:
      - name: "<some-other-job-name>"
        spark_python_task:
          python_file: "file://<some-other-filename>.py"
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave default in the deployment.yaml file as is. dbx will use this reference by default.
Tip
To add Spark configuration key-value pairs to a job, use the spark_conf field, for example:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        spark_conf:
          spark.speculation: true
          spark.streaming.ui.retainedBatches: 5
          spark.driver.extraJavaOptions: "-verbose:gc -XX:+PrintGCDetails"
        # ...
To add permissions to a job, use the access_control_list field, for example:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        access_control_list:
          - user_name: "someone@example.com"
            permission_level: "IS_OWNER"
          - group_name: "some-group"
            permission_level: "CAN_VIEW"
        # ...
Note that the access_control_list field must be exhaustive, so the job's owner should be added to the list in addition to other user and group permissions.
Add the code to run on the cluster to a file named dbx-demo-job.py and add the file to the root folder of your dbx project. (If you do not have any code handy, you can use the Python code in the Code example, listed toward the end of this article.)
Note
You do not have to name this file dbx-demo-job.py. If you choose a different file name, be sure to update the python_file field in the conf/deployment.yaml file to match.
Run the dbx execute command with the following options. In this command, replace <existing-cluster-id> with the ID of the target cluster in your workspace. (To get the ID, see Cluster URL and ID.)
dbx execute --cluster-id=<existing-cluster-id> dbx-demo-job --no-package
To view the run’s results locally, see your terminal’s output. To view the run’s results on your cluster, go to the Standard output pane in the Driver logs tab for your cluster. (See Compute driver and worker logs.)
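If you later want to run this job as a deployed workflow on a jobs cluster instead of batch running it with dbx execute, you can reuse the dbx deploy and dbx launch commands shown in Create a minimal dbx project for Scala or Java; note that you would first need to add a cluster definition (for example, a new_cluster block like the one in that section) to the dbx-demo-job workflow in conf/deployment.yaml. A minimal sketch:

dbx deploy --no-package
dbx launch dbx-demo-job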
Continue with Next steps.
Create a minimal dbx project for Scala or Java
The following minimal dbx
project is the simplest and fastest approach to getting started with dbx
and Scala or Java. It demonstrates deploying a single Scala or Java JAR to your Databricks workspace and then running that deployed JAR on a Databricks jobs cluster in your Databricks workspace.
Note
Databricks limits how you can run Scala and Java code on clusters:
You cannot run a single Scala or Java file as a job on a cluster as you can with a single Python file. To run Scala or Java code, you must first build it into a JAR.
You can run a JAR as a job on an existing all-purpose cluster. However, you cannot reinstall any updates to that JAR on the same all-purpose cluster. In this case, you must use a job cluster instead. This section uses the job cluster approach.
You must first deploy the JAR to your Databricks workspace before you can run that deployed JAR on any all-purpose cluster or jobs cluster in that workspace.
In your terminal, from your project's root folder, run the dbx configure command with the following options. This command creates a hidden .dbx folder within your project's root folder. This .dbx folder contains lock.json and project.json files.
dbx configure --profile DEFAULT --environment default
Note
The project.json file defines an environment named default along with a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace --profile DEFAULT with --profile followed by your target profile's name, in the dbx configure command.
For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your project.json file might look like this instead, in which case you would also replace --environment default with --environment dev in the dbx configure command:

{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Workspace/Shared/dbx/projects/<current-folder-name>",
        "artifact_location": "dbfs:/dbx/<current-folder-name>"
      }
    },
    "dev": {
      "profile": "DEV",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Workspace/Shared/dbx/projects/<some-other-folder-name>",
        "artifact_location": "dbfs:/dbx/<some-other-folder-name>"
      }
    }
  }
}
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave out the --profile option altogether from the dbx configure command.
Create a folder named conf within your project's root folder.
For Linux and macOS:
mkdir conf
For Windows:
md conf
Add a file named deployment.yaml to the conf directory, with the following minimal file contents:

build:
  no_build: true
environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          node_type_id: "i3.xlarge"
          aws_attributes:
            first_on_demand: 1
            availability: "SPOT"
          num_workers: 2
          instance_pool_id: "my-instance-pool"
        libraries:
          - jar: "file://out/artifacts/dbx_demo_jar/dbx-demo.jar"
        spark_jar_task:
          main_class_name: "com.example.demo.SampleApp"
Replace:
The value of spark_version with the appropriate runtime version for your target jobs cluster.
The value of node_type_id with the appropriate worker and driver node types for your target jobs cluster.
The value of instance_pool_id with the ID of an existing instance pool in your workspace, to enable faster running of jobs. If you do not have an existing instance pool available or you do not want to use an instance pool, remove this line altogether.
The value of jar with the path in the project to the JAR. For IntelliJ IDEA with Scala, it could be file://out/artifacts/dbx_demo_jar/dbx-demo.jar. For the Eclipse IDE with Java, it could be file://target/dbx-demo-0.0.1-SNAPSHOT.jar.
The value of main_class_name with the name of the main class in the JAR, for example com.example.demo.SampleApp.
Note
The deployment.yaml file contains the word default, which is a reference to the default environment in the .dbx/project.json file, which in turn is a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace default in this deployment.yaml file with the corresponding reference in the .dbx/project.json file, which in turn references the corresponding profile within your .databrickscfg file.
For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your deployment.yaml file might look like this instead:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        # ...
  dev:
    workflows:
      - name: "<some-other-job-name>"
        # ...
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave default in the deployment.yaml file as is. dbx will use the default environment settings (except for the profile value) in the .dbx/project.json file by default.
Tip
To add Spark configuration key-value pairs to a job, use the spark_conf field, for example:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        spark_conf:
          spark.speculation: true
          spark.streaming.ui.retainedBatches: 5
          spark.driver.extraJavaOptions: "-verbose:gc -XX:+PrintGCDetails"
        # ...
To add permissions to a job, use the access_control_list field, for example:

environments:
  default:
    workflows:
      - name: "dbx-demo-job"
        access_control_list:
          - user_name: "someone@example.com"
            permission_level: "IS_OWNER"
          - group_name: "some-group"
            permission_level: "CAN_VIEW"
        # ...
Note that the access_control_list field must be exhaustive, so the job's owner should be added to the list in addition to other user and group permissions.
Run the dbx deploy command. dbx deploys the JAR to the location in the .dbx/project.json file's artifact_location path for the matching environment. dbx also deploys the project's files as part of an MLflow experiment, to the location listed in the .dbx/project.json file's workspace_directory path for the matching environment.
dbx deploy --no-package
Run the dbx launch command with the following option. This command runs the job with the matching name in conf/deployment.yaml. To find the deployed JAR to run as part of the job, dbx references the location in the .dbx/project.json file's artifact_location path for the matching environment. To determine which specific JAR to run, dbx references the MLflow experiment in the location listed in the .dbx/project.json file's workspace_directory path for the matching environment.
dbx launch dbx-demo-job
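dbx launch returns after the run is triggered. If you want the command to also wait for the run to finish and report its status, recent dbx versions provide a --trace option; run dbx launch --help in your installed version to confirm that it is available:

dbx launch dbx-demo-job --trace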
To view the job run’s results on your jobs cluster, see View jobs.
To view the experiment that the job referenced, see Organize training runs with MLflow experiments.
Continue with Next steps.
Create a dbx templated project for Python with CI/CD support
The following dbx
templated project for Python demonstrates support for batch running of Python code on Databricks all-purpose clusters and jobs clusters in your Databricks workspaces, remote code artifact deployments, and CI/CD platform setup. (To create a minimal dbx
project for Python that only demonstrates batch running of a single Python code file on an existing all-purpose cluster, skip back to Create a minimal dbx project for Python.)
From your terminal, in your dbx project's root folder, run the dbx init command.
dbx init
For project_name, enter a name for your project, or press Enter to accept the default project name.
For version, enter a starting version number for your project, or press Enter to accept the default project version.
For cloud, select the number that corresponds to the Databricks cloud version that you want your project to use, or press Enter to accept the default.
For cicd_tool, select the number that corresponds to the supported CI/CD tool that you want your project to use, or press Enter to accept the default.
For project_slug, enter a prefix that you want to use for resources in your project, or press Enter to accept the default.
For workspace_directory, enter the path to the workspace directory for your project, or press Enter to accept the default.
For artifact_location, enter the path in your Databricks workspace to where your project’s artifacts will be written, or press Enter to accept the default.
For profile, enter the name of the CLI authentication profile that you want your project to use, or press Enter to accept the default.
Tip
You can skip the preceding steps by running dbx init
with hard-coded template parameters, for example:
dbx init --template="python_basic" \
-p "project_name=cicd-sample-project" \
-p "cloud=AWS" \
-p "cicd_tool=GitHub Actions" \
-p "profile=DEFAULT" \
--no-input
dbx calculates the parameters project_slug, workspace_directory, and artifact_location automatically. These three parameters are optional, and they are useful only for more advanced use cases.
See the init command in CLI Reference in the dbx documentation.
See also Next steps.
Code example
If you do not have any code readily available to batch run with dbx, you can experiment by having dbx batch run the following code, which is shown in Python, Scala, and Java versions. This code creates a small table in your workspace, queries the table, and then deletes the table.
Tip
If you want to leave the table in your workspace instead of deleting it, comment out the last line of code in this example before you batch run it with dbx.
Python

# For testing and debugging of local objects, run
# "pip install pyspark==X.Y.Z", where "X.Y.Z"
# matches the version of PySpark
# on your target clusters.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from datetime import date
spark = SparkSession.builder.appName("dbx-demo").getOrCreate()
# Create a DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
StructField('AirportCode', StringType(), False),
StructField('Date', DateType(), False),
StructField('TempHighF', IntegerType(), False),
StructField('TempLowF', IntegerType(), False)
])
data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]
temps = spark.createDataFrame(data, schema)
# Create a table on the cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS demo_temps_table')
temps.write.saveAsTable('demo_temps_table')
# Query the table on the cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM demo_temps_table " \
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
"GROUP BY AirportCode, Date, TempHighF, TempLowF " \
"ORDER BY TempHighF DESC")
df_temps.show()
# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+
# Clean up by deleting the table from the cluster.
spark.sql('DROP TABLE demo_temps_table')
Scala

package com.example.demo
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date
object SampleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().master("local").getOrCreate()
val schema = StructType(Array(
StructField("AirportCode", StringType, false),
StructField("Date", DateType, false),
StructField("TempHighF", IntegerType, false),
StructField("TempLowF", IntegerType, false)
))
val data = List(
Row("BLI", Date.valueOf("2021-04-03"), 52, 43),
Row("BLI", Date.valueOf("2021-04-02"), 50, 38),
Row("BLI", Date.valueOf("2021-04-01"), 52, 41),
Row("PDX", Date.valueOf("2021-04-03"), 64, 45),
Row("PDX", Date.valueOf("2021-04-02"), 61, 41),
Row("PDX", Date.valueOf("2021-04-01"), 66, 39),
Row("SEA", Date.valueOf("2021-04-03"), 57, 43),
Row("SEA", Date.valueOf("2021-04-02"), 54, 39),
Row("SEA", Date.valueOf("2021-04-01"), 56, 41)
)
val rdd = spark.sparkContext.makeRDD(data)
val temps = spark.createDataFrame(rdd, schema)
// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default")
spark.sql("DROP TABLE IF EXISTS demo_temps_table")
temps.write.saveAsTable("demo_temps_table")
// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
val df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC")
df_temps.show()
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table")
}
}
Java

package com.example.demo;
import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;
public class SampleApp {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Temps Demo")
.config("spark.master", "local")
.getOrCreate();
// Create a Spark DataFrame consisting of high and low temperatures
// by airport code and date.
StructType schema = new StructType(new StructField[] {
new StructField("AirportCode", DataTypes.StringType, false, Metadata.empty()),
new StructField("Date", DataTypes.DateType, false, Metadata.empty()),
new StructField("TempHighF", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("TempLowF", DataTypes.IntegerType, false, Metadata.empty()),
});
List<Row> dataList = new ArrayList<Row>();
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-03"), 52, 43));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-02"), 50, 38));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-01"), 52, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-03"), 64, 45));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-02"), 61, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-01"), 66, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-03"), 57, 43));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-02"), 54, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-01"), 56, 41));
Dataset<Row> temps = spark.createDataFrame(dataList, schema);
// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default");
spark.sql("DROP TABLE IF EXISTS demo_temps_table");
temps.write().saveAsTable("demo_temps_table");
// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
Dataset<Row> df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC");
df_temps.show();
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table");
}
}
Next steps
Extend your conf/deployment.yaml file to support various types of all-purpose and jobs cluster definitions.
Declare multitask jobs in your conf/deployment.yaml file (see the sketch after this list).
Reference named properties in your conf/deployment.yaml file.
Batch run code as new jobs on clusters with the dbx execute command.
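As an illustration of declaring a multitask job, the workflow definition in conf/deployment.yaml follows the Jobs API 2.1 task format. The sketch below is an assumption based on that format (the task keys and file paths are placeholders, and cluster settings are omitted for brevity); see the dbx documentation for the authoritative schema:

environments:
  default:
    workflows:
      - name: "dbx-demo-multitask-job"
        tasks:
          - task_key: "prepare"
            spark_python_task:
              python_file: "file://prepare.py"
          - task_key: "train"
            depends_on:
              - task_key: "prepare"
            spark_python_task:
              python_file: "file://train.py"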
Additional resources
Batch deploy code artifacts to Databricks workspace storage with the dbx deploy command.
Batch run existing jobs on clusters with the dbx launch command.
Learn more about dbx and CI/CD.
databrickslabs/dbx repository on GitHub