CI/CD with Jenkins on Databricks
Note
This article covers Jenkins, which is neither provided nor supported by Databricks. To contact the provider, see Jenkins Help.
This article shows you how to set up a CI/CD pipeline on Databricks using Jenkins, an open source automation server you can use to build, test, and deploy software.
For an overview of CI/CD on Databricks, see What is CI/CD on Databricks?.
Develop and commit your code
One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to manage the development and integration of new and updated code without adversely affecting the code currently in production. Part of this decision involves choosing a version control system to contain your code and facilitate the promotion of that code. Databricks enables you to commit notebooks to Git repositories supported by several Git providers. See Supported Git providers.
If your version control system is not among those supported through direct notebook integration, or if you want more flexibility and control than the self-service Git integration provides, you can use the Databricks CLI to export notebooks and commit them from your local machine. The following script should be run from within a local Git repository that is set up to sync with the appropriate remote repository. When run, this script:
Checks out the desired branch.
Pulls new changes from the remote branch.
Exports notebooks from the Databricks workspace using the Databricks CLI.
Prompts the user for a commit message, or uses the default if one is not provided.
Commits the updated notebooks to the local branch.
Pushes the changes to the remote branch.
The following script performs these steps. In this script, replace the following placeholders:
Replace <branch> with the name of the target branch that you want to push the changes to.
Replace <workspace-directory-path> with the path in the Databricks workspace that contains the notebooks to export.
Replace <local-directory-path> with the path on your local machine to put the exported notebooks. This should be the path to the local Git repository that is set up to sync with the appropriate remote repository.
git checkout <branch>
git pull
databricks workspace export-dir <workspace-directory-path> <local-directory-path> --overwrite
dt=`date '+%Y-%m-%d %H:%M:%S'`
msg_default="Databricks export on $dt"
read -p "Enter the commit comment [$msg_default]: " msg
msg=${msg:-$msg_default}
echo $msg
git add .
git commit -m "$msg"
git push
The preceding Databricks CLI command applies to Databricks CLI versions 0.200 and above.
If you prefer to develop in an integrated development environment (IDE) rather than in Databricks notebooks, you can use the version control system integration features built into modern IDEs or the Git CLI to commit your code.
Databricks provides Databricks Connect to connect local IDEs to remote Databricks clusters. This is especially useful when developing libraries, as it allows you to run and unit test your code on Databricks clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use case is supported.
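For example, the following smoke test (a sketch, assuming Databricks Connect 13.3 is installed and the DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID environment variables described later in this article are set) confirms that locally written code executes on the remote cluster:

# Minimal Databricks Connect smoke test: the DataFrame operation runs remotely.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
print(spark.range(3).collect())  # expect [Row(id=0), Row(id=1), Row(id=2)]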
Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline initiates a build varies. However, committed code from various contributors is eventually merged into a designated branch to be built and deployed. Branch management steps run outside of Databricks, using the interfaces provided by the version control system.
There are numerous CI/CD tools you can use to manage and run your CI/CD pipelines. This article illustrates how to use the Jenkins automation server. CI/CD is a design pattern, so the steps and stages outlined in this article should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of the code in this example pipeline runs standard Python code, which you can invoke in other tools.
If you have not yet installed or started Jenkins, see Installing Jenkins.
Configure your agent
Jenkins uses a central coordinating service along with one or more execution agents. In this article’s example, you use the default permanent agent node included with the Jenkins server. For this example, you must manually install the following tools on the agent, in this case the Jenkins server:
venv: a Python virtual environment management system that is included with Python.
Python 3.10: this example uses Python to run tests, build a Python wheel, and run Python deployment scripts. The version of Python is important, because tests require that the version of Python running on the agent match that of the Databricks cluster. This example uses Databricks Runtime 13.3 LTS, which includes Python 3.10.
To do the manual installation, configure your agent as follows:
Install Python 3.10.
From your local Git repository that is set up to sync with the appropriate remote repository, use venv to create a Python virtual environment within that local repository by running the following command, depending on your agent’s operating system:

# Linux and macOS
python3.10 -m venv ./.venv

# Windows
python3.10 -m venv .\.venv

Add the following entries to your local Git repository’s .gitignore file, so that the Python virtual environment system files are not checked into your version control system:

# Linux and macOS
.venv

# Windows
.venv
Activate the Python virtual environment, depending on your operating system and shell. For specific instructions, see How venvs work. This example assumes macOS and zsh:

source ./.venv/bin/activate
With the virtual environment activated, run the following three pip commands to install the packages that this article’s example relies on:

Databricks Connect (the databricks-connect package), for connecting the agent to a Databricks cluster that is running in the remote Databricks workspace. Databricks recommends that the Databricks Connect package version match the version of the Databricks Runtime that is installed on your remote cluster. This example assumes a cluster that has Databricks Runtime 13.3 LTS installed.
The pytest package, for Python unit testing.
The wheel package, for building Python wheels.

Here are the related commands (on some systems, you might need to replace pip3 with pip):

pip3 install --upgrade "databricks-connect==13.3.*"
pip3 install --upgrade pytest
pip3 install --upgrade wheel
Note
Databricks recommends that you use the “dot-asterisk” notation in the preceding command to specify databricks-connect==13.3.* instead of databricks-connect==13.3, to make sure that the most recent matching package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.

Install Databricks CLI version 0.200 or above. For details, see Install the CLI.
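To confirm that the agent’s tooling is in place, you can run quick checks like the following (a sketch; the exact version strings in the output will vary):

python3 --version              # expect Python 3.10.x
pip3 show databricks-connect   # expect Version: 13.3.*
databricks --version           # expect v0.200 or above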
Design the pipeline
Jenkins provides a few different project types to create CI/CD pipelines. This example implements a Jenkins Pipeline. Jenkins Pipelines provide an interface to define stages in a Pipeline using Groovy code to call and configure Jenkins plugins.

You write a Pipeline definition in a text file (called a Jenkinsfile) which in turn is checked into a project’s source control repository. For more information, see Jenkins Pipeline. Here is an example Pipeline. In this example Jenkinsfile, replace the following placeholders:
Replace <path-to-local-cloned-git-repo> with the path to the local Git repository that is set up to sync with the appropriate remote repository. For example, on macOS this could be /Users/<your-username>/jenkins-databricks-demo.
Replace <repo-owner>/<repo-name> with the individual or organization owner name and the name of your Git repository that is stored with your third-party Git provider. This example assumes that your Git repository is stored in GitHub. For example, this could be https://github.com/<your-username>/jenkins-databricks-demo.
Replace <credential-id> with the Jenkins credential ID that maps to your GitHub personal access token, to enable Jenkins to pull from your remote Git repository. To create a GitHub personal access token, see Managing your personal access tokens in the GitHub documentation. To provide Jenkins with your GitHub personal access token, from your Jenkins Dashboard:
On the sidebar, click Manage Jenkins.
In the Security section, click (global).
Click Add Credentials.
For Kind, select Secret text.
For Secret, enter your GitHub personal access token.
For ID, enter a meaningful ID, for example github-token. This is the Jenkins credential ID for <credential-id>.
Replace <release-branch-name> with the name of the release branch in your Git repository. For example, this could be main.
Replace <workspace-path> with the path in your Databricks workspace where your Databricks notebooks are stored. For example, this could be /Workspace/Users/<your-username>/jenkins-databricks-demo.
Replace <dbfs-path> with the DBFS path from your Databricks workspace where you will store your Python wheel builds. The Jenkinsfile prefixes this value with dbfs:/, so for example jenkins-databricks-demo here becomes dbfs:/jenkins-databricks-demo.
Replace <databricks-cli-path> with the path on your local machine where the Databricks CLI is installed. For example, on macOS this could be /usr/local/bin.
Also, set the following environment variables on the Jenkins server:
DATABRICKS_TOKEN, set to the value of your Databricks personal access token for your Databricks workspace. To create a Databricks personal access token, see Databricks personal access tokens for workspace users.
Note
As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
DATABRICKS_HOST, set to your Databricks workspace URL, starting with https://. See Workspace instance names, URLs, and IDs.
DATABRICKS_CLUSTER_ID, set to the ID of your Databricks cluster. See Cluster URL and ID.
To set environment variables on your Jenkins server, on the Jenkins Dashboard:
In the sidebar, click Manage Jenkins.
In the System Configuration section, click System.
In the Global properties section, select the Environment variables checkbox.
Click Add and then enter the environment variable’s Name and Value. Repeat this for each additional environment variable.
When you are finished adding environment variables, click Save to return to the Jenkins Dashboard.
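With these variables in place, you can optionally run a quick connectivity check from the agent, because both the Databricks CLI and the Databricks SDK for Python read DATABRICKS_HOST and DATABRICKS_TOKEN from the environment (a sketch):

databricks clusters get $DATABRICKS_CLUSTER_ID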
Here is the Jenkinsfile, which should be added to the root of the local Git repository:
// Filename: Jenkinsfile
node {
def GITREPO = "<path-to-local-cloned-git-repo>"
def GITREPOREMOTE = "https://github.com/<repo-owner>/<repo-name>"
def GITHUBCREDID = "<credential-id>"
def CURRENTRELEASE = "<release-branch-name>"
def SCRIPTPATH = "${GITREPO}/Automation/Deployments"
def NOTEBOOKPATH = "${GITREPO}/Workspace"
def LIBRARYPATH = "${GITREPO}/Libraries"
def BUILDPATH = "${GITREPO}/Builds/${env.JOB_NAME}-${env.BUILD_NUMBER}"
def OUTFILEPATH = "${BUILDPATH}/Validation/Output"
def TESTRESULTPATH = "${BUILDPATH}/Validation/reports/junit/test-reports"
def WORKSPACEPATH = "<workspace-path>"
def DBFSPATH = "dbfs:/<dbfs-path>"
def VENVPATH = "${GITREPO}/.venv"
def DBCLIPATH = "<databricks-cli-path>"
stage('Checkout') {
withCredentials ([string(credentialsId: GITHUBCREDID, variable: 'GITHUB_TOKEN')]) {
echo "Pulling branch '${CURRENTRELEASE}' from GitHub..."
git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
}
}
stage('Run Unit Tests') {
try {
sh """#!/bin/bash
# Enable venv environment for tests
source ${VENVPATH}/bin/activate
# Python tests for libs
cd ${LIBRARYPATH}/python/dbxdemo/dbxdemo
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml test_*.py || true
"""
} catch(err) {
step([$class: 'JUnitResultArchiver', testResults: '**/TEST-*.xml'])
if (currentBuild.result == 'UNSTABLE')
currentBuild.result = 'FAILURE'
throw err
}
}
stage('Package') {
sh """#!/bin/bash
# Enable venv environment for packaging
source ${VENVPATH}/bin/activate
# Package Python wheel
cd ${LIBRARYPATH}/python/dbxdemo
python3 setup.py sdist bdist_wheel
"""
}
stage('Build Artifact') {
sh """mkdir -p ${BUILDPATH}/Workspace
mkdir -p ${BUILDPATH}/Libraries/python
mkdir -p ${BUILDPATH}/Validation/Output
#Get modified files
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp -r '{}' ${BUILDPATH}
# Get packaged libs
find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/
"""
}
stage('Deploy') {
sh """#!/bin/bash
# Copy notebooks into build
cp -R ${NOTEBOOKPATH} ${BUILDPATH}
# Use Databricks CLI to deploy notebooks
${DBCLIPATH}/databricks workspace import-dir ${BUILDPATH}/Workspace ${WORKSPACEPATH} --overwrite
# Use Databricks CLI to deploy Python wheel
${DBCLIPATH}/databricks fs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH} --overwrite
"""
sh """#!/bin/bash
# Enable venv environment for commands
source ${VENVPATH}/bin/activate
# Get space-delimited list of libraries
LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##' | paste -sd " ")
echo \$LIBS
# Script to uninstall, reboot if needed and install library
python3 ${SCRIPTPATH}/installWhlLibrary.py \
--libs="\$LIBS" \
--dbfspath=${DBFSPATH}
"""
}
stage('Run Integration Tests') {
sh """#!/bin/bash
# Enable venv environment for script
source ${VENVPATH}/bin/activate
python3 ${SCRIPTPATH}/executenotebook.py \
--localpath=${NOTEBOOKPATH} \
--workspacepath=${WORKSPACEPATH} \
--outfilepath=${OUTFILEPATH}
"""
sh """#!/bin/bash
sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
"""
}
stage('Report Test Results') {
sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
touch ${TESTRESULTPATH}/TEST-*.xml
"""
junit allowEmptyResults: true, testResults: '**/test-reports/*.xml'
}
}
Create the following folder structure in your local Git repo, to store all of the subfolders and files for this article’s example. Note that the .venv folder and the Jenkinsfile file will already exist in your local Git repository.
|-- .venv
|-- Automation
| `-- Deployments
|-- Builds
|-- Libraries
|-- Workspace
`-- Jenkinsfile
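One way to create the missing folders from the repository root is the following (a sketch; the Pipeline also creates per-build subfolders under Builds on demand):

mkdir -p Automation/Deployments Builds Libraries Workspace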
The remainder of this article discusses each step in the Pipeline.
Define local environment variables
This article’s example Pipeline uses the following local environment variables:
GITREPO: The path to your local Git repository.
GITREPOREMOTE: The URL to your remote Git repository.
GITHUBCREDID: The Jenkins credential ID that maps to your GitHub personal access token.
CURRENTRELEASE: The name of your Git repository’s deployment branch.
SCRIPTPATH: The path in your local Git repository that contains the required automation scripts for this article’s example.
NOTEBOOKPATH: The path in your local Git repository that contains the required notebook for this article’s example.
LIBRARYPATH: The path in your local Git repository that contains the required files for building this article’s example Python wheel.
BUILDPATH: The path in your local Git repository to contain all of the build artifacts that Jenkins generates.
OUTFILEPATH: The path in your local Git repository to contain the JSON result files that Jenkins generates from automated tests.
TESTRESULTPATH: The path in your local Git repository to contain the JUnit test result summaries that Jenkins generates.
WORKSPACEPATH: The Databricks workspace path where your Databricks notebooks are stored for this article’s example.
DBFSPATH: The DBFS path from your Databricks workspace where your Python wheel and non-notebook code are stored for this article’s example.
VENVPATH: The path in your local Git repository to your Python virtual environment system files.
DBCLIPATH: The path on your local machine to your Databricks CLI installation.
Get the latest code changes
The Checkout stage downloads code from the designated branch to the execution agent by using a Jenkins plugin:
stage('Checkout') {
withCredentials ([string(credentialsId: GITHUBCREDID, variable: 'GITHUB_TOKEN')]) {
echo "Pulling branch '${CURRENTRELEASE}' from GitHub..."
git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
}
}
Develop unit tests
There are a few different options when deciding how to unit test your code. For library code developed outside of a Databricks notebook, the process is like traditional software development practices. You write a unit test using a testing framework, like the Python pytest module, and JUnit-formatted XML files store the test results.
The Databricks process differs in that the code being tested is Apache Spark code intended to be run on a Spark cluster, often running locally or, in this case, on Databricks. To accommodate this requirement, you use Databricks Connect. Since Databricks Connect was configured earlier, no changes to the test code are required to run the tests on Databricks clusters. You installed Databricks Connect in a venv virtual environment. Once the venv environment is activated, the tests are run by using the Python tool pytest, to which you provide the locations for the tests and the resulting output files.
Test library code using Databricks Connect
The Run Unit Tests stage uses pytest to test a library to make sure it works before packaging it into a Python wheel:
stage('Run Unit Tests') {
try {
sh """#!/bin/bash
# Enable venv environment for tests
source ${VENVPATH}/bin/activate
# Python tests for libs
cd ${LIBRARYPATH}/python/dbxdemo/dbxdemo
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml test_*.py || true
"""
} catch(err) {
step([$class: 'JUnitResultArchiver', testResults: '**/TEST-*.xml'])
if (currentBuild.result == 'UNSTABLE')
currentBuild.result = 'FAILURE'
throw err
}
}
To have Jenkins run this test, create two files named addcol.py and test_addcol.py, and add them to a folder structure named python/dbxdemo/dbxdemo within your local Git repo’s Libraries folder, visualized as follows (ellipses indicate omitted folders in the repo, for brevity):
|-- ...
|-- Libraries
| `-- python
| `-- dbxdemo
| `-- dbxdemo
| |-- addcol.py
| `-- test_addcol.py
|-- ...
The addcol.py file contains a library function that might be built into a Python wheel and then installed on a Databricks cluster. It is a simple function that adds a new column, populated by a literal, to an Apache Spark DataFrame. This file also contains a simple function that gets a running DatabricksSession from Databricks Connect. This DatabricksSession is similar to a SparkSession running on a Spark cluster:
# Filename: addcol.py
import pyspark.sql.functions as F
from databricks.connect import DatabricksSession
def get_spark():
spark = DatabricksSession.builder.getOrCreate()
return spark
def with_status(df):
return df.withColumn("status", F.lit("checked"))
The test_addcol.py file contains tests that pass a mock DataFrame object to the with_status function, defined in addcol.py. The result is then compared to a DataFrame object containing the expected values. If the values match, which in this case they do, the test passes:
# Filename: test_addcol.py
import pytest
from addcol import *
class TestAppendCol(object):
def test_with_status(self):
source_data = [
("paula", "white", "paula.white@example.com"),
("john", "baer", "john.baer@example.com")
]
source_df = get_spark().createDataFrame(
source_data,
["first_name", "last_name", "email"]
)
actual_df = with_status(source_df)
expected_data = [
("paula", "white", "paula.white@example.com", "checked"),
("john", "baer", "john.baer@example.com", "checked")
]
expected_df = get_spark().createDataFrame(
expected_data,
["first_name", "last_name", "email", "status"]
)
assert(expected_df.collect() == actual_df.collect())
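If you want to exercise this test locally before wiring it into Jenkins (a sketch, assuming the venv is activated and Databricks Connect is configured as described earlier), you can run pytest from the folder that contains the test, so that addcol is importable from the current directory:

cd Libraries/python/dbxdemo/dbxdemo
python3 -m pytest -v test_addcol.py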
Package the library code
The Package stage packages the library code into a Python wheel:
stage('Package') {
sh """#!/bin/bash
# Enable venv environment for packaging
source ${VENVPATH}/bin/activate
# Package Python wheel
cd ${LIBRARYPATH}/python/dbxdemo
python3 setup.py sdist bdist_wheel
"""
}
To have Jenkins package this library code into a Python wheel, create two files named __init__.py and __main__.py in the same folder as the preceding two files. Also, create a file named setup.py in the python/dbxdemo folder, visualized as follows (ellipses indicate omitted folders, for brevity):
|-- ...
|-- Libraries
| `-- python
| `-- dbxdemo
| |-- dbxdemo
| | |-- __init__.py
| | |-- __main__.py
| | |-- addcol.py
| | `-- test_addcol.py
| `-- setup.py
|-- ...
The __init__.py file contains the library’s version number and author. Replace <my-author-name> with your name:
# Filename: __init__.py
__version__ = '0.0.1'
__author__ = '<my-author-name>'
import sys, os
sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
The __main__.py file contains the library’s entry point:
# Filename: __main__.py
import sys, os
sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
from addcol import *
def main():
pass
if __name__ == "__main__":
main()
The setup.py file contains additional settings for building the library into a Python wheel. Replace <my-url>, <my-author-name>@<my-organization>, and <my-package-description> with meaningful values:
# Filename: setup.py
from setuptools import setup, find_packages
import dbxdemo
setup(
name = "dbxdemo",
version = dbxdemo.__version__,
author = dbxdemo.__author__,
url = "https://<my-url>",
author_email = "<my-author-name>@<my-organization>",
description = "<my-package-description>",
packages = find_packages(include = ["dbxdemo"]),
entry_points={"group_1": "run=dbxdemo.__main__:main"},
install_requires = ["setuptools"]
)
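To verify the packaging step locally before running the Pipeline (a sketch, assuming the venv with the wheel package is activated), you can build the wheel by hand; the artifacts land in the dist folder:

cd Libraries/python/dbxdemo
python3 setup.py sdist bdist_wheel
ls dist/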
Add the notebook
To have Jenkins call this Python wheel from a notebook, create a file named dbxdemo-notebook.py and add it to your local Git repo’s Workspace folder, visualized as follows (ellipses indicate omitted folders in the repo, for brevity):
|-- ...
|-- Workspace
| `-- dbxdemo-notebook.py
|-- ...
Add the following content to this notebook. This notebook creates a DataFrame object, passes it to the library’s with_status function, and prints the result:
# Databricks notebook source
from dbxdemo.addcol import with_status
df = (spark.createDataFrame(
schema = ["first_name", "last_name", "email"],
data = [
("paula", "white", "paula.white@example.com"),
("john", "baer", "john.baer@example.com")
]
))
new_df = with_status(df)
display(new_df)
# Output:
#
# +------------+-----------+-------------------------+---------+
# | first_name | last_name | email | status |
# +============+===========+=========================+=========+
# | paula | white | paula.white@example.com | checked |
# +------------+-----------+-------------------------+---------+
# | john | baer | john.baer@example.com | checked |
# +------------+-----------+-------------------------+---------+
Generate and store the deployment artifact
The Build Artifact stage gathers the Python wheel and the notebook code to be deployed to the workspace in a later stage, and creates the folder for the test result summaries. This stage uses git diff to flag all new files that have been included in the most recent Git merge. This is only an example method, so the implementation in your Pipeline might differ, but the objective is to add all files intended for the current release.
stage('Build Artifact') {
sh """mkdir -p ${BUILDPATH}/Workspace
mkdir -p ${BUILDPATH}/Libraries/python
mkdir -p ${BUILDPATH}/Validation/Output
#Get modified files
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp -r '{}' ${BUILDPATH}
# Get packaged libs
find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/
"""
}
Deploy artifacts
The Deploy stage instructs Jenkins to copy the notebook into the local deployment folder, uses the Databricks CLI to upload the notebook and library from the local deployment folder into the Databricks workspace, and then installs the deployed library onto the Databricks cluster:
stage('Deploy') {
sh """#!/bin/bash
# Copy notebooks into build
cp -R ${NOTEBOOKPATH} ${BUILDPATH}
# Use Databricks CLI to deploy notebooks
${DBCLIPATH}/databricks workspace import-dir ${BUILDPATH}/Workspace ${WORKSPACEPATH} --overwrite
# Use Databricks CLI to deploy Python wheel
${DBCLIPATH}/databricks fs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH} --overwrite
"""
sh """#!/bin/bash
# Enable venv environment for commands
source ${VENVPATH}/bin/activate
# Get space-delimited list of libraries
LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##' | paste -sd " ")
echo \$LIBS
# Script to uninstall, reboot if needed and install library
python3 ${SCRIPTPATH}/installWhlLibrary.py \
--libs="\$LIBS" \
--dbfspath=${DBFSPATH}
"""
}
Installing a new version of a library on a Databricks cluster requires that any previously installed version of the library be uninstalled first. The following Python script uses the Databricks SDK for Python to do this, as follows:
Check whether the library is already installed.
Uninstall the library if it was found.
Restart the cluster if any uninstalls were performed.
Wait until the cluster is running again before proceeding.
Install the library.
To have Jenkins run this script, create a file named installWhlLibrary.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):
|-- ...
|-- Automation
| `-- Deployments
| `-- installWhlLibrary.py
|-- ...
# Filename: installWhlLibrary.py
#!/usr/bin/python3
import sys
import getopt
import time
import os
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library
def main():
libs = ''
dbfspath = ''
cluster_id = os.getenv('DATABRICKS_CLUSTER_ID')
w = WorkspaceClient()
try:
opts, args = getopt.getopt(sys.argv[1:], 'hl:d:',
    ['libs=', 'dbfspath='])
except getopt.GetoptError:
print(
'installWhlLibrary.py -l <libs> -d <dbfspath>')
sys.exit(2)
for opt, arg in opts:
if opt == '-h':
print(
'installWhlLibrary.py -l <libs> -d <dbfspath>')
sys.exit()
elif opt in ('-l', '--libs'):
libs=arg
elif opt in ('-d', '--dbfspath'):
dbfspath=arg
print('-l is ' + libs)
print('-d is ' + dbfspath)
libslist = libs.split()
# Uninstall the library if it already exists on the cluster.
i = 0
for lib in libslist:
dbfslib = dbfspath + "/" + lib
print("'" + dbfslib + "' before: " + getLibStatus(w, dbfslib, cluster_id))
if (getLibStatus(w, dbfslib, cluster_id) != 'not found'):
print("'" + lib + "' exists. Uninstalling. Please wait...")
i = i + 1
values = [Library(whl = dbfslib)]
w.libraries.uninstall(cluster_id = cluster_id, libraries = values)
print("'" + dbfslib + "' after: " + getLibStatus(w, dbfslib, cluster_id))
# Restart the cluster if the libraries were uninstalled.
if i > 0:
print("Restarting cluster '" + cluster_id + "'. This could take several minutes. Please wait...")
w.clusters.restart_and_wait(cluster_id = cluster_id)
p = 0
waiting = True
while waiting:
time.sleep(30)
resp = w.clusters.get(cluster_id = cluster_id)
print("Current cluster state: " + resp.state.name)
if resp.state.name in ['RUNNING','INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
break
p = p + 1
# Install the libraries.
for lib in libslist:
dbfslib = dbfspath + "/" + lib
print("Installing '" + dbfslib + "'. Please wait...")
values = [Library(whl = dbfslib)]
w.libraries.install(cluster_id = cluster_id, libraries = values)
print("'" + dbfslib + "' after: " + getLibStatus(w, dbfslib, cluster_id))
def getLibStatus(workspace_client, dbfslib, cluster_id):
    resp = workspace_client.libraries.cluster_status(cluster_id = cluster_id)
    if resp.library_statuses:
        for status in resp.library_statuses:
            if (status.library.whl == dbfslib):
                return status.status.name
    # No matching wheel was found among the cluster's library statuses.
    return 'not found'
if __name__ == '__main__':
main()
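For reference, the Deploy stage invokes this script roughly as follows (a sketch; the wheel filename is illustrative, and the DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID environment variables must be set):

python3 Automation/Deployments/installWhlLibrary.py \
  --libs="dbxdemo-0.0.1-py3-none-any.whl" \
  --dbfspath=dbfs:/jenkins-databricks-demo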
Test notebook code by using another notebook
Once the artifact has been deployed, it is important to run integration tests to ensure that all the code is working together in the new environment. To do this you can run a notebook containing asserts to test the deployment. In this case you are using the same test you used in the unit test, but now it imports the installed addcol module from the dbxdemo Python wheel that was just installed on the cluster.
To automate this test and include it in your CI/CD Pipeline, instruct Jenkins to use the Databricks SDK for Python to run the notebook from the Jenkins server. This allows you to check whether the notebook run passed or failed by using pytest. If the asserts in the notebook fail, this shows in the output returned by the SDK and subsequently in the JUnit test results.
stage('Run Integration Tests') {
sh """#!/bin/bash
# Enable venv environment for script
source ${VENVPATH}/bin/activate
python3 ${SCRIPTPATH}/executenotebook.py \
--localpath=${NOTEBOOKPATH} \
--workspacepath=${WORKSPACEPATH} \
--outfilepath=${OUTFILEPATH}
"""
sh """#!/bin/bash
sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
"""
}
This stage calls two Python scripts. The first script runs the notebook as an anonymous job. Once the job has completed, the JSON output is saved to the path specified by the arguments passed at invocation.
To have Jenkins run this script, create a file named executenotebook.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):
|-- ...
|-- Automation
| `-- Deployments
| |-- executenotebook.py
| `-- installWhlLibrary.py
|-- ...
# Filename: executenotebook.py
#!/usr/bin/python3
import json
import os
import sys
import getopt
import time
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import SubmitTask, NotebookTask
def main():
cluster_id = os.getenv('DATABRICKS_CLUSTER_ID')
localpath = ''
workspacepath = ''
outfilepath = ''
w = WorkspaceClient()
try:
opts, args = getopt.getopt(sys.argv[1:], 'hl:w:o:',
    ['localpath=', 'workspacepath=', 'outfilepath='])
except getopt.GetoptError:
print(
'executenotebook.py -l <localpath> -w <workspacepath> -o <outfilepath>')
sys.exit(2)
for opt, arg in opts:
if opt == '-h':
print(
'executenotebook.py -l <localpath> -w <workspacepath> -o <outfilepath>')
sys.exit()
elif opt in ('-l', '--localpath'):
localpath = arg
elif opt in ('-w', '--workspacepath'):
workspacepath = arg
elif opt in ('-o', '--outfilepath'):
outfilepath = arg
print('-l is ' + localpath)
print('-w is ' + workspacepath)
print('-o is ' + outfilepath)
# Generate an array of notebook names and locations by traversing the local path.
notebooks = []
for path, subdirs, files in os.walk(localpath):
for name in files:
fullpath = path + '/' + name
# Remove the local path to the repo but keep the workspace path.
fullworkspacepath = workspacepath + path.replace(localpath, '')
name, file_extension = os.path.splitext(fullpath)
if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
row = [fullpath, fullworkspacepath, 1]
notebooks.append(row)
# For each notebook found:
for notebook in notebooks:
nameonly = os.path.basename(notebook[0])
workspacepath = notebook[1]
name, file_extension = os.path.splitext(nameonly)
# Remove the file extension.
fullworkspacepath = workspacepath + '/' + name
print("Running job for: '" + fullworkspacepath + "'. This could take several minutes. Please wait...")
resp = w.jobs.submit_and_wait(
run_name = name,
tasks = [SubmitTask(
task_key = name,
existing_cluster_id = cluster_id,
timeout_seconds = 3600,
notebook_task = NotebookTask(
notebook_path = fullworkspacepath,
base_parameters = {}
)
)]
)
i = 0
waiting = True
while waiting:
time.sleep(10)
print("Current job state: " + resp.state.life_cycle_state.name)
if resp.state.life_cycle_state.name in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
break
i = i + 1
if outfilepath != '':
file = open(outfilepath + '/' + str(resp.run_id) + '.json', 'w')
file.write(json.dumps(resp.as_dict()))
file.close()
if __name__ == '__main__':
main()
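For reference, the Run Integration Tests stage invokes this script roughly as follows (a sketch; the paths are illustrative, the output folder must already exist, and the same three Databricks environment variables must be set):

python3 Automation/Deployments/executenotebook.py \
  --localpath=./Workspace \
  --workspacepath=/Workspace/Users/<your-username>/jenkins-databricks-demo \
  --outfilepath=./Builds/manual-run/Validation/Output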
The second Python script defines the test_job_run function, which parses and evaluates the JSON to determine whether the assert statements within the notebook passed or failed. An additional test, test_performance, catches tests that run longer than expected.
To have Jenkins run this script, create a file named evaluatenotebookruns.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):
|-- ...
|-- Automation
| `-- Deployments
| |-- evaluatenotebookruns.py
| |-- executenotebook.py
| `-- installWhlLibrary.py
|-- ...
# Filename: evaluatenotebookruns.py
import unittest
import json
import glob
import os
class TestJobOutput(unittest.TestCase):
test_output_path = '#ENV#'
def test_performance(self):
path = self.test_output_path
statuses = []
for filename in glob.glob(os.path.join(path, '*.json')):
print('Evaluating: ' + filename)
data = json.load(open(filename))
duration = data['execution_duration']
if duration > 100000:
status = 'FAILED'
else:
status = 'SUCCESS'
statuses.append(status)
self.assertFalse('FAILED' in statuses)
def test_job_run(self):
path = self.test_output_path
statuses = []
for filename in glob.glob(os.path.join(path, '*.json')):
print('Evaluating: ' + filename)
data = json.load(open(filename))
status = data['state']['result_state']
statuses.append(status)
self.assertFalse('FAILED' in statuses)
if __name__ == '__main__':
unittest.main()
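For reference, here is an abridged sketch of the saved run result JSON that these tests read; the values are illustrative, and execution_duration is in milliseconds:

{
  "run_id": 12345,
  "execution_duration": 57000,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS"
  }
}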
As seen earlier in the unit test stage, Jenkins uses pytest to run the tests and generate the result summaries.
Publish test results
The JSON results are archived, and the test results are published to Jenkins by using the junit Jenkins plugin. This enables you to visualize reports and dashboards related to the status of the build process.
stage('Report Test Results') {
sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
touch ${TESTRESULTPATH}/TEST-*.xml
"""
junit allowEmptyResults: true, testResults: '**/test-reports/*.xml'
}

Push your code to version control
You should now push the contents of your local Git repository to your remote Git repository. Before you do, first add the following entries to the .gitignore file, as you should probably not push local Python build files and Python caches to your remote. Typically, you want to regenerate the latest builds locally and then copy those locally tested builds into your remote workspace:
.venv
*__pycache__/
*.pytest_cache
Builds/
*.egg-info
Libraries/python/dbxdemo/build/
Libraries/python/dbxdemo/dist/
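After updating .gitignore, you can push with the usual Git commands (a sketch; <release-branch-name> is the branch your Pipeline is configured to build):

git add .
git commit -m "Add Jenkins Pipeline and supporting files"
git push origin <release-branch-name>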
Run your Jenkins Pipeline
You should now be ready to run your Jenkins Pipeline, if you have not done so already. To do this, from your Jenkins Dashboard:
On the sidebar, click New Item.
Enter a name for the Pipeline, for example jenkins-databricks-demo.
Click Pipeline and then click OK.
In the Pipeline section, for Definition, select Pipeline script from SCM.
For SCM, select Git.
For Repositories > Repository URL, enter the URL to your remote Git repository, for example https://github.com/<your-username>/jenkins-databricks-demo.
For Branches to build > Branch Specifier, enter */<release-branch-name>.
Click Save.
Make sure that your target cluster is running in your Databricks workspace.
On the sidebar, click Build Now.
To see the results, click the latest Pipeline run (for example, #1) and then click Console Output.
At this point, the CI/CD pipeline has completed an integration and deployment cycle. By automating this process, you can ensure that your code has been tested and deployed by an efficient, consistent, and repeatable process.