CI/CD with Jenkins on Databricks

Note

This article covers Jenkins, which is neither provided nor supported by Databricks. To contact the provider, see Jenkins Help.

This article shows you how to set up a CI/CD pipeline on Databricks using Jenkins, an open source automation server you can use to build, test, and deploy software.

For an overview of CI/CD on Databricks, see What is CI/CD on Databricks?.

Develop and commit your code

One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to manage the development and integration of new and updated code without adversely affecting the code currently in production. Part of this decision involves choosing a version control system to contain your code and facilitate the promotion of that code. Databricks enables you to commit notebooks to Git repositories supported by several Git providers. See Supported Git providers.

If your version control system is not among those supported through direct notebook integration, or if you want more flexibility and control than the self-service Git integration provides, you can use the Databricks CLI to export notebooks and commit them from your local machine. The script below should be run from within a local Git repository that is set up to sync with the appropriate remote repository. When run, this script:

  1. Checks out the desired branch.

  2. Pulls new changes from the remote branch.

  3. Exports notebooks from the Databricks workspace using the Databricks CLI.

  4. Prompts the user for a commit message, or uses the default if one is not provided.

  5. Commits the updated notebooks to the local branch.

  6. Pushes the changes to the remote branch.

The following script performs these steps. In this script, replace the following placeholders:

  • Replace <branch> with the name of the target branch that you want to push the changes to.

  • Replace <workspace-directory-path> with the path in the Databricks workspace that contains the notebooks to export.

  • Replace <local-directory-path> with the path on your local machine to put the exported notebooks. This should be the path to the local Git repository that is set up to sync with the appropriate remote repository.

git checkout <branch>
git pull
databricks workspace export-dir <workspace-directory-path> <local-directory-path> --overwrite

dt=`date '+%Y-%m-%d %H:%M:%S'`
msg_default="Databricks export on $dt"
read -p "Enter the commit comment [$msg_default]: " msg
msg=${msg:-$msg_default}
echo $msg

git add .
git commit -m "$msg"
git push

The preceding Databricks CLI command applies to Databricks CLI versions 0.200 and above.

If you prefer to develop in an integrated development environment (IDE) rather than in Databricks notebooks, you can use the version control system integration features built into modern IDEs or the Git CLI to commit your code.

Databricks provides Databricks Connect to connect local IDEs to remote Databricks clusters. This is especially useful when developing libraries, as it allows you to run and unit test your code on Databricks clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use case is supported.
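
For example, the following connectivity check confirms that locally run code executes on the remote cluster. This is a minimal sketch, which assumes that Databricks Connect authentication for your workspace is already configured (for example, through environment variables or a configuration profile):

# A quick Databricks Connect connectivity check (sketch; assumes
# authentication to the remote workspace is already configured).
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# This simple query runs on the remote Databricks cluster, not locally.
print(spark.range(3).collect())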

Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline initiates a build varies. However, committed code from various contributors is eventually merged into a designated branch to be built and deployed. Branch management steps run outside of Databricks, using the interfaces provided by the version control system.

There are numerous CI/CD tools you can use to manage and run your CI/CD pipelines. This article illustrates how to use the Jenkins automation server. CI/CD is a design pattern, so the steps and stages outlined in this article should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of the code in this example pipeline runs standard Python code, which you can invoke in other tools.

If you have not yet installed or started Jenkins, see Installing Jenkins.

Configure your agent

Jenkins uses a centralized service for coordination and one-to-many execution agents. In this article’s example, you use the default permanent agent node included with the Jenkins server. For this example, you must manually install the following tools on the agent, in this case the Jenkins server:

  • venv: a Python virtual environment management system that is included with Python.

  • Python 3.10: this example uses Python to run tests, build a Python wheel, and run Python deployment scripts. The version of Python matters, because the tests require that the version of Python running on the agent match the version on the Databricks cluster. This example uses Databricks Runtime 13.3 LTS, which includes Python 3.10.

To do the manual installation, configure your agent as follows:

  1. Install Python 3.10.

  2. From your local Git repository that is set up to sync with the appropriate remote repository, use venv to create a Python virtual environment within that local repository by running the following command depending on your agent’s operating system:

    # Linux and macOS
    python3.10 -m venv ./.venv
    
    # Windows
    python3.10 -m venv .\.venv
    
  3. Add the following entries to your local Git repository’s .gitignore file, so that the Python virtual environment system files are not checked into your version control system:

    # Linux and macOS
    .venv
    
    # Windows
    .venv
    
  4. Activate the Python virtual environment, depending on your operating system and shell. For specific instructions, see How venvs work. This example assumes macOS and zsh:

    source ./.venv/bin/activate
    
  5. With the virtual environment activated, run the following three pip commands to install the following packages that this article’s example relies on:

    • Databricks Connect (the databricks-connect package) for connecting the agent to a Databricks cluster that is running in the remote Databricks workspace. Databricks recommends that the Databricks Connect package version match the version of the Databricks Runtime that is installed on your remote cluster. This example assumes a cluster that has Databricks Runtime 13.3 LTS installed.

    • The pytest package for Python unit testing.

    • The wheel package for building Python wheels.

    Here are the related commands (on some systems, you might need to replace pip3 with pip):

    pip3 install --upgrade "databricks-connect==13.3.*"
    pip3 install --upgrade pytest
    pip3 install --upgrade wheel
    

    Note

    Databricks recommends that you append the “dot-asterisk” notation in the preceding command to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent compatible package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.

  6. Install Databricks CLI version 0.200 or above. For details, see Install the CLI.
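
To confirm that the agent can find a compatible CLI version, you can run a quick check. The following is a minimal sketch; it assumes that the databricks executable is on the agent's PATH:

# Print the Databricks CLI version found on the agent (sketch).
import subprocess

result = subprocess.run(["databricks", "-v"], capture_output=True, text=True)
print(result.stdout.strip())  # Expect version 0.200 or above.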

Design the pipeline

Jenkins provides a few different project types to create CI/CD pipelines. This example implements a Jenkins Pipeline. Jenkins Pipelines provide an interface to define stages in a Pipeline using Groovy code to call and configure Jenkins plugins.

You write a Pipeline definition in a text file (called a Jenkinsfile) which in turn is checked into a project’s source control repository. For more information, see Jenkins Pipeline. Here is an example Pipeline. In this example Jenkinsfile, replace the following placeholders:

  • Replace <path-to-local-cloned-git-repo> with the path to the local Git repository that is set up to sync with the appropriate remote repository. For example, on macOS this could be /Users/<your-username>/jenkins-databricks-demo.

  • Replace <repo-owner>/<repo-name> with the individual or organization owner name and the name of your Git repository that is stored with your third-party Git provider. This example assumes that your Git repository is stored in GitHub. For example, this could be https://github.com/<your-username>/jenkins-databricks-demo.

  • Replace <credential-id> with the Jenkins credential ID that maps to your GitHub personal access token, to enable Jenkins to pull from your remote Git repository. To create a GitHub personal access token, see Managing your personal access tokens in the GitHub documentation.

    To provide Jenkins with your GitHub personal access token, from your Jenkins Dashboard:

    1. On the sidebar, click Manage Jenkins.

    2. In the Security section, click (global).

    3. Click Add Credentials.

    4. For Kind, select Secret text.

    5. For Secret, enter your GitHub personal access token.

    6. For ID, enter a meaningful ID, for example github-token. This is the Jenkins credential ID for <credential-id>.

  • Replace <release-branch-name> with the name of the release branch in your Git repository. For example, this could be main.

  • Replace <workspace-path> with the path to your Databricks workspace where your Databricks notebooks are stored. For example, this could be /Workspace/Users/<your-username>/jenkins-databricks-demo.

  • Replace <dbfs-path> with the path in DBFS where you will store your Python wheel builds, without the dbfs:/ scheme prefix, which the Jenkinsfile adds for you. For example, this could be jenkins-databricks-demo.

  • Replace <databricks-cli-path> with the path on your local machine where the Databricks CLI is installed. For example, on macOS this could be /usr/local/bin.

Also, set the following environment variables on the Jenkins server, so that the Databricks CLI and the automation scripts in this article can authenticate with your Databricks workspace and find your target cluster:

  • DATABRICKS_HOST, set to your Databricks workspace URL, beginning with https://.

  • DATABRICKS_TOKEN, set to a Databricks personal access token for your workspace.

  • DATABRICKS_CLUSTER_ID, set to the ID of the target cluster in your Databricks workspace.

To set environment variables on your Jenkins server, on the Jenkins Dashboard:

  1. In the sidebar, click Manage Jenkins.

  2. In the System Configuration section, click System.

  3. In the Global properties section, select the Environment variables checkbox.

  4. Click Add and then enter the environment variable’s Name and Value. Repeat this for each additional environment variable.

  5. When you are finished adding environment variables, click Save to return to the Jenkins Dashboard.
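
The automation scripts later in this article depend on these variables. As a sketch of the pattern those scripts use, the Databricks SDK for Python picks up DATABRICKS_HOST and DATABRICKS_TOKEN from the environment automatically, while DATABRICKS_CLUSTER_ID is read explicitly:

# How this article's scripts consume the Jenkins environment variables (sketch).
import os
from databricks.sdk import WorkspaceClient

cluster_id = os.getenv('DATABRICKS_CLUSTER_ID')  # Set in Jenkins Global properties.
w = WorkspaceClient()  # Reads DATABRICKS_HOST and DATABRICKS_TOKEN automatically.

# Sanity check: print the target cluster's current state.
print(w.clusters.get(cluster_id = cluster_id).state)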

Here is the Jenkinsfile, which should be added to the root of the local Git repository:

// Filename: Jenkinsfile
node {
  def GITREPO         = "<path-to-local-cloned-git-repo>"
  def GITREPOREMOTE   = "https://github.com/<repo-owner>/<repo-name>"
  def GITHUBCREDID    = "<credential-id>"
  def CURRENTRELEASE  = "<release-branch-name>"
  def SCRIPTPATH      = "${GITREPO}/Automation/Deployments"
  def NOTEBOOKPATH    = "${GITREPO}/Workspace"
  def LIBRARYPATH     = "${GITREPO}/Libraries"
  def BUILDPATH       = "${GITREPO}/Builds/${env.JOB_NAME}-${env.BUILD_NUMBER}"
  def OUTFILEPATH     = "${BUILDPATH}/Validation/Output"
  def TESTRESULTPATH  = "${BUILDPATH}/Validation/reports/junit/test-reports"
  def WORKSPACEPATH   = "<workspace-path>"
  def DBFSPATH        = "dbfs:/<dbfs-path>"
  def VENVPATH        = "${GITREPO}/.venv"
  def DBCLIPATH       = "<databricks-cli-path>"

  stage('Checkout') {
    withCredentials ([string(credentialsId: GITHUBCREDID, variable: 'GITHUB_TOKEN')]) {
      echo "Pulling branch '${CURRENTRELEASE}' from GitHub..."
      git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
    }
  }
  stage('Run Unit Tests') {
    try {
      sh """#!/bin/bash
            # Enable venv environment for tests
            source ${VENVPATH}/bin/activate
            # Python tests for libs
            python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml ${LIBRARYPATH}/python/dbxdemo/dbxdemo/test_*.py || true
           """
    } catch(err) {
      step([$class: 'JUnitResultArchiver', testResults: "${TESTRESULTPATH}/TEST-*.xml"])
      if (currentBuild.result == 'UNSTABLE')
        currentBuild.result = 'FAILURE'
      throw err
    }
  }
  stage('Package') {
    sh """#!/bin/bash
          # Enable venv environment for packaging
          source ${VENVPATH}/bin/activate

          # Package Python wheel
          cd ${LIBRARYPATH}/python/dbxdemo
          python3 setup.py sdist bdist_wheel
       """
  }
  stage('Build Artifact') {
    sh """mkdir -p ${BUILDPATH}/Workspace
          mkdir -p ${BUILDPATH}/Libraries/python
          mkdir -p ${BUILDPATH}/Validation/Output
          #Get modified files
          git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp -r '{}' ${BUILDPATH}

          # Get packaged libs
          find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/
       """
  }
  stage('Deploy') {
    sh """#!/bin/bash
          # Copy notebooks into build
          cp -R ${NOTEBOOKPATH} ${BUILDPATH}

          # Use Databricks CLI to deploy notebooks
          ${DBCLIPATH}/databricks workspace import-dir ${BUILDPATH}/Workspace ${WORKSPACEPATH} --overwrite

          # Use Databricks CLI to deploy Python wheel
          ${DBCLIPATH}/databricks fs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH} --overwrite
       """
    sh """#!/bin/bash
          # Enable venv environment for commands
          source ${VENVPATH}/bin/activate

          # Get space-delimited list of libraries
          LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##')

          echo \$LIBS

          # Script to uninstall, reboot if needed and install library
          python3 ${SCRIPTPATH}/installWhlLibrary.py \
                    --libs=\$LIBS \
                    --dbfspath=${DBFSPATH}
       """
  }
  stage('Run Integration Tests') {
    sh """#!/bin/bash
          # Enable venv environment for script
          source ${VENVPATH}/bin/activate

          python3 ${SCRIPTPATH}/executenotebook.py \
                  --localpath=${NOTEBOOKPATH} \
                  --workspacepath=${WORKSPACEPATH} \
                  --outfilepath=${OUTFILEPATH}
       """
    sh """#!/bin/bash
          sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
          python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
       """
  }
  stage('Report Test Results') {
    sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
          touch ${TESTRESULTPATH}/TEST-*.xml
       """
    junit allowEmptyResults: true, testResults: '**/test-reports/*.xml'
  }
}

Create the following folder structure in your local Git repo, to store all of the subfolders and files for this article’s example. Note that the .venv folder and the Jenkinsfile file will already exist in your local Git repository.

|-- .venv
|-- Automation
|     `-- Deployments
|-- Builds
|-- Libraries
|-- Workspace
`-- Jenkinsfile

The remainder of this article discusses each step in the Pipeline.

Define local environment variables

This article’s example Pipeline uses the following local environment variables:

  • GITREPO: The path to your local Git repository.

  • GITREPOREMOTE: The URL to your remote Git repository.

  • GITHUBCREDID: The Jenkins credential ID that maps to your GitHub personal access token.

  • CURRENTRELEASE: The name of your Git repository’s deployment branch.

  • SCRIPTPATH: The path in your local Git repository that contains the required automation scripts for this article’s example.

  • NOTEBOOKPATH: The path in your local Git repository that contains the required notebook for this article’s example.

  • LIBRARYPATH: The path in your local Git repository that contains the required files for building this article’s example Python wheel.

  • BUILDPATH: The path in your local Git repository to contain all of the build artifacts that Jenkins generates.

  • OUTFILEPATH: The path in your local Git repository to contain the JSON result files that Jenkins generates from automated tests.

  • TESTRESULTPATH: The path in your local Git repository to contain the Junit test result summaries that Jenkins generates.

  • WORKSPACEPATH: The Databricks workspace path where your Databricks notebooks are stored for this article’s example.

  • DBFSPATH: The DBFS path from your Databricks workspace where your Python wheel and non-notebook code are stored for this article’s example.

  • VENVPATH: The path in your local Git repository to your Python virtual environment system files.

  • DBCLIPATH: The path on your local machine to your Databricks CLI installation.

Get the latest code changes

The Checkout stage downloads code from the designated branch to the execution agent by using a Jenkins plugin:

stage('Checkout') {
  withCredentials ([string(credentialsId: GITHUBCREDID, variable: 'GITHUB_TOKEN')]) {
    echo "Pulling branch '${CURRENTRELEASE}' from GitHub..."
    git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
  }
}

Develop unit tests

There are a few different options when deciding how to unit test your code. For library code developed outside of a Databricks notebook, the process is like traditional software development practices. You write a unit test using a testing framework, like the Python pytest module, and JUnit-formatted XML files store the test results.

The Databricks process differs in that the code being tested is Apache Spark code intended to run on a Spark cluster, often running locally or, in this case, on Databricks. To accommodate this requirement, you use Databricks Connect. Because Databricks Connect was configured earlier, no changes to the test code are required to run the tests on Databricks clusters. You installed Databricks Connect in a venv virtual environment. Once the venv environment is activated, the tests are run by using the Python tool pytest, to which you provide the locations of the tests and of the resulting output files.

Test library code using Databricks Connect

The Run Unit Tests stage uses pytest to test a library to make sure it works before packaging it into a Python wheel:

stage('Run Unit Tests') {
  try {
    sh """#!/bin/bash
          # Enable venv environment for tests
          source ${VENVPATH}/bin/activate
          # Python tests for libs
          python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml ${LIBRARYPATH}/python/dbxdemo/dbxdemo/test_*.py || true
          """
  } catch(err) {
    step([$class: 'JUnitResultArchiver', testResults: "${TESTRESULTPATH}/TEST-*.xml"])
    if (currentBuild.result == 'UNSTABLE')
      currentBuild.result = 'FAILURE'
    throw err
  }
}

To have Jenkins run this test, create two files named addcol.py and test_addcol.py, and add them to a folder structure named python/dbxdemo/dbxdemo within your local Git repo’s Libraries folder, visualized as follows (ellipses indicate omitted folders in the repo, for brevity):

|-- ...
|-- Libraries
|     `-- python
|           `-- dbxdemo
|                 `-- dbxdemo
|                       |-- addcol.py
|                       `-- test_addcol.py
|-- ...

The addcol.py file contains a library function that might be built into a Python wheel and then installed on a Databricks cluster. It is a simple function that adds a new column, populated by a literal, to an Apache Spark DataFrame. This file also contains a simple function that gets a running DatabricksSession from Databricks Connect. This DatabricksSession is similar to a SparkSession running on a Spark cluster:

# Filename: addcol.py
import pyspark.sql.functions as F
from databricks.connect import DatabricksSession

def get_spark():
  spark = DatabricksSession.builder.getOrCreate()
  return spark

def with_status(df):
  return df.withColumn("status", F.lit("checked"))

The test_addcol.py file contains tests to pass a mock DataFrame object to the with_status function, defined in addcol.py. The result is then compared to a DataFrame object containing the expected values. If the values match, which in this case they do, the test passes:

# Filename: test_addcol.py
import pytest
from dbxdemo.addcol import *

class TestAppendCol(object):

  def test_with_status(self):
    source_data = [
      ("paula", "white", "paula.white@example.com"),
      ("john", "baer", "john.baer@example.com")
    ]
    source_df = get_spark().createDataFrame(
      source_data,
      ["first_name", "last_name", "email"]
    )

    actual_df = with_status(source_df)

    expected_data = [
      ("paula", "white", "paula.white@example.com", "checked"),
      ("john", "baer", "john.baer@example.com", "checked")
    ]
    expected_df = get_spark().createDataFrame(
      expected_data,
      ["first_name", "last_name", "email", "status"]
    )

    assert(expected_df.collect() == actual_df.collect())
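
Once the library folder is complete (including the packaging files created in the next section), you can run this test locally from the activated virtual environment. The following is a minimal sketch that invokes pytest programmatically; it assumes that you run it from the root of your local Git repository:

# Run the library tests locally (sketch; run from the repository root,
# with the venv that includes databricks-connect activated).
import sys
import pytest

sys.exit(pytest.main(["Libraries/python/dbxdemo/dbxdemo", "-v"]))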

Package the library code

The Package stage packages the library code into a Python wheel:

stage('Package') {
  sh """#!/bin/bash
        # Enable venv environment for packaging
        source ${VENVPATH}/bin/activate

        # Package Python wheel
        cd ${LIBRARYPATH}/python/dbxdemo
        python3 setup.py sdist bdist_wheel
     """
}

To have Jenkins package this library code into a Python wheel, create two files named __init__.py and __main__.py in the same folder as the preceding two files. Also, create a file named setup.py in the python/dbxdemo folder, visualized as follows (ellipses indicate omitted folders, for brevity):

|-- ...
|-- Libraries
|     `-- python
|           `-- dbxdemo
|                 |-- dbxdemo
|                 |     |-- __init__.py
|                 |     |-- __main__.py
|                 |     |-- addcol.py
|                 |     `-- test_addcol.py
|                 `-- setup.py
|-- ...

The __init__.py file contains the library’s version number and author. Replace <my-author-name> with your name:

# Filename: __init__.py
__version__ = '0.0.1'
__author__ = '<my-author-name>'

import sys, os

sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))

The __main__.py file contains the library’s entry point:

# Filename: __main__.py
import sys, os

sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))

from addcol import *

def main():
  pass

if __name__ == "__main__":
  main()

The setup.py file contains additional settings for building the library into a Python wheel. Replace <my-url>, <my-author-name>@<my-organization>, and <my-package-description> with meaningful values:

# Filename: setup.py
from setuptools import setup, find_packages

import dbxdemo

setup(
  name = "dbxdemo",
  version = dbxdemo.__version__,
  author = dbxdemo.__author__,
  url = "https://<my-url>",
  author_email = "<my-author-name>@<my-organization>",
  description = "<my-package-description>",
  packages = find_packages(include = ["dbxdemo"]),
  entry_points={"group_1": "run=dbxdemo.__main__:main"},
  install_requires = ["setuptools"]
)

Add the notebook

To have Jenkins call this Python wheel from a notebook, create a file named dbxdemo-notebook.py and add it to your local Git repo’s Workspace folder, visualized as follows (ellipses indicate omitted folders in the repo, for brevity):

|-- ...
|-- Workspace
|     `-- dbxdemo-notebook.py
|-- ...

Add the following content to this notebook. This notebook creates a DataFrame object, passes it to the library’s with_status function, and prints the result:

# Databricks notebook source
from dbxdemo.addcol import with_status

df = (spark.createDataFrame(
  schema = ["first_name", "last_name", "email"],
  data = [
    ("paula", "white", "paula.white@example.com"),
    ("john", "baer", "john.baer@example.com")
  ]
))

new_df = with_status(df)

display(new_df)

# Output:
#
# +------------+-----------+-------------------------+---------+
# | first_name | last_name | email                   | status  |
# +============+===========+=========================+=========+
# | paula      | white     | paula.white@example.com | checked |
# +------------+-----------+-------------------------+---------+
# | john       | baer      | john.baer@example.com   | checked |
# +------------+-----------+-------------------------+---------+

Generate and store the deployment artifact

The Build Artifact stage assembles the artifacts to deploy: it stages the notebook code to be deployed to the workspace in a later stage, copies in the packaged Python wheel, and creates the folders for the test result summaries. This stage uses git diff to flag all new files that have been included in the most recent Git merge. This is only an example method, so the implementation in your Pipeline might differ, but the objective is to add all files intended for the current release.

stage('Build Artifact') {
  sh """mkdir -p ${BUILDPATH}/Workspace
        mkdir -p ${BUILDPATH}/Libraries/python
        mkdir -p ${BUILDPATH}/Validation/Output
        #Get modified files
        git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp -r '{}' ${BUILDPATH}

        # Get packaged libs
        find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/
      """
}

Deploy artifacts

The Deploy stage instructs Jenkins to copy the notebook into the local deployment folder, uses the Databricks CLI to upload the notebook and library from the local deployment folder into the Databricks workspace, and then installs the deployed library onto the Databricks cluster:

stage('Deploy') {
  sh """#!/bin/bash
        # Copy notebooks into build
        cp -R ${NOTEBOOKPATH} ${BUILDPATH}

        # Use Databricks CLI to deploy notebooks
        ${DBCLIPATH}/databricks workspace import-dir ${BUILDPATH}/Workspace ${WORKSPACEPATH} --overwrite

        # Use Databricks CLI to deploy Python wheel
        ${DBCLIPATH}/databricks fs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH} --overwrite
      """
  sh """#!/bin/bash
        # Enable venv environment for commands
        source ${VENVPATH}/bin/activate

        # Get space-delimited list of libraries
        LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##')

        echo \$LIBS

        # Script to uninstall, reboot if needed and install library
        python3 ${SCRIPTPATH}/installWhlLibrary.py \
                  --libs=\$LIBS \
                  --dbfspath=${DBFSPATH}
      """
}

Installing a new version of a library on a Databricks cluster requires that any previously installed version of the library be uninstalled first. The following Python script uses the Databricks SDK for Python to do this, as follows:

  1. Check if the library is installed.

  2. Uninstall the library.

  3. Restart the cluster if any uninstalls were performed.

    1. Wait until the cluster is running again before proceeding.

  4. Install the library.

To have Jenkins run this script, create a file named installWhlLibrary.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):

|-- ...
|-- Automation
|     `-- Deployments
|           `-- installWhlLibrary.py
|-- ...
# Filename: installWhlLibrary.py
#!/usr/bin/python3
import sys
import getopt
import time
import os

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

def main():
  libs = ''
  dbfspath = ''
  cluster_id = os.getenv('DATABRICKS_CLUSTER_ID')
  w = WorkspaceClient()

  try:
    opts, args = getopt.getopt(sys.argv[1:], 'hl:d:',
                              ['libs=', 'dbfspath='])
  except getopt.GetoptError:
    print(
         'installWhlLibrary.py -l <libs> -d <dbfspath>')
    sys.exit(2)

  for opt, arg in opts:
    if opt == '-h':
      print(
           'installWhlLibrary.py -l <libs> -d <dbfspath>')
      sys.exit()
    elif opt in ('-l', '--libs'):
      libs=arg
    elif opt in ('-d', '--dbfspath'):
      dbfspath=arg

  print('-l is ' + libs)
  print('-d is ' + dbfspath)

  libslist = libs.split()

  # Uninstall the library if it already exists on the cluster.
  i = 0
  for lib in libslist:
    dbfslib = dbfspath + "/" + lib
    print("'" + dbfslib + "' before: " + getLibStatus(w, dbfslib, cluster_id))

    if (getLibStatus(w, dbfslib, cluster_id) != 'not found'):
      print("'" + lib + "' exists. Uninstalling. Please wait...")
      i = i + 1
      values = [Library(whl = dbfslib)]
      w.libraries.uninstall(cluster_id = cluster_id, libraries = values)
      print("'" + dbfslib + "' after: " + getLibStatus(w, dbfslib, cluster_id))

  # Restart the cluster if the libraries were uninstalled.
  if i > 0:
    print("Restarting cluster '" + cluster_id + "'. This could take several minutes. Please wait...")
    w.clusters.restart_and_wait(cluster_id = cluster_id)

    p = 0
    waiting = True
    while waiting:
      time.sleep(30)
      resp = w.clusters.get(cluster_id = cluster_id)
      print("Current cluster state: " + resp.state.name)
      if resp.state.name in ['RUNNING', 'TERMINATED', 'ERROR'] or p >= 10:
        break
      p = p + 1

  # Install the libraries.
  for lib in libslist:
    dbfslib = dbfspath + "/" + lib
    print("Installing '" + dbfslib + "'. Please wait...")
    values = [Library(whl = dbfslib)]
    w.libraries.install(cluster_id = cluster_id, libraries = values)
    print("'" + dbfslib + "' after: " + getLibStatus(w, dbfslib, cluster_id))

def getLibStatus(workspace_client, dbfslib, cluster_id):
  resp = workspace_client.libraries.cluster_status(cluster_id = cluster_id)

  # Return the status of the matching library, if it is on the cluster.
  if resp.library_statuses:
    for status in resp.library_statuses:
      if status.library.whl == dbfslib:
        return status.status.name

  return 'not found'

if __name__ == '__main__':
  main()

Test notebook code by using another notebook

Once the artifact has been deployed, it is important to run integration tests to make sure that all of the code works together in the new environment. To do this, you can run a notebook containing asserts to test the deployment. In this case you are exercising the same code that you covered in the unit test, but now the notebook imports the addcol module from the dbxdemo Python wheel that was just installed on the cluster.

To automate this test and include it in your CI/CD Pipeline, instruct Jenkins to use the Databricks SDK for Python to run the notebook from the Jenkins server. This allows you to check whether the notebook run passed or failed using pytest. If the asserts in the notebook fail, this shows in the output returned by the SDK and subsequently in the JUnit test results.

stage('Run Integration Tests') {
  sh """#!/bin/bash
        # Enable venv environment for script
        source ${VENVPATH}/bin/activate

        python3 ${SCRIPTPATH}/executenotebook.py \
                --localpath=${NOTEBOOKPATH} \
                --workspacepath=${WORKSPACEPATH} \
                --outfilepath=${OUTFILEPATH}
      """
  sh """#!/bin/bash
        sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
        python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
      """
}

This stage calls two Python scripts. The first script runs the notebook as an anonymous (one-time) job. When the job completes, the JSON output of the run is saved to the path specified by the command-line arguments passed at invocation.

To have Jenkins run this script, create a file named executenotebook.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):

|-- ...
|-- Automation
|     `-- Deployments
|           |-- executenotebook.py
|           `-- installWhlLibrary.py
|-- ...
# Filename: executenotebook.py
#!/usr/bin/python3
import json
import os
import sys
import getopt
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import SubmitTask, NotebookTask

def main():
  cluster_id = os.getenv('DATABRICKS_CLUSTER_ID')
  localpath = ''
  workspacepath = ''
  outfilepath = ''
  w = WorkspaceClient()

  try:
    opts, args = getopt.getopt(sys.argv[1:], 'hl:w:o:',
                              ['localpath=', 'workspacepath=', 'outfilepath='])
  except getopt.GetoptError:
    print(
         'executenotebook.py -l <localpath> -w <workspacepath> -o <outfilepath>')
    sys.exit(2)

  for opt, arg in opts:
    if opt == '-h':
      print(
        'executenotebook.py -l <localpath> -w <workspacepath> -o <outfilepath>')
      sys.exit()
    elif opt in ('-l', '--localpath'):
      localpath = arg
    elif opt in ('-w', '--workspacepath'):
      workspacepath = arg
    elif opt in ('-o', '--outfilepath'):
      outfilepath = arg

  print('-l is ' + localpath)
  print('-w is ' + workspacepath)
  print('-o is ' + outfilepath)

  # Generate an array of notebook names and locations by traversing the local path.
  notebooks = []
  for path, subdirs, files in os.walk(localpath):
    for name in files:
      fullpath = path + '/' + name
      # Remove the local path to the repo but keep the workspace path.
      fullworkspacepath = workspacepath + path.replace(localpath, '')

      name, file_extension = os.path.splitext(fullpath)
      if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
        row = [fullpath, fullworkspacepath, 1]
        notebooks.append(row)

  # For each notebook found:
  for notebook in notebooks:
    nameonly = os.path.basename(notebook[0])
    workspacepath = notebook[1]

    name, file_extension = os.path.splitext(nameonly)

    # Remove the file extension.
    fullworkspacepath = workspacepath + '/' + name

    print("Running job for: '" + fullworkspacepath + "'. This could take several minutes. Please wait...")

    resp = w.jobs.submit_and_wait(
      run_name = name,
      tasks = [SubmitTask(
        task_key = name,
        existing_cluster_id = cluster_id,
        timeout_seconds = 3600,
        notebook_task = NotebookTask(
          notebook_path = fullworkspacepath,
          base_parameters = {}
        )
      )]
    )

    i = 0
    waiting = True
    while waiting:
      time.sleep(10)
      print("Current job state: " + resp.state.life_cycle_state.name)
      if resp.state.life_cycle_state.name in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
        break
      i = i + 1

    if outfilepath != '':
      file = open(outfilepath + '/' +  str(resp.run_id) + '.json', 'w')
      file.write(json.dumps(resp.as_dict()))
      file.close()

if __name__ == '__main__':
  main()

The second Python script defines the test_job_run function, which parses and evaluates the JSON to determine if the assert statements within the notebook passed or failed. An additional test, test_performance, catches tests that run longer than expected.

To have Jenkins run this script, create a file named evaluatenotebookruns.py and add it to your local Git repository’s Automation/Deployments folder, visualized as follows (ellipses indicate omitted folders, for brevity):

|-- ...
|-- Automation
|     `-- Deployments
|           |-- evaluatenotebookruns.py
|           |-- executenotebook.py
|           `-- installWhlLibrary.py
|-- ...
# Filename: evaluatenotebookruns.py
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

  test_output_path = '#ENV#'

  def test_performance(self):
    path = self.test_output_path
    statuses = []

    for filename in glob.glob(os.path.join(path, '*.json')):
      print('Evaluating: ' + filename)
      data = json.load(open(filename))
      duration = data['execution_duration']
      if duration > 100000:
        status = 'FAILED'
      else:
        status = 'SUCCESS'

      statuses.append(status)

    self.assertFalse('FAILED' in statuses)

  def test_job_run(self):
    path = self.test_output_path
    statuses = []

    for filename in glob.glob(os.path.join(path, '*.json')):
      print('Evaluating: ' + filename)
      data = json.load(open(filename))
      status = data['state']['result_state']
      statuses.append(status)

    self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
  unittest.main()

As seen earlier in the unit test stage, Jenkins uses pytest to run the tests and generate the result summaries.

Publish test results

The JSON results are archived and the test results are published to Jenkins by using the junit Jenkins plugin. This enables you to visualize reports and dashboards related to the status of the build process.

stage('Report Test Results') {
  sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
        touch ${TESTRESULTPATH}/TEST-*.xml
      """
  junit allowEmptyResults: true, testResults: '**/test-reports/*.xml'
}

Push your code to version control

You should now push the contents of your local Git repository to your remote Git repository. Before you push, first add the following entries to the .gitignore file, because you should probably not push local Python build files and Python caches to your remote repository. Typically you want to regenerate the latest builds locally and then copy those locally tested builds into your remote workspace:

.venv
*__pycache__/
*.pytest_cache
Builds/
*.egg-info
Libraries/python/dbxdemo/build/
Libraries/python/dbxdemo/dist/

Run your Jenkins Pipeline

You should now be ready to run your Jenkins Pipeline, if you have not done so already. To do this, from your Jenkins Dashboard:

  1. On the sidebar, click New Item.

  2. Enter a name for the pipeline, for example jenkins-databricks-demo.

  3. Click Pipeline and then click OK.

  4. In the Pipeline section, for Definition, select Pipeline script from SCM.

  5. For SCM, select Git.

  6. For Repositories > Repository URL, enter the URL to your remote Git repository, for example https://github.com/<your-username>/jenkins-databricks-demo.

  7. For Branches to build > Branch Specifier, enter */<release-branch-name>.

  8. Click Save.

  9. Make sure that your target cluster is running in your Databricks workspace.

  10. On the sidebar, click Build Now.

  11. To see the results, click the latest Pipeline run (for example, #1) and then click Console Output.

At this point, the CI/CD pipeline has completed an integration and deployment cycle. By automating this process, you can ensure that your code has been tested and deployed by an efficient, consistent, and repeatable process.