Software engineering best practices for notebooks
This article provides a hands-on walkthrough that demonstrates how to apply software engineering best practices to your Databricks notebooks, including version control, code sharing, testing, and optionally continuous integration and continuous delivery or deployment (CI/CD).
In this walkthrough, you will:
Add notebooks to Databricks Repos for version control.
Extract portions of code from one of the notebooks into a shareable module.
Test the shared code.
Run the notebooks from a Databricks job.
Optionally apply CI/CD to the shared code.
Requirements
To complete this walkthrough, you must have the following resources:
A remote repository with a Git provider that Databricks supports. This article’s walkthrough uses GitHub. It assumes that you have a GitHub repository named best-notebooks available. (You can give your repository a different name. If you do, replace best-notebooks with your repo’s name throughout this walkthrough.) Create a GitHub repo if you do not already have one.
Note
If you create a new repo, be sure to initialize the repository with at least one file, for example a README file.
A Databricks workspace. Create a workspace if you do not already have one.
A Databricks all-purpose cluster in the workspace. To run notebooks during the design phase, you attach the notebooks to a running all-purpose cluster. Later on, this walkthrough uses a Databricks job to automate running the notebooks on this cluster. (You can also run jobs on job clusters that exist only for the jobs’ lifetimes.) Create an all-purpose cluster if you do not already have one.
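If you prefer to create this cluster programmatically rather than through the UI, the following is a minimal sketch using the Databricks Clusters API 2.0 from Python. It is an illustration, not a required part of the walkthrough; the host, token, runtime version, and node type values are placeholders that you must replace with values that are valid for your workspace.

# A minimal sketch (not the walkthrough's required method) for creating an
# all-purpose cluster with the Databricks Clusters API 2.0 from Python.
# The host, token, runtime version, and node type below are placeholder
# assumptions; substitute values that are valid in your workspace.
import requests

DATABRICKS_HOST = "https://<your-workspace-instance-name>"  # assumption: your workspace URL
TOKEN = "<your-access-token>"  # assumption: a personal access token or service principal token

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "best-notebooks-cluster",  # hypothetical cluster name
        "spark_version": "<a-supported-runtime-version>",  # e.g. a current LTS Databricks Runtime
        "node_type_id": "<a-node-type-for-your-cloud>",
        "num_workers": 1,
        "autotermination_minutes": 60,  # stop the cluster after an hour of inactivity
    },
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])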
Step 1: Set up Databricks Repos
In this step, you connect your existing GitHub repo to Databricks Repos in your existing Databricks workspace.
To enable your workspace to connect to your GitHub repo, you must first provide your workspace with your GitHub credentials, if you have not done so already.
Step 1.1: Provide your GitHub credentials
Click your username at the top right of the workspace, and then click User Settings in the dropdown list.
On the User Settings page, click Linked accounts.
Under Git integration, for Git provider, select GitHub.
Click Personal access token.
For Git provider username or email, enter your GitHub username.
For Token, enter your GitHub personal access token (classic). This personal access token (classic) must have the repo and workflow permissions.
Click Save.
Step 1.2: Connect to your GitHub repo
On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click Repos.
In the Repos pane, click Add Repo.
In the Add Repo dialog:
Click Clone remote Git repo.
For Git repository URL, enter the GitHub Clone with HTTPS URL for your GitHub repo. This article assumes that your URL ends with best-notebooks.git, for example https://github.com/<your-GitHub-username>/best-notebooks.git.
In the drop-down list next to Git repository URL, select GitHub.
Leave Repo name set to the name of your repo, for example best-notebooks.
Click Create.
Step 2: Import and run the notebook
In this step, you import an existing external notebook into your repo. You could create your own notebooks for this walkthrough, but to speed things up we provide them for you here.
Step 2.1: Create a working branch in the repo
In this substep, you create a branch named eda in your repo. This branch enables you to work on files and code independently from your repo’s main branch, which is a software engineering best practice. (You can give your branch a different name.)
Note
In some repos, the main branch may be named master instead. If so, replace main with master throughout this walkthrough.
Tip
If you’re not familiar with working in Git branches, see Git Branches - Branches in a Nutshell on the Git website.
If the Repos pane is not showing, then on the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click Repos.
If the repo that you connected to in the previous step is not showing in the Repos pane, then select your workspace username, and select the name of the repo that you connected to in the previous step.
Click the drop-down arrow next to your repo’s name, and then click Git.
In the best-notebooks dialog, click the + (Create branch) button.
Note
If your repo has a name other than best-notebooks, this dialog’s title will be different, here and throughout this walkthrough.
Enter eda, and then press Enter.
Close this dialog.
Step 2.2: Import the notebook into the repo
In this substep, you import an existing notebook from another repo into your repo. This notebook does the following:
Copies a CSV file from the owid/covid-19-data GitHub repository onto a cluster in your workspace. This CSV file contains public data about COVID-19 hospitalizations and intensive care metrics from around the world.
Filters the data to contain metrics from only the United States.
Displays a plot of the data.
Converts the pandas DataFrame to a Pandas API on Spark DataFrame.
Performs data cleansing on the Pandas API on Spark DataFrame.
Writes the Pandas API on Spark DataFrame as a Delta table in your workspace.
Displays the Delta table’s contents.
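Putting that flow into code, the following is a condensed sketch (not the notebook’s literal contents) of what covid_eda_raw does. The CSV path, column names, and output table name here are assumptions for illustration; the imported notebook is the authoritative version.

# A condensed sketch of the notebook's flow, for orientation only.
# The CSV path, column names, and output table name are assumptions,
# not the notebook's literal contents. Run this in a Databricks notebook,
# where spark and display are predefined.
import pandas as pd
import pyspark.pandas as ps

# Read the raw hospitalizations data from the owid/covid-19-data repo.
# (Assumed path; check the repo for the current file location.)
url = ("https://raw.githubusercontent.com/owid/covid-19-data/"
       "master/public/data/hospitalizations/covid-hospitalizations.csv")
pdf = pd.read_csv(url)

# Keep only United States metrics.
pdf = pdf[pdf["entity"] == "United States"]  # assumed column name

# Convert to a Pandas API on Spark DataFrame so later steps scale on the cluster.
psdf = ps.from_pandas(pdf)

# Example cleansing step: drop rows that are missing a metric value.
psdf = psdf.dropna(subset=["value"])  # assumed column name

# Write the result as a Delta table in the workspace, then read it back.
psdf.to_table("covid_usa_hospitalizations", mode="overwrite")  # hypothetical table name
display(spark.table("covid_usa_hospitalizations"))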
While you could create your own notebook in your repo, importing an existing notebook instead helps to speed up this walkthrough. To create a notebook in this branch or move an existing notebook into this branch instead of importing one, see Workspace files basic usage.
In the Repos pane for your repo, click the drop-down arrow next to your repo’s name, and then click Create > Folder.
In the New Folder Name dialog, enter notebooks, and then click Create Folder.
In the Repos pane, click the name of your repo, click the drop-down arrow next to the notebooks folder, and then click Import.
In the Import Notebooks dialog:
For Import from, select URL.
Enter the URL to the raw contents of the covid_eda_raw notebook in the databricks/notebook-best-practices repo in GitHub. To get this URL:
Go to https://github.com/databricks/notebook-best-practices.
Click the notebooks folder.
Click the covid_eda_raw.py file.
Click Raw.
Copy the full URL from your web browser’s address bar into the Import Notebooks dialog.
Note
The Import Notebooks dialog works with Git URLs for public repositories only.
Click Import.
Step 2.3: Run the notebook
If the notebook is not already showing, in the Repos pane for your repo, double-click the covid_eda_raw notebook inside of the notebooks folder to open it.
In the notebook, in the drop-down list next to File, select the cluster to attach this notebook to. For instructions on creating a cluster, see Create a cluster.
Click Run All.
If prompted, click Attach & Run or Start, Attach & Run.
Wait while the notebook runs.
After the notebook finishes running, in the notebook you should see a plot of the data as well as over 600 rows of raw data in the Delta table. If the cluster was not already running when you started running this notebook, it could take several minutes for the cluster to start up before displaying the results.
Step 2.4: Check in and merge the notebook
In this substep, you save your work so far to your GitHub repo. You then merge the notebook from your working branch into your repo’s main branch.
In the Repos pane for your repo, click the eda branch.
In the best-notebooks dialog, on the Changes tab, make sure the notebooks/covid_eda_raw.py file is selected.
For Summary (required), enter Added raw notebook.
For Description (optional), enter This is the first version of the notebook.
Click Commit & Push.
Click History, or click the Create a pull request on your git provider link in the pop-up.
In GitHub, click the Pull requests tab, create the pull request, and then merge the pull request into the main branch.
Back in your Databricks workspace, close the best-notebooks dialog if it is still showing.
Step 5: Create a job to run the notebooks
In previous steps, you tested your shared code manually and ran your notebooks manually. In this step, you use a Databricks job to test your shared code and run your notebooks automatically, either on-demand or on a regular schedule.
Step 5.1: Create a job task to run the testing notebook
On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click Workflows.
On the Jobs tab, click Create Job.
For Add a name for your job (which is next to the Runs and Tasks tabs), enter covid_report.
For Task name, enter run_notebook_tests.
For Type, select Notebook.
For Source, select Git.
Click Add a git reference.
In the Git information dialog:
For Git repository URL, enter the GitHub Clone with HTTPS URL for your GitHub repo. This article assumes that your URL ends with best-notebooks.git, for example https://github.com/<your-GitHub-username>/best-notebooks.git.
For Git provider, select GitHub.
For Git reference (branch / tag / commit), enter main.
Next to Git reference (branch / tag / commit), select branch.
Click Confirm.
For Path, enter notebooks/run_unit_tests. Do not add the .py file extension.
For Cluster, select the cluster from the previous step.
Click Create.
Note
In this scenario, Databricks does not recommend that you use the schedule button in the notebook as described in Create and manage scheduled notebook jobs to schedule a job to run this notebook periodically. This is because the schedule button creates a job by using the latest working copy of the notebook in the workspace repo. Instead, Databricks recommends that you follow the preceding instructions to create a job that uses the latest committed version of the notebook in the repo.
Step 5.2: Create a job task to run the main notebook
Click the + Add task icon. In the pop-up menu that appears, select Notebook.
For Task name, enter run_main_notebook.
For Type, select Notebook.
For Path, enter notebooks/covid_eda_modular. Do not add the .py file extension.
For Cluster, select the cluster from the previous step.
Click Create task.
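If you prefer to define the same two-task job programmatically instead of through the UI, the following is a minimal sketch using the Jobs API 2.1 from Python. The host, token, and cluster ID are placeholders, and the payload mirrors the UI settings above.

# A minimal sketch of creating the same two-task job with the Jobs API 2.1.
# Host, token, cluster ID, and repo URL are placeholders you must replace.
import requests

DATABRICKS_HOST = "https://<your-workspace-instance-name>"
TOKEN = "<your-access-token>"

job_spec = {
    "name": "covid_report",
    "git_source": {
        "git_url": "https://github.com/<your-GitHub-username>/best-notebooks.git",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_notebook_tests",
            "notebook_task": {"notebook_path": "notebooks/run_unit_tests", "source": "GIT"},
            "existing_cluster_id": "<your-cluster-id>",
        },
        {
            "task_key": "run_main_notebook",
            # Run the main notebook only after the tests task completes.
            "depends_on": [{"task_key": "run_notebook_tests"}],
            "notebook_task": {"notebook_path": "notebooks/covid_eda_modular", "source": "GIT"},
            "existing_cluster_id": "<your-cluster-id>",
        },
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])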
Step 5.3: Run the job
Click Run now.
In the pop-up, click View run.
Note
If the pop-up disappears too quickly, then do the following:
On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click Workflows.
On the Job runs tab, click the Start time value for the latest job with covid_report in the Jobs column.
To see the job results, click on the run_notebook_tests tile, the run_main_notebook tile, or both. The results on each tile are the same as if you ran the notebooks yourself, one by one.
Note
This job ran on-demand. To set up this job to run on a regular basis, see Add a job schedule.
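To trigger the same job from code rather than the Run now button, a hedged sketch against the Jobs API 2.1 might look like the following; the job ID is whatever the create call above (or the jobs UI) reported, and the host and token placeholders are as in the earlier sketches.

# A sketch of triggering a job run and polling until it finishes,
# using the Jobs API 2.1. Placeholders as in the earlier sketches.
import time
import requests

DATABRICKS_HOST = "https://<your-workspace-instance-name>"
TOKEN = "<your-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Start the job (use the job_id from the create call or the jobs UI).
run = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers=HEADERS,
    json={"job_id": 123},  # placeholder job ID
)
run.raise_for_status()
run_id = run.json()["run_id"]

# Poll the run until it reaches a terminal state.
while True:
    state = requests.get(
        f"{DATABRICKS_HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run_id},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Run finished:", state.get("result_state"))
        break
    time.sleep(20)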
(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code changes
In the previous step, you used a job to automatically test your shared code and run your notebooks at a point in time or on a recurring basis. However, you may prefer to trigger tests automatically when changes are merged into your GitHub repo. You can perform this automation by using a CI/CD platform such as GitHub Actions.
Step 6.1: Set up GitHub access to your workspace
In this substep, you set up a GitHub Actions workflow that runs jobs in the workspace whenever changes are merged into your repository. You do this by giving GitHub a unique Databricks token for access.
For security reasons, Databricks discourages you from giving your Databricks workspace user’s personal access token to GitHub. Instead, Databricks recommends that you give GitHub a Databricks access token that is associated with a Databricks service principal. For instructions, see the AWS section of the Run Databricks Notebook GitHub Action page in the GitHub Actions Marketplace.
Important
Notebooks are run with all of the workspace permissions of the identity that is associated with the token, so Databricks recommends using a service principal. If you really want to give your Databricks workspace user’s personal access token to GitHub for personal exploration purposes only, and you understand that for security reasons Databricks discourages this practice, see the instructions to create your workspace user’s personal access token.
Step 6.2: Add the GitHub Actions workflow
In this substep, you add a GitHub Actions workflow to run the run_unit_tests notebook whenever there is a pull request to the repo.
This substep stores the GitHub Actions workflow in a file that is stored within multiple folder levels in your GitHub repo. GitHub Actions requires a specific nested folder hierarchy to exist in your repo in order to work properly. To complete this step, you must use the website for your GitHub repo, because the Databricks Repos user interface does not support creating nested folder hierarchies.
In the website for your GitHub repo, click the Code tab.
In the Switch branches or tags drop-down list, select main, if it is not already selected.
If the Switch branches or tags drop-down list does not show the Find or create a branch box, click main again.
In the Find or create a branch box, enter adding_github_actions.
Click Create branch: adding_github_actions from ‘main’.
Click Add file > Create new file.
For Name your file, enter .github/workflows/databricks_pull_request_tests.yml.
In the editor window, enter the following code. This code declares the pull_request hook and uses the Run Databricks Notebook GitHub Action to run the run_unit_tests notebook.
In the following code, replace:
<your-workspace-instance-name> with your Databricks workspace instance name.
<your-access-token> with the token that you generated earlier. (For better security, consider storing the token as a GitHub encrypted secret and referencing it as ${{ secrets.DATABRICKS_TOKEN }} rather than hard-coding it in the workflow file.)
<your-cluster-id> with your target cluster ID.
name: Run pre-merge Databricks tests

on:
  pull_request:

env:
  # Replace this value with your workspace instance name.
  DATABRICKS_HOST: https://<your-workspace-instance-name>

jobs:
  unit-test-notebook:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Run test notebook
        uses: databricks/run-notebook@main
        with:
          databricks-token: <your-access-token>
          local-notebook-path: notebooks/run_unit_tests.py
          existing-cluster-id: <your-cluster-id>
          git-commit: "${{ github.event.pull_request.head.sha }}"
          # Grant all users view permission on the notebook's results, so that they can
          # see the result of the notebook, if they have related access permissions.
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]
          run-name: "EDA transforms helper module unit tests"
Select Commit directly to the adding_github_actions branch.
Click Commit changes.
On the Code tab, click Compare & pull request, and then create the pull request.
On the pull request page, wait for the icon next to Run pre-merge Databricks tests / unit-test-notebook (pull_request) to display a green check mark. (It may take a few moments for the icon to appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing, click Show all checks.
If the green check mark appears, merge the pull request into the main branch.