Databricks SDK for Python
In this article, you learn how to automate operations in Databricks accounts, workspaces, and related resources with the Databricks SDK for Python. This article supplements the Databricks SDK for Python documentation.
Note
This feature is in Beta and is okay to use in production.
Before you begin
You can use the Databricks SDK for Python from within a Databricks notebook or from your local development machine.
To use the Databricks SDK for Python from within a Databricks notebook, skip ahead to Use the Databricks SDK for Python from within a Databricks notebook.
To use the Databricks SDK for Python from your local development machine, complete the steps in this section.
Before you begin to use the Databricks SDK for Python, your development machine must have:
Databricks authentication configured.
Python 3.8 or higher installed. (Python 3.7 is also supported, but only through June 2023.) For automating Databricks compute resources, Databricks recommends that the major and minor versions of Python installed on your development machine match the version installed on your target Databricks compute resource. This article's examples rely on automating clusters with Databricks Runtime 13.0, which has Python 3.10 installed. For the correct version, see Databricks Runtime release notes versions and compatibility for your cluster's Databricks Runtime version.
Databricks recommends that you create and activate a Python virtual environment for each Python code project that you use with the Databricks SDK for Python. Python virtual environments help to make sure that your code project is using compatible versions of Python and Python packages (in this case, the Databricks SDK for Python package). This article uses venv for Python virtual environments. To create a Python virtual environment with venv:

From your terminal set to the root directory of your Python code project, instruct venv to use Python 3.10 for the virtual environment, and then create the virtual environment's supporting files in a hidden directory named .venv within the root directory of your Python code project, by running the following command:

# Linux and macOS
python3.10 -m venv ./.venv

# Windows
python3.10 -m venv .\.venv

Use venv to activate the virtual environment. See the venv documentation for the correct command to use, based on your operating system and terminal type. For example, on macOS running zsh:

source ./.venv/bin/activate

You will know that your virtual environment is activated when the virtual environment's name (for example, .venv) displays in parentheses just before your terminal prompt.

To deactivate the virtual environment at any time, run the command deactivate. You will know that your virtual environment is deactivated when the virtual environment's name no longer displays in parentheses just before your terminal prompt.
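With the virtual environment active, you can confirm that the interpreter matches the Python version on your target cluster, which the note above recommends. This is a minimal sketch; the expected version of (3, 10) is taken from Databricks Runtime 13.0, which ships Python 3.10:

```python
import sys

def matches_runtime(expected=(3, 10)):
    """Return True if the active interpreter's major.minor version matches
    the Python version of the target Databricks Runtime (13.0 ships 3.10)."""
    return sys.version_info[:2] == expected

print(matches_runtime())
```

If this prints False, recreate the virtual environment with the matching `python3.X -m venv` command.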
Get started with the Databricks SDK for Python
This section describes how to get started with the Databricks SDK for Python from your local development machine. To use the Databricks SDK for Python from within a Databricks notebook, skip ahead to Use the Databricks SDK for Python from within a Databricks notebook.
On your development machine with Databricks authentication configured, Python already installed, and your Python virtual environment already activated, use pip to install the databricks-sdk package from the Python Package Index (PyPI), as follows:

pip3 install databricks-sdk

To install a specific version of the databricks-sdk package (especially while the Databricks SDK for Python is in a Beta state), see the package's Release history. For example, to install version 0.1.6:

pip3 install databricks-sdk==0.1.6
Tip

To upgrade an existing installation of the Databricks SDK for Python package to the latest version, run pip as follows:

pip3 install --upgrade databricks-sdk

To show the Databricks SDK for Python package's current Version and other details, run pip as follows:

pip3 show databricks-sdk
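If you prefer to check the installed version from Python code rather than from pip, the standard library's importlib.metadata can report it. A minimal sketch; the package name databricks-sdk comes from the install step above:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package: str):
    """Return the installed version string for the named package,
    or None if it is not installed in the active environment."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("databricks-sdk"))
```

This prints the version string if the package is installed, and None otherwise.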
In your Python virtual environment, create a Python code file that imports the Databricks SDK for Python. The following example, in a file named main.py, simply lists all the clusters in your Databricks workspace:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for c in w.clusters.list():
  print(c.cluster_name)

Run your Python code file, assuming a file named main.py, by running the python command:

python3.10 main.py
Note
By not setting any arguments in the preceding call to w = WorkspaceClient(), the Databricks SDK for Python uses its default process for trying to perform Databricks authentication. To override this default behavior, see the following authentication section.
Authenticate the Databricks SDK for Python with your Databricks account or workspace
This section describes how to authenticate the Databricks SDK for Python from your local development machine over to your Databricks account or workspace. To authenticate the Databricks SDK for Python from within a Databricks notebook, skip ahead to Use the Databricks SDK for Python from within a Databricks notebook.
The Databricks SDK for Python implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach helps make setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes. For more information, including more complete code examples in Python, see Databricks client unified authentication.
Some of the available coding patterns to initialize Databricks authentication with the Databricks SDK for Python include:
Use Databricks default authentication by doing one of the following:
Create or identify a custom Databricks configuration profile with the required fields for the target Databricks authentication type. Then set the DATABRICKS_CONFIG_PROFILE environment variable to the name of the custom configuration profile.

Set the required environment variables for the target Databricks authentication type.

Then instantiate, for example, a WorkspaceClient object as follows:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# ...
Hard-coding the required fields is supported but not recommended, as it risks exposing sensitive information in your code, such as Databricks personal access tokens. The following example hard-codes Databricks host and access token values for Databricks token authentication:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
  host  = 'https://...',
  token = '...'
)
# ...
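A middle ground between hard-coding and relying entirely on the SDK's defaults is to read the values from environment variables yourself. Note that the SDK already reads DATABRICKS_HOST and DATABRICKS_TOKEN on its own, so this helper is purely illustrative:

```python
import os

def auth_kwargs() -> dict:
    """Collect host and token settings from the standard Databricks
    environment variables, if present. The result can be passed as
    WorkspaceClient(**auth_kwargs()); an empty dict lets the SDK fall
    back to its default authentication process."""
    kwargs = {}
    if os.environ.get("DATABRICKS_HOST"):
        kwargs["host"] = os.environ["DATABRICKS_HOST"]
    if os.environ.get("DATABRICKS_TOKEN"):
        kwargs["token"] = os.environ["DATABRICKS_TOKEN"]
    return kwargs
```

Keeping secrets in environment variables (or a secret store) rather than in source code avoids checking tokens into version control.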
Use the Databricks SDK for Python from within a Databricks notebook
Step 1: Install the Databricks SDK for Python
Databricks Python notebooks can use the Databricks SDK for Python just like any other Python library. For example, to make the Databricks SDK for Python available to your notebook, you can run the %pip magic command from a notebook cell as follows:
%pip install databricks-sdk --upgrade
Note
If an error message appears stating that the package could not be found, remove --upgrade and run the cell again.
After you run the %pip magic command, restart Python. To do this, run the following command from a notebook cell immediately after the cell with the %pip magic command:
dbutils.library.restartPython()
Step 2: Set up authentication
By default, the Databricks SDK for Python uses default Databricks notebook authentication. There are no special requirements or code to use Databricks SDK for Python with default Databricks notebook authentication. If you want to use default Databricks notebook authentication, skip ahead to Step 3.
By default, Databricks notebook authentication relies on a Databricks personal access token that Databricks generates on your behalf in the background when the notebook runs and deletes when the notebook stops running. This Databricks personal access token is associated with the signed-in Databricks user account, which means that the Databricks SDK for Python has only whatever access permissions that the signed-in user account has.
You can use Databricks authentication types other than Databricks personal access token authentication if needed, although this requires special setup and coding. For more information, including more complete code examples in Python, see Databricks client unified authentication.
Note
If you choose to set up non-default Databricks notebook authentication, your notebook will not have access to Databricks configuration profiles through a .databrickscfg file. Instead, Databricks recommends that you use one or more of the following approaches to set up non-default Databricks authentication from your notebook:
In your Databricks cluster, set the required environment variables for your target authentication type. For the names of the specific environment variables, see the documentation for your target authentication type in Databricks client unified authentication. To set environment variables on your cluster, see Environment variables.
Use direct configuration to retrieve the required authentication settings for your authentication type from Databricks widgets in your notebook. You must manually enter the required authentication settings into these widgets before you run your notebook. For the names of the specific settings, see the Python code examples for your target authentication type in Databricks client unified authentication, being sure to use dbutils.widgets.get(...) calls instead of those examples' retrieve...() calls. To learn how to work with widgets programmatically, see Databricks widgets.

Use direct configuration to retrieve the required authentication settings for your authentication type from a configuration store such as AWS Systems Manager Parameter Store. For the names of the specific settings, see the Python code examples for your target authentication type in Databricks client unified authentication.
Step 3: Run your code
In your notebook cells, create Python code that imports and then calls the Databricks SDK for Python. The following example simply lists all the clusters in your Databricks workspace:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for c in w.clusters.list():
print(c.cluster_name)
When you run this cell, a list of the names of all of the available clusters in your Databricks workspace appears.
Use Databricks Utilities
You can call Databricks Utilities from Databricks SDK for Python code running on your local development machine or from within a Databricks notebook.
From your local development machine, Databricks Utilities has access only to the dbutils.fs, dbutils.secrets, and dbutils.widgets command groups.

From a Databricks notebook that is attached to a Databricks cluster, Databricks Utilities has access to all of the available Databricks Utilities command groups, not just dbutils.fs, dbutils.secrets, and dbutils.widgets. Additionally, the dbutils.notebook command group is limited to two levels of commands only, for example dbutils.notebook.run or dbutils.notebook.exit.
To call Databricks Utilities from either your local development machine or a Databricks notebook, use dbutils within WorkspaceClient. This code example calls dbutils within WorkspaceClient to list the paths of all of the objects in the DBFS root of the workspace.
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
d = w.dbutils.fs.ls('/')
for f in d:
print(f.path)
Alternatively, you can call dbutils directly. However, you are limited to using default Databricks authentication only. This code example calls dbutils directly to list all of the objects in the DBFS root of the workspace.
from databricks.sdk.runtime import *
d = dbutils.fs.ls('/')
for f in d:
print(f.path)
See also Interaction with dbutils.
Code examples
The following code examples demonstrate how to use the Databricks SDK for Python to create and delete clusters, run jobs, and list account-level groups. These code examples use the Databricks SDK for Python’s default Databricks authentication process. For details about default notebook authentication, see Use the Databricks SDK for Python from within a Databricks notebook. For details about default authentication outside of notebooks, see Authenticate the Databricks SDK for Python with your Databricks account or workspace.
For additional code examples, see the examples folder in the Databricks SDK for Python repository in GitHub. See also the Databricks Workspace APIs reference and the Databricks Account APIs reference.
Create a cluster
This code example creates a cluster with the specified Databricks Runtime version and cluster node type. This cluster has one worker, and the cluster will automatically terminate after 15 minutes of idle time.
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
print("Attempting to create cluster. Please wait...")
c = w.clusters.create_and_wait(
cluster_name = 'my-cluster',
spark_version = '12.2.x-scala2.12',
node_type_id = 'i3.xlarge',
autotermination_minutes = 15,
num_workers = 1
)
print(f"The cluster is now ready at " \
f"{w.config.host}#setting/clusters/{c.cluster_id}/configuration\n")
Permanently delete a cluster
This code example permanently deletes the cluster with the specified cluster ID from the workspace.
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
c_id = input('ID of cluster to delete (for example, 1234-567890-ab123cd4): ')
w.clusters.permanent_delete(cluster_id = c_id)
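The delete example asks you to type a cluster ID. If you only know the cluster's name, a small lookup helper bridges the gap. This is a sketch that works with the WorkspaceClient from the examples above, or with any object whose clusters.list() yields items having cluster_name and cluster_id attributes:

```python
def find_cluster_id(client, name: str):
    """Return the cluster_id of the first cluster whose name matches,
    or None if no cluster with that name exists."""
    for c in client.clusters.list():
        if c.cluster_name == name:
            return c.cluster_id
    return None

# Example wiring with the SDK (requires a configured WorkspaceClient `w`):
#   c_id = find_cluster_id(w, 'my-cluster')
#   if c_id is not None:
#       w.clusters.permanent_delete(cluster_id = c_id)
```

Note that cluster names are not required to be unique, so this helper returns only the first match.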
Create a job
This code example creates a Databricks job that runs the specified notebook on the specified cluster. As the code runs, it gets the existing notebook’s path, the existing cluster ID, and related job settings from the user at the terminal.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source
w = WorkspaceClient()
job_name = input("Some short name for the job (for example, my-job): ")
description = input("Some short description for the job (for example, My job): ")
existing_cluster_id = input("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4): ")
notebook_path = input("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook): ")
task_key = input("Some key to apply to the job's tasks (for example, my-key): ")
print("Attempting to create the job. Please wait...\n")
j = w.jobs.create(
name = job_name,
tasks = [
Task(
description = description,
existing_cluster_id = existing_cluster_id,
notebook_task = NotebookTask(
base_parameters = dict(),
notebook_path = notebook_path,
source = Source("WORKSPACE")
),
task_key = task_key
)
]
)
print(f"View the job at {w.config.host}/#job/{j.job_id}\n")
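After the job is created, you can trigger a run and wait for it to finish. The SDK's jobs.run_now returns a waiter whose result() blocks until the run completes. A minimal sketch, where client can be the WorkspaceClient from the example above:

```python
def trigger_and_wait(client, job_id: int):
    """Start a run of the given job and block until it finishes,
    returning the completed run."""
    return client.jobs.run_now(job_id=job_id).result()

# Example wiring with the SDK (requires a configured WorkspaceClient `w`
# and the job `j` created above):
#   run = trigger_and_wait(w, j.job_id)
```

For long-running jobs, you may prefer to call run_now without result() and poll the run's state instead of blocking.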
List account-level groups
This code example lists the display names for all of the available groups within the Databricks account.
from databricks.sdk import AccountClient
a = AccountClient()
for g in a.groups.list():
print(g.display_name)