Databricks SDK for Python

In this article, you learn how to automate operations in Databricks accounts, workspaces, and related resources with the Databricks SDK for Python. This article supplements the Databricks SDK for Python documentation on Read The Docs and the code examples in the Databricks SDK for Python repository in GitHub.

Note

This feature is in Beta and is okay to use in production.

During the Beta period, Databricks recommends that you pin a dependency on the specific minor version of the Databricks SDK for Python that your code depends on. For example, you can pin dependencies in files such as requirements.txt for venv, or pyproject.toml and poetry.lock for Poetry. For more information about pinning dependencies, see Virtual Environments and Packages for venv, or Installing dependencies for Poetry.
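For example, a pinned dependency in a requirements.txt file for venv might look like the following (the version number is illustrative; substitute the version your code depends on):

    # requirements.txt
    # Pin the exact Beta version your code was tested against.
    databricks-sdk==0.1.6

For Poetry, the equivalent constraint in pyproject.toml is databricks-sdk = "~0.1.6", which allows patch releases but blocks a new minor version.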

Before you begin

You can use the Databricks SDK for Python from within a Databricks notebook or from your local development machine.

Before you begin to use the Databricks SDK for Python, your development machine must have:

  • Databricks authentication configured.

  • Python 3.8 or higher installed. For automating Databricks compute resources, Databricks recommends that you have the major and minor versions of Python installed that match the one that is installed on your target Databricks compute resource. This article’s examples rely on automating clusters with Databricks Runtime 13.0, which has Python 3.10 installed. For the correct version, see Databricks Runtime release notes versions and compatibility for your cluster’s Databricks Runtime version.

  • Databricks recommends that you create and activate a Python virtual environment for each Python code project that you use with the Databricks SDK for Python. Python virtual environments help to make sure that your code project is using compatible versions of Python and Python packages (in this case, the Databricks SDK for Python package). This article explains how to use venv or Poetry for Python virtual environments.
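As a quick check of the version-matching recommendation above, you can compare your local interpreter against the Python version that ships with your target Databricks Runtime. This is a minimal sketch; the target value of 3.10 assumes Databricks Runtime 13.0 and should be adjusted for your cluster's runtime:

```python
import sys

# Python version installed on the target cluster's Databricks Runtime.
# Databricks Runtime 13.0 ships Python 3.10; adjust for your runtime.
TARGET = (3, 10)

local = sys.version_info[:2]
if local != TARGET:
    print(f"Warning: local Python {local[0]}.{local[1]} does not match "
          f"the cluster's Python {TARGET[0]}.{TARGET[1]}")
else:
    print("Local Python matches the cluster's Python version.")
```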

Create a Python virtual environment with venv

  1. From your terminal set to the root directory of your Python code project, run the following command. This command instructs venv to use Python 3.10 for the virtual environment, and then creates the virtual environment’s supporting files in a hidden directory named .venv within the root directory of your Python code project.

    # Linux and macOS
    python3.10 -m venv ./.venv
    
    # Windows
    python3.10 -m venv .\.venv
    
  2. Use venv to activate the virtual environment. See the venv documentation for the correct command to use, based on your operating system and terminal type. For example, on macOS running zsh:

    source ./.venv/bin/activate
    

    You will know that your virtual environment is activated when the virtual environment’s name (for example, .venv) displays in parentheses just before your terminal prompt.

    To deactivate the virtual environment at any time, run the command deactivate.

    You will know that your virtual environment is deactivated when the virtual environment’s name no longer displays in parentheses just before your terminal prompt.

Skip ahead to Get started with the Databricks SDK for Python.

Create a virtual environment with Poetry

  1. Install Poetry, if you have not done so already.

  2. From your terminal set to the root directory of your Python code project, run the following command to instruct poetry to initialize your Python code project for Poetry.

    poetry init
    
  3. Poetry displays several prompts for you to complete. None of these prompts are specific to the Databricks SDK for Python. For information about these prompts, see init.

  4. After you complete the prompts, Poetry adds a pyproject.toml file to your Python project. For information about the pyproject.toml file, see The pyproject.toml file.

  5. With your terminal still set to the root directory of your Python code project, run the following command. This command instructs poetry to read the pyproject.toml file, install and resolve dependencies, create a poetry.lock file to lock the dependencies, and finally create a virtual environment.

    poetry install
    
  6. From your terminal set to the root directory of your Python code project, run the following command to instruct poetry to activate the virtual environment and enter the shell.

    poetry shell
    

    You will know that your virtual environment is activated and the shell is entered when the virtual environment’s name displays in parentheses just before your terminal prompt.

    To deactivate the virtual environment and exit the shell at any time, run the command exit.

    You will know that you have exited the shell when the virtual environment’s name no longer displays in parentheses just before your terminal prompt.

    For more information about creating and managing Poetry virtual environments, see Managing environments.

Get started with the Databricks SDK for Python

This section describes how to get started with the Databricks SDK for Python from your local development machine. To use the Databricks SDK for Python from within a Databricks notebook, skip ahead to Use the Databricks SDK for Python from a Databricks notebook.

  1. On your development machine with Databricks authentication configured, Python already installed, and your Python virtual environment already activated, install the databricks-sdk package (and its dependencies) from the Python Package Index (PyPI), as follows:

    Use pip (with venv) or poetry (with Poetry) to install the databricks-sdk package. (On some systems, you might need to replace pip3 with pip, here and throughout.)

    # venv
    pip3 install databricks-sdk

    # Poetry
    poetry add databricks-sdk
    

    To install a specific version of the databricks-sdk package while the Databricks SDK for Python is in Beta, see the package’s Release history. For example, to install version 0.1.6:

    # venv
    pip3 install databricks-sdk==0.1.6

    # Poetry
    poetry add databricks-sdk==0.1.6
    

    Tip

    To upgrade an existing installation of the Databricks SDK for Python package to the latest version, run the following command:

    # venv
    pip3 install --upgrade databricks-sdk

    # Poetry
    poetry add databricks-sdk@latest
    

    To show the Databricks SDK for Python package’s current version and other details, run the following command:

    # venv
    pip3 show databricks-sdk

    # Poetry
    poetry show databricks-sdk
    
  2. In your Python virtual environment, create a Python code file that imports the Databricks SDK for Python. The following example, in a file named main.py, lists all the clusters in your Databricks workspace:

    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()
    
    for c in w.clusters.list():
      print(c.cluster_name)
    
  3. Run your Python code file, assuming a file named main.py, by running the python command.

    For venv:

    python3.10 main.py

    For Poetry, if you are in the virtual environment’s shell:

    python3.10 main.py

    For Poetry, if you are not in the virtual environment’s shell:

    poetry run python3.10 main.py
    

    Note

    By not setting any arguments in the preceding call to w = WorkspaceClient(), the Databricks SDK for Python uses its default process for trying to perform Databricks authentication. To override this default behavior, see the following authentication section.

Authenticate the Databricks SDK for Python with your Databricks account or workspace

This section describes how to authenticate the Databricks SDK for Python from your local development machine over to your Databricks account or workspace. To authenticate the Databricks SDK for Python from within a Databricks notebook, skip ahead to Use the Databricks SDK for Python from a Databricks notebook.

The Databricks SDK for Python implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach helps make setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes. For more information, including more complete code examples in Python, see Databricks client unified authentication.

Some of the available coding patterns to initialize Databricks authentication with the Databricks SDK for Python include:

  • Use Databricks default authentication by doing one of the following:

    • Create or identify a custom Databricks configuration profile with the required fields for the target Databricks authentication type. Then set the DATABRICKS_CONFIG_PROFILE environment variable to the name of the custom configuration profile.

    • Set the required environment variables for the target Databricks authentication type.

    Then instantiate, for example, a WorkspaceClient object with Databricks default authentication as follows:

    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()
    # ...
    
  • Hard-coding the required fields is supported but not recommended, as it risks exposing sensitive information in your code, such as Databricks personal access tokens. The following example hard-codes Databricks host and access token values for Databricks token authentication:

    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient(
      host  = 'https://...',
      token = '...'
    )
    # ...
    

See also Authentication in the Databricks SDK for Python documentation.

Use the Databricks SDK for Python from a Databricks notebook

You can call Databricks SDK for Python functionality from a Databricks notebook after you install the Databricks SDK for Python on the Databricks cluster that is attached to the notebook.

The Databricks SDK for Python uses default Databricks notebook authentication for the following cluster types:

  • Unity Catalog clusters with Shared access mode that are running Databricks Runtime 14.1 or above.

  • All other cluster types that are running Databricks Runtime 13.2 or above.

  • All other cluster types that are running Databricks Runtime 13.1 or below and that also have the databricks-sdk package version 0.1.6 or above installed on the cluster.

Default Databricks notebook authentication relies on a temporary Databricks personal access token that Databricks automatically generates in the background for its own use. Databricks deletes this temporary token after the notebook stops running.

Note

You must manually configure Databricks notebook authentication if either of the following conditions is true:

  • Your notebook is running on a cluster type that does not support default Databricks notebook authentication.

  • You want to call Databricks account-level operations with the Databricks SDK for Python from the notebook.

To manually configure Databricks authentication in notebooks, see Supported Databricks authentication types. See also Authentication in the Databricks SDK for Python documentation.

Databricks notebook authentication does not work with Databricks configuration profiles.

Step 1: Install the Databricks SDK for Python

Databricks Python notebooks can use the Databricks SDK for Python just like any other Python library. For example, to make the Databricks SDK for Python available to your notebook, you can run the %pip magic command from a notebook cell as follows:

%pip install databricks-sdk --upgrade

Note

The databricks-sdk package comes preinstalled on Databricks Runtime 13.2 and above. However, older Databricks Runtime versions have older versions of the databricks-sdk installed. Databricks recommends that you run the preceding %pip magic to upgrade the databricks-sdk package to the latest version.

If you are using Databricks Runtime 13.1 or below, an error message appears stating that the package could not be found. Remove --upgrade and run the preceding %pip magic again.

After you run the %pip magic command, restart Python. To do this, run the following command from a notebook cell immediately after the cell with the %pip magic command:

dbutils.library.restartPython()

Step 2: Run your code

In your notebook cells, create Python code that imports and then calls the Databricks SDK for Python. The following example simply lists all the clusters in your Databricks workspace:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for c in w.clusters.list():
  print(c.cluster_name)

When you run this cell, a list of the names of all of the available clusters in your Databricks workspace appears.

Use Databricks Utilities

You can call Databricks Utilities (dbutils) from Databricks SDK for Python code running on your local development machine or from within a Databricks notebook.

  • From your local development machine, Databricks Utilities has access only to the dbutils.fs, dbutils.secrets, dbutils.widgets, and dbutils.jobs command groups.

  • From a Databricks notebook that is attached to a Databricks cluster, Databricks Utilities has access to all of the available Databricks Utilities command groups, not just dbutils.fs, dbutils.secrets, dbutils.widgets, and dbutils.jobs. Additionally, the dbutils.notebook command group is limited to two levels of commands only, for example dbutils.notebook.run or dbutils.notebook.exit.

To call Databricks Utilities from either your local development machine or a Databricks notebook, use dbutils within WorkspaceClient. This code example calls dbutils within WorkspaceClient to list the paths of all of the objects in the DBFS root of the workspace.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
d = w.dbutils.fs.ls('/')

for f in d:
  print(f.path)

Alternatively, you can call dbutils directly. However, you are limited to default Databricks authentication. This code example calls dbutils directly to list all of the objects in the DBFS root of the workspace.

from databricks.sdk.runtime import *

d = dbutils.fs.ls('/')

for f in d:
  print(f.path)

You cannot use dbutils by itself or within WorkspaceClient to access Unity Catalog volumes. Instead, use files within WorkspaceClient. This code example calls files within WorkspaceClient to print the contents of the specified file in the specified volume.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

resp = w.files.download('/Volumes/main/default/my-volume/sales.csv')
print(str(resp.contents.read(), encoding='utf-8'))

See also Interaction with dbutils.

Code examples

The following code examples demonstrate how to use the Databricks SDK for Python to create and delete clusters, run jobs, and list account-level groups. These code examples use the Databricks SDK for Python’s default Databricks authentication process. For details about default notebook authentication, see Use the Databricks SDK for Python from a Databricks notebook. For details about default authentication outside of notebooks, see Authenticate the Databricks SDK for Python with your Databricks account or workspace.

For additional code examples, see the examples in the Databricks SDK for Python repository in GitHub.

Create a cluster

This code example creates a cluster with the specified Databricks Runtime version and cluster node type. The cluster has one worker and automatically terminates after 15 minutes of idle time.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

print("Attempting to create cluster. Please wait...")

c = w.clusters.create_and_wait(
  cluster_name            = 'my-cluster',
  spark_version           = '12.2.x-scala2.12',
  node_type_id            = 'i3.xlarge',
  autotermination_minutes = 15,
  num_workers             = 1
)

print(f"The cluster is now ready at "
      f"{w.config.host}#setting/clusters/{c.cluster_id}/configuration\n")

Permanently delete a cluster

This code example permanently deletes the cluster with the specified cluster ID from the workspace.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

c_id = input('ID of cluster to delete (for example, 1234-567890-ab123cd4): ')

w.clusters.permanent_delete(cluster_id = c_id)

Create a job

This code example creates a Databricks job that runs the specified notebook on the specified cluster. As the code runs, it gets the existing notebook’s path, the existing cluster ID, and related job settings from the user at the terminal.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source

w = WorkspaceClient()

job_name            = input("Some short name for the job (for example, my-job): ")
description         = input("Some short description for the job (for example, My job): ")
existing_cluster_id = input("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4): ")
notebook_path       = input("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook): ")
task_key            = input("Some key to apply to the job's tasks (for example, my-key): ")

print("Attempting to create the job. Please wait...\n")

j = w.jobs.create(
  name = job_name,
  tasks = [
    Task(
      description = description,
      existing_cluster_id = existing_cluster_id,
      notebook_task = NotebookTask(
        base_parameters = dict(),
        notebook_path = notebook_path,
        source = Source("WORKSPACE")
      ),
      task_key = task_key
    )
  ]
)

print(f"View the job at {w.config.host}/#job/{j.job_id}\n")

List account-level groups

This code example lists the display names for all of the available groups within the Databricks account.

from databricks.sdk import AccountClient

a = AccountClient()

for g in a.groups.list():
  print(g.display_name)