Install Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.0 and above.

This article describes how to install Databricks Connect for Python. See What is Databricks Connect?. For the Scala version of this article, see Install Databricks Connect for Scala.

Requirements

  • Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.

  • You must install Python 3 on your development machine, and the minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster. To find the minor Python version of your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.

    Note

    If you want to use PySpark UDFs, it’s important that your development machine’s installed minor version of Python match the minor version of Python that is included with Databricks Runtime installed on the cluster.

  • The Databricks Connect major and minor package version should match your Databricks Runtime version. Databricks recommends that you always use the most recent package of Databricks Connect that matches your Databricks Runtime version. For example, when you use a Databricks Runtime 14.0 cluster, you should also use the 14.0 version of the databricks-connect package.

    Note

    See the Databricks Connect release notes for a list of available Databricks Connect releases and maintenance updates.

    Using the most recent package of Databricks Connect that matches your Databricks Runtime version is not a requirement. For Databricks Runtime 13.3 LTS and above, you can use the Databricks Connect package against all versions of Databricks Runtime at or above the version of the Databricks Connect package. However, if you want to use features that are available in later versions of the Databricks Runtime, you must upgrade the Databricks Connect package accordingly.

  • Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together, which can reduce the time spent resolving related technical issues. See how to activate a Python virtual environment for venv or Poetry in the following sections. For more information about these tools, see venv or Poetry.

Activate a Python virtual environment with venv

If you’re using venv on your development machine and your cluster is running Python 3.10, you must create a venv environment with that version. The following example command generates the scripts to activate a venv environment with Python 3.10, and this command then places those scripts within a hidden folder named .venv within the current working directory:

# Linux and macOS
python3.10 -m venv ./.venv

# Windows
python3.10 -m venv .\.venv

To use these scripts to activate this venv environment, see How venvs work.
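
For reference, activation typically looks like the following, assuming the .venv folder created by the previous commands:

# Linux and macOS
source ./.venv/bin/activate

# Windows (Command Prompt)
.venv\Scripts\activate.bat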

Skip ahead to Set up the client.

Activate a Python virtual environment with Poetry

  1. Install Poetry, if you have not done so already.

  2. If you’re using Poetry on your development machine and your cluster is running Python 3.10, you must create a Poetry virtual environment with that version. From the root directory of your existing Python code project, instruct Poetry to initialize the project by running the following command:

    poetry init
    
  3. Poetry displays several prompts for you to complete. None of these prompts are specific to Databricks Connect. For information about these prompts, see init.

  4. After you complete the prompts, Poetry adds a pyproject.toml file to your Python project. For information about the pyproject.toml file, see The pyproject.toml file.

  5. From the root directory of your Python code project, instruct poetry to read the pyproject.toml file, resolve the dependencies and install them, create a poetry.lock file to lock the dependencies, and finally to create a virtual environment. To do this, run the following command:

    poetry install
    
  6. From the root directory of your Python code project, instruct poetry to activate the virtual environment and enter the shell. To do this, run the following command:

    poetry shell
    

You will know that your virtual environment is activated and the shell is entered when the virtual environment’s name displays in parentheses just before your terminal prompt, for example (my-project-py3.10).

To deactivate the virtual environment and exit the shell at any time, run the command exit.

You will know that you have exited the shell when the virtual environment’s name no longer displays in parentheses just before your terminal prompt.

For more information about creating and managing Poetry virtual environments, see Managing environments.
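
For example, you can confirm which interpreter and virtual environment Poetry is currently using by running:

poetry env info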

Set up the client

Tip

If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions.

The Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.0 and above. Skip ahead to Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.

After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.

Step 1: Install the Databricks Connect client

This section describes how to install the Databricks Connect client with venv or Poetry.

Install the Databricks Connect client with venv

  1. With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

    # Is PySpark already installed?
    pip3 show pyspark
    
    # Uninstall PySpark
    pip3 uninstall pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

    pip3 install --upgrade "databricks-connect==14.0.*"  # Or X.Y.* to match your cluster version.
    

    Note

    Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
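
    To confirm the client version that was actually installed, you can run the show command again, this time for the databricks-connect package:

    # Confirm the installed Databricks Connect version
    pip3 show databricks-connect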

Skip ahead to Step 2: Configure connection properties.

Install the Databricks Connect client with Poetry

  1. With your virtual environment activated, uninstall PySpark, if it is already installed, by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

    # Is PySpark already installed?
    poetry show pyspark
    
    # Uninstall PySpark
    poetry remove pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the add command.

    poetry add databricks-connect@~14.0  # Or X.Y to match your cluster version.
    

    Note

    Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~14.0 instead of databricks-connect==14.0, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
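
    To confirm the client version that Poetry resolved and installed, you can run the show command for the databricks-connect package:

    # Confirm the installed Databricks Connect version
    poetry show databricks-connect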

Step 2: Configure connection properties

In this section, you configure properties to establish a connection between Databricks Connect and your remote Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.

For Databricks Connect for Databricks Runtime 13.1 and above, Databricks Connect includes the Databricks SDK for Python. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes.
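
For example, with Databricks personal access token authentication, a configuration profile in your .databrickscfg file (referenced in the options later in this section) might look like the following, with placeholder values:

[DEFAULT]
host       = https://<workspace-instance-name>
token      = <access-token-value>
cluster_id = <cluster-id>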

  1. Collect the following configuration properties: your Databricks workspace instance name, the ID of your cluster, and any other values (such as a Databricks personal access token) that are required for the Databricks authentication type that you want to use.

  2. Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options. The details for each option appear after the following table:

    Configuration properties option | Applies to
    --------------------------------------------------------------- | ----------------------------------------------------
    1. The DatabricksSession class’s remote() method | Databricks personal access token authentication only
    2. A Databricks configuration profile | All Databricks authentication types
    3. The SPARK_REMOTE environment variable | Databricks personal access token authentication only
    4. The DATABRICKS_CONFIG_PROFILE environment variable | All Databricks authentication types
    5. An environment variable for each configuration property | All Databricks authentication types
    6. A Databricks configuration profile named DEFAULT | All Databricks authentication types

    1. The DatabricksSession class’s remote() method

      For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.

      You can initialize the DatabricksSession class in several ways, as follows:

      • Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().

      • Use the Databricks SDK’s Config class.

      • Specify a Databricks configuration profile along with the cluster_id field.

      • Set the Spark Connect connection string in DatabricksSession.builder.remote().

      Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions yourself to get the necessary properties from the user or from some other configuration store, such as AWS Systems Manager Parameter Store. A minimal sketch of such helpers appears after these examples.

      The code for each of these approaches is as follows:

      # Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
      # If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      # cluster's ID, you do not also need to set the cluster_id field here.
      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.remote(
        host       = f"https://{retrieve_workspace_instance_name()}",
        token      = retrieve_token(),
        cluster_id = retrieve_cluster_id()
      ).getOrCreate()
      
      # Use the Databricks SDK's Config class.
      # If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      # cluster's ID, you do not also need to set the cluster_id field here.
      from databricks.connect import DatabricksSession
      from databricks.sdk.core import Config
      
      config = Config(
        host       = f"https://{retrieve_workspace_instance_name()}",
        token      = retrieve_token(),
        cluster_id = retrieve_cluster_id()
      )
      
      spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
      
      # Specify a Databricks configuration profile along with the `cluster_id` field.
      # If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      # cluster's ID, you do not also need to set the cluster_id field here.
      from databricks.connect import DatabricksSession
      from databricks.sdk.core import Config
      
      config = Config(
        profile    = "<profile-name>",
        cluster_id = retrieve_cluster_id()
      )
      
      spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
      
      # Set the Spark Connect connection string in DatabricksSession.builder.remote.
      from databricks.connect import DatabricksSession
      
      workspace_instance_name = retrieve_workspace_instance_name()
      token                   = retrieve_token()
      cluster_id              = retrieve_cluster_id()
      
      spark = DatabricksSession.builder.remote(
        f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
      ).getOrCreate()
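
      The retrieve_* helpers above are not part of Databricks Connect; you supply them yourself. As a minimal sketch, assuming you choose to store the values in environment variables of your own naming, they could look like this:

      import os
      
      # Hypothetical helpers for illustration only. The environment variable
      # names here are assumptions of this sketch, not variables that
      # Databricks Connect itself reads.
      def retrieve_workspace_instance_name():
        return os.environ["MY_WORKSPACE_INSTANCE_NAME"]
      
      def retrieve_token():
        return os.environ["MY_ACCESS_TOKEN"]
      
      def retrieve_cluster_id():
        return os.environ["MY_CLUSTER_ID"]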
      
    2. A Databricks configuration profile

      For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

      The required configuration profile fields for each authentication type are as follows:

      Then set the name of this configuration profile through the Config class.

      Note

      You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.
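
      For example, a typical invocation (with a placeholder workspace URL) looks like the following; the command prompts you to choose a cluster and writes its ID to the profile:

      databricks auth login --configure-cluster --host https://<workspace-instance-name>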

      You can specify cluster_id in a few ways, as follows:

      • Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.

      • Specify the configuration profile name along with the cluster_id field.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

      The code for each of these approaches is as follows:

      # Include the cluster_id field in your configuration profile, and then
      # just specify the configuration profile's name:
      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()
      
      # Specify the configuration profile name along with the cluster_id field.
      # In this example, retrieve_cluster_id() assumes some custom implementation that
      # you provide to get the cluster ID from the user or from some other
      # configuration store:
      from databricks.connect import DatabricksSession
      from databricks.sdk.core import Config
      
      config = Config(
        profile    = "<profile-name>",
        cluster_id = retrieve_cluster_id()
      )
      
      spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
      
    3. The SPARK_REMOTE environment variable

      For this option, which applies to Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.

      sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
      

      Then initialize the DatabricksSession class as follows:

      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.getOrCreate()
      

      To set environment variables, see your operating system’s documentation.
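
      For example, in a bash or zsh shell on Linux or macOS, you could set the variable for the current session as follows, using the same placeholders:

      export SPARK_REMOTE="sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"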

    4. The DATABRICKS_CONFIG_PROFILE environment variable

      For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

      The required configuration profile fields for each authentication type are as follows:

      Note

      You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.

      Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.getOrCreate()
      

      To set environment variables, see your operating system’s documentation.

    5. An environment variable for each configuration property

      For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.

      The required environment variables for each authentication type are as follows:
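
      For example, for Databricks personal access token authentication, you set DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID. In a bash or zsh shell, that could look like this, using the same placeholders as the earlier options:

      export DATABRICKS_HOST="https://<workspace-instance-name>"
      export DATABRICKS_TOKEN="<access-token-value>"
      export DATABRICKS_CLUSTER_ID="<cluster-id>"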

      Then initialize the DatabricksSession class as follows:

      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.getOrCreate()
      

      To set environment variables, see your operating system’s documentation.

    6. A Databricks configuration profile named DEFAULT

      For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

      The required configuration profile fields for each authentication type are as follows:

      Name this configuration profile DEFAULT.

      Note

      You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to the DEFAULT configuration profile. For more information, run the command databricks auth login -h.

      Then initialize the DatabricksSession class as follows:

      from databricks.connect import DatabricksSession
      
      spark = DatabricksSession.builder.getOrCreate()
      
  3. Validate your environment and the connection to the Databricks cluster.

    • The following command verifies that your environment, default credentials, and connection to the cluster are all correctly set up for Databricks Connect.

      databricks-connect test
      

      This command picks up the default credentials configured on the environment (such as the DEFAULT configuration profile or through environment variables).

      The command fails with a non-zero exit code and a corresponding error message when it detects any incompatibility in the setup.

    • You can also use the pyspark shell that is included with Databricks Connect for Python. Start the shell by running:

      pyspark
      

      The Spark shell appears, for example:

      Python 3.10 ...
      [Clang ...] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 13.0
            /_/
      
      Using Python version 3.10 ...
      Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
      SparkSession available as 'spark'.
      >>>
      

      At the >>> prompt, run a simple PySpark command, such as spark.range(1,10).show(). If there are no errors, you have successfully connected.

      If you have successfully connected, to stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().

      For more details on the databricks-connect binary, see Advanced usage of Databricks Connect for Python.