Install Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to install Databricks Connect for Python. For an overview, see "What is Databricks Connect?" For the Scala version of this article, see Install Databricks Connect for Scala.

Requirements

To install Databricks Connect for Python, the following requirements must be met:

  • If you are connecting to serverless compute, your workspace must meet the requirements for serverless compute.

    Note

    Serverless compute is supported in Databricks Connect version 15.1 and above. In addition, Databricks Connect versions at or lower than the Databricks Runtime release on serverless are fully compatible. See Release notes. To verify whether your Databricks Connect version is compatible with serverless compute, see Validate the connection to Databricks.

  • If you are connecting to a cluster, your target cluster must meet the cluster configuration requirements, which include Databricks Runtime version requirements.

  • You must have Python 3 installed on your development machine, and its minor version must meet the version requirements in the table below.

    Compute type    Databricks Connect version    Compatible Python version
    Serverless      15.1 and above                3.11
    Cluster         15.1 and above                3.11
    Cluster         13.3 LTS to 14.3 LTS          3.10

  • If you want to use PySpark UDFs, your development machine’s installed minor version of Python must match the minor version of Python that is included with the Databricks Runtime installed on the cluster or serverless compute. To find the minor Python version of your cluster, refer to the System environment section of the Databricks Runtime release notes for your cluster or serverless compute. See Databricks Runtime release notes versions and compatibility and Serverless compute release notes.
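
The Python version requirements above can also be checked programmatically before installing anything. The following is a minimal sketch, not an official Databricks tool: the names PYTHON_REQUIREMENTS, required_python, and local_python_is_compatible are hypothetical, and the data is copied from the compatibility table in this article.

```python
import sys

# Compatibility table from the requirements above, keyed by compute type.
# Each entry: (lowest Connect version, highest Connect version or None, Python version).
# Hypothetical structure for illustration; the values come from this article's table.
PYTHON_REQUIREMENTS = {
    "serverless": [((15, 1), None, (3, 11))],
    "cluster": [((15, 1), None, (3, 11)), ((13, 3), (14, 3), (3, 10))],
}


def required_python(compute_type, connect_version):
    """Return the (major, minor) Python version the table lists, or None if unlisted."""
    for low, high, python in PYTHON_REQUIREMENTS[compute_type]:
        if connect_version >= low and (high is None or connect_version <= high):
            return python
    return None


def local_python_is_compatible(compute_type, connect_version):
    """Compare the running interpreter's major.minor version against the table."""
    return sys.version_info[:2] == required_python(compute_type, connect_version)


if __name__ == "__main__":
    print("Required:", required_python("cluster", (15, 4)))
    print("Local:", tuple(sys.version_info[:2]))
    print("Compatible:", local_python_is_compatible("cluster", (15, 4)))
```

Note that this only reflects the table. If you use PySpark UDFs, you still need to confirm the exact Python minor version in the release notes for your cluster or serverless compute.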

Activate a Python virtual environment

Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. For more information about these tools and how to activate them, see venv or Poetry.
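
To confirm that a virtual environment is actually active before you install the client, you can inspect the interpreter from Python itself. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes an environment created by venv or a recent virtualenv (which Poetry uses), where sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment directory,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


def describe_interpreter():
    """Summarize the interpreter that would be used with Databricks Connect."""
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "executable": sys.executable,
        "virtual_environment": in_virtual_environment(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```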

Install the Databricks Connect client

This section describes how to install the Databricks Connect client with venv or Poetry.

Note

If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions, because the Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.3 LTS and above. Skip to Debug code using Databricks Connect for the Databricks extension for Visual Studio Code.

Install the Databricks Connect client with venv

  1. With your virtual environment activated, check whether PySpark is already installed by running the show command. If it is installed, uninstall it by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations.

    # Is PySpark already installed?
    pip3 show pyspark
    
    # Uninstall PySpark
    pip3 uninstall pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

    pip3 install --upgrade "databricks-connect==15.4.*"  # Or X.Y.* to match your cluster version.
    

    Note

    Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package in the X.Y series is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
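
After running the steps above, you may want to verify the result from Python rather than by reading pip output. The sketch below uses the standard library's importlib.metadata to look up installed distributions; the helper names are hypothetical, and it assumes that a conflicting PySpark installation shows up as a separate distribution named pyspark, which is what the pip3 show pyspark check above relies on.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_release_series(installed, series):
    """True if an installed version such as '15.4.2' belongs to series 'X.Y' such as '15.4'."""
    return installed.split(".")[:2] == series.split(".")[:2]


def check_install(series):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if installed_version("pyspark") is not None:
        problems.append("pyspark is installed and conflicts with databricks-connect")
    client = installed_version("databricks-connect")
    if client is None:
        problems.append("databricks-connect is not installed")
    elif not matches_release_series(client, series):
        problems.append(f"databricks-connect {client} is not in the {series} series")
    return problems


if __name__ == "__main__":
    # "15.4" mirrors the install command above; substitute your own X.Y.
    for problem in check_install("15.4") or ["no problems found"]:
        print(problem)
```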

Install the Databricks Connect client with Poetry

  1. With your virtual environment activated, check whether PySpark is already installed by running the show command. If it is installed, uninstall it by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations.

    # Is PySpark already installed?
    poetry show pyspark
    
    # Uninstall PySpark
    poetry remove pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the add command.

    poetry add databricks-connect@~15.4  # Or X.Y to match your cluster version.
    

    Note

    Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~15.4 instead of databricks-connect==15.4, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
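
To make the “at-tilde” notation concrete: in Poetry, a tilde requirement with a major and minor version, such as ~15.4, allows any 15.4.x release but not 15.5.0. The following sketch illustrates that range arithmetic for plain numeric versions only. It is not Poetry's implementation, the function names are hypothetical, and it ignores pre-release and other suffixes.

```python
def tilde_range(requirement):
    """Expand a tilde requirement such as '~15.4' into (lower, upper) version tuples.

    The lower bound is inclusive and the upper bound is exclusive.
    """
    parts = [int(p) for p in requirement.lstrip("~").split(".")]
    lower = tuple(parts + [0] * (3 - len(parts)))
    if len(parts) == 1:
        # '~15' allows minor-level changes: >=15.0.0, <16.0.0
        upper = (parts[0] + 1, 0, 0)
    else:
        # '~15.4' and '~15.4.1' allow patch-level changes only: <15.5.0
        upper = (parts[0], parts[1] + 1, 0)
    return lower, upper


def satisfies(version, requirement):
    """True if a numeric version string falls inside the tilde range."""
    lower, upper = tilde_range(requirement)
    candidate = tuple(int(p) for p in version.split("."))
    candidate = candidate + (0,) * (3 - len(candidate))
    return lower <= candidate < upper


if __name__ == "__main__":
    print(tilde_range("~15.4"))
    print(satisfies("15.4.7", "~15.4"))
    print(satisfies("15.5.0", "~15.4"))
```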

Next steps

After you have installed Databricks Connect, you need to configure a connection to Databricks. See Compute configuration for Databricks Connect.