Install Databricks Connect for Python
Note
This article covers Databricks Connect for Databricks Runtime 13.0 and above.
This article describes how to install Databricks Connect for Python. See What is Databricks Connect?. For the Scala version of this article, see Install Databricks Connect for Scala.
Requirements
Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.
You must install Python 3 on your development machine, and the minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster. To find the minor Python version of your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
Note
If you want to use PySpark UDFs, it’s important that your development machine’s installed minor version of Python match the minor version of Python that is included with Databricks Runtime installed on the cluster.
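To check your development machine’s minor Python version, you can run python3 --version, or use a short snippet like the following; it only prints the local version for you to compare against the release notes:

import sys

# Print the local Python version as major.minor, for example "3.10".
# Compare this with the Python version in the "System environment" section
# of your cluster's Databricks Runtime release notes.
print(f"{sys.version_info.major}.{sys.version_info.minor}")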
The Databricks Connect major and minor package version should match your Databricks Runtime version. Databricks recommends that you always use the most recent package of Databricks Connect that matches your Databricks Runtime version. For example, when you use a Databricks Runtime 14.0 cluster, you should also use the 14.0 version of the databricks-connect package.
Note
See the Databricks Connect release notes for a list of available Databricks Connect releases and maintenance updates.
Using the most recent package of Databricks Connect that matches your Databricks Runtime version is not a requirement. For Databricks Runtime 13.0 and above, you can use the Databricks Connect package against all versions of Databricks Runtime at or above the version of the Databricks Connect package. However, if you want to use features that are available in later versions of the Databricks Runtime, you must upgrade the Databricks Connect package accordingly.
Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. This can help you avoid, or shorten time spent resolving, related technical issues. See how to activate a Python virtual environment for venv or Poetry in the following sections. For more information about these tools, see venv or Poetry.
Activate a Python virtual environment with venv
If you’re using venv on your development machine and your cluster is running Python 3.10, you must create a venv environment with that version. The following example command generates the scripts to activate a venv environment with Python 3.10, and this command then places those scripts within a hidden folder named .venv within the current working directory:
# Linux and macOS
python3.10 -m venv ./.venv
# Windows
python3.10 -m venv .\.venv
To use these scripts to activate this venv environment, see How venvs work.
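For reference, activating the environment typically looks like the following; these are the standard venv activation commands, not anything specific to Databricks Connect:

# Linux and macOS
source ./.venv/bin/activate

# Windows
.\.venv\Scripts\activate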
Skip ahead to Set up the client.
Activate a Python virtual environment with Poetry
Install Poetry, if you have not done so already.
If you’re using Poetry on your development machine and your cluster is running Python 3.10, you must create a Poetry virtual environment with that version. From the root directory of your existing Python code project, instruct poetry to initialize your Python code project for Poetry, by running the following command:

poetry init
Poetry displays several prompts for you to complete. None of these prompts are specific to Databricks Connect. For information about these prompts, see init.
After you complete the prompts, Poetry adds a pyproject.toml file to your Python project. For information about the pyproject.toml file, see The pyproject.toml file.
From the root directory of your Python code project, instruct poetry to read the pyproject.toml file, resolve the dependencies and install them, create a poetry.lock file to lock the dependencies, and finally to create a virtual environment. To do this, run the following command:

poetry install
From the root directory of your Python code project, instruct poetry to activate the virtual environment and enter the shell. To do this, run the following command:

poetry shell

You will know that your virtual environment is activated and the shell is entered when the virtual environment’s name displays in parentheses just before your terminal prompt, for example (my-project-py3.10).
To deactivate the virtual environment and exit the shell at any time, run the command exit.
You will know that you have exited the shell when the virtual environment’s name no longer displays in parentheses just before your terminal prompt.
For more information about creating and managing Poetry virtual environments, see Managing environments.
Set up the client
Tip
If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions.
The Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.0 and above. Skip ahead to Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.
After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.
Step 1: Install the Databricks Connect client
This section describes how to install the Databricks Connect client with venv or Poetry.
Install the Databricks Connect client with venv
With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

# Is PySpark already installed?
pip3 show pyspark

# Uninstall PySpark
pip3 uninstall pyspark
With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

pip3 install --upgrade "databricks-connect==14.0.*"  # Or X.Y.* to match your cluster version.
Note
Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
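To confirm which version of databricks-connect is now installed in your virtual environment, one convenient check (not part of the official setup steps) uses Python’s standard importlib.metadata:

from importlib.metadata import version

# Print the installed databricks-connect package version, for example "14.0.1".
# The major.minor part should match, or be lower than, your cluster's
# Databricks Runtime version.
print(version("databricks-connect"))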
Skip ahead to Step 2: Configure connection properties.
Install the Databricks Connect client with Poetry
With your virtual environment activated, uninstall PySpark, if it is already installed, by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

# Is PySpark already installed?
poetry show pyspark

# Uninstall PySpark
poetry remove pyspark
With your virtual environment still activated, install the Databricks Connect client by running the add command.

poetry add databricks-connect@~14.0  # Or X.Y to match your cluster version.
Note
Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~14.0 instead of databricks-connect==14.0, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Step 2: Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your remote Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.
For Databricks Connect for Databricks Runtime 13.1 and above, Databricks Connect includes the Databricks SDK for Python. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes.
Note
Databricks Connect for Databricks Runtime 13.0 supports only Databricks personal access token authentication.
Collect the following configuration properties.
The Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for a cluster.
The ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
Any other properties that are necessary for the supported Databricks authentication type that you want to use. These properties are described throughout this section.
Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options. The details for each option appear after the following table:
Configuration properties option                             Applies to
The DatabricksSession class’s remote() method               Databricks personal access token authentication only
A Databricks configuration profile                          All Databricks authentication types
The SPARK_REMOTE environment variable                       Databricks personal access token authentication only
The DATABRICKS_CONFIG_PROFILE environment variable          All Databricks authentication types
An environment variable for each configuration property    All Databricks authentication types
A Databricks configuration profile named DEFAULT            All Databricks authentication types
The DatabricksSession class’s remote() method
For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.
You can initialize the DatabricksSession class in several ways, as follows:
Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().
Use the Databricks SDK’s Config class.
Specify a Databricks configuration profile along with the cluster_id field.
Set the Spark Connect connection string in DatabricksSession.builder.remote().
Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions yourself to get the necessary properties from the user or from some other configuration store, such as AWS Systems Manager Parameter Store.
The code for each of these approaches is as follows:
# Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
  host       = f"https://{retrieve_workspace_instance_name()}",
  token      = retrieve_token(),
  cluster_id = retrieve_cluster_id()
).getOrCreate()

# Use the Databricks SDK's Config class.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  host       = f"https://{retrieve_workspace_instance_name()}",
  token      = retrieve_token(),
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Specify a Databricks configuration profile along with the `cluster_id` field.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  profile    = "<profile-name>",
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Set the Spark Connect connection string in DatabricksSession.builder.remote.
from databricks.connect import DatabricksSession

workspace_instance_name = retrieve_workspace_instance_name()
token                   = retrieve_token()
cluster_id              = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
  f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()
A Databricks configuration profile
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For basic authentication (legacy): host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.
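For example, a configuration profile for Databricks personal access token authentication in your .databrickscfg file (typically in your home directory) might look like the following; the profile name and all values here are placeholders:

[<profile-name>]
host       = https://<workspace-instance-name>
token      = <access-token-value>
cluster_id = <cluster-id>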
Then set the name of this configuration profile through the Config class.
Note
You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.
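For example, assuming the Databricks CLI is installed, a command such as the following (with a placeholder workspace URL) starts the login flow with cluster configuration enabled:

databricks auth login --configure-cluster --host https://<workspace-instance-name>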
You can specify cluster_id in a few ways, as follows:
Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
Specify the configuration profile name along with the cluster_id field.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The code for each of these approaches is as follows:
# Include the cluster_id field in your configuration profile, and then
# just specify the configuration profile's name:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()

# Specify the configuration profile name along with the cluster_id field.
# In this example, retrieve_cluster_id() assumes some custom implementation that
# you provide to get the cluster ID from the user or from some other
# configuration store:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  profile    = "<profile-name>",
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
The SPARK_REMOTE environment variable
For this option, which applies to Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
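For example, on Linux or macOS you could set the variable for the current shell session as follows, using the same placeholders as above:

export SPARK_REMOTE="sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"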
The DATABRICKS_CONFIG_PROFILE environment variable
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For basic authentication (legacy): host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.
Note
You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.
Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
An environment variable for each configuration property
For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.
The required environment variables for each authentication type are as follows:
For Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
For basic authentication (legacy): DATABRICKS_HOST, DATABRICKS_USERNAME, and DATABRICKS_PASSWORD.
For OAuth machine-to-machine (M2M) authentication: DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
For OAuth user-to-machine (U2M) authentication: DATABRICKS_HOST.
Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
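For example, for Databricks personal access token authentication on Linux or macOS, you could set the variables for the current shell session as follows, with placeholder values:

export DATABRICKS_HOST="https://<workspace-instance-name>"
export DATABRICKS_TOKEN="<access-token-value>"
export DATABRICKS_CLUSTER_ID="<cluster-id>"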
A Databricks configuration profile named DEFAULT
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For basic authentication (legacy): host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.
Name this configuration profile DEFAULT.
Note
You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to the DEFAULT configuration profile. For more information, run the command databricks auth login -h.
Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
If you choose to use Databricks personal access token authentication, you can use the included pyspark utility to test connectivity to your Databricks cluster as follows.
With your virtual environment still activated, run one of the following commands:
If you set the SPARK_REMOTE environment variable earlier, run the following command:

pyspark

If you did not set the SPARK_REMOTE environment variable earlier, run the following command instead:

pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
The Spark shell appears, for example:
Python 3.10 ... [Clang ...] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 13.0
      /_/

Using Python version 3.10 ...
Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
SparkSession available as 'spark'.
>>>
At the >>> prompt, run a simple PySpark command, such as spark.range(1,10).show(). If there are no errors, you have successfully connected.
If you have successfully connected, to stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().
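If you prefer to verify connectivity from a standalone script rather than the pyspark shell, a minimal sketch, assuming your connection properties are already configured through one of the options above, is:

from databricks.connect import DatabricksSession

# Build a session from the configured connection properties, then run a
# simple query. If the numbers 1 through 9 print, the connection works.
spark = DatabricksSession.builder.getOrCreate()
spark.range(1, 10).show()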