Databricks Connect reference
Note
This article covers Databricks Connect for Databricks Runtime 13.0 and above.
Python support is generally available in Databricks Connect for Databricks Runtime 13.0 and above. Scala support is in Public Preview and is available only in Databricks Connect for Databricks Runtime 13.3 LTS and above.
To learn how to quickly get started with Databricks Connect for Databricks Runtime 13.0 and above, see Databricks Connect.
For information about Databricks Connect for prior Databricks Runtime versions, see Databricks Connect for Databricks Runtime 12.2 LTS and below.
Databricks Connect allows you to connect popular IDEs such as Visual Studio Code, PyCharm, and IntelliJ IDEA, notebook servers, and other custom applications to Databricks clusters.
This article explains how Databricks Connect works, walks you through the steps to get started with Databricks Connect, and explains how to troubleshoot issues that may arise when using Databricks Connect.
Overview
Databricks Connect is a client library for the Databricks Runtime. It allows you to write code using Spark APIs and run it remotely on a Databricks cluster instead of in the local Spark session.
For example, when you run the DataFrame command spark.read.format(...).load(...).groupBy(...).agg(...).show()
using Databricks Connect, the logical representation of the command is sent to the Spark server running in Databricks for execution on the remote cluster.
With Databricks Connect, you can:
Run large-scale Spark code from any Python or Scala application. Anywhere you can import pyspark for Python or import org.apache.spark for Scala, you can now run Spark code directly from your application, without needing to install any IDE plugins or use Spark submission scripts.

Note
Databricks Connect for Databricks Runtime 13.0 and above supports running Python applications. Scala is supported only in Databricks Connect for Databricks Runtime 13.3 LTS and above.
Step through and debug code in your IDE even when working with a remote cluster.
Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Scala library dependencies in Databricks Connect, because each client session is isolated from the others in the cluster.
Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs, and DataFrame objects defined in a notebook.
For Databricks Runtime 13.0 and above, Databricks Connect is now built on open-source Spark Connect. Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. With this “V2” architecture based on Spark Connect, Databricks Connect becomes a thin client that is simple and easy to use. Spark Connect can be embedded everywhere to connect to Databricks: in IDEs, notebooks, and applications, allowing individual users and partners alike to build new (interactive) user experiences based on the Databricks Lakehouse. For more information about Spark Connect, see Introducing Spark Connect.
The following figure shows where your code runs and where it is debugged when you use Databricks Connect.

For running code: All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
For debugging code: All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
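To make that split concrete, here is a minimal sketch (not from the original article) that assumes default authentication is already configured and that the samples.nyctaxi.trips sample table is available in your workspace:

from databricks.connect import DatabricksSession

# Plain Python: runs locally on your development machine.
threshold_miles = 10

# Builds a Spark Connect session against the remote cluster.
spark = DatabricksSession.builder.getOrCreate()

# These DataFrame operations are sent to the cluster as a logical plan
# and executed remotely.
df = spark.read.table("samples.nyctaxi.trips")
long_trips = df.where(df.trip_distance > threshold_miles)

# show() triggers remote execution; the resulting rows are returned to
# and printed by the local Python process.
long_trips.show(5)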
Requirements
This section lists the requirements for Databricks Connect.
Python requirements
A Databricks workspace and its corresponding account that are enabled for Unity Catalog. See Get started using Unity Catalog and Enable a workspace for Unity Catalog.
A cluster with Databricks Runtime 13.0 or higher installed.
Only clusters that are compatible with Unity Catalog are supported. These include clusters with assigned or shared access modes. See Access modes.
You must install Python 3 on your development machine, and the minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster. To find the minor Python version of your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
Note
If you want to use PySpark UDFs, it’s important that your development machine’s installed minor version of Python match the minor version of Python that is included with Databricks Runtime installed on the cluster.
Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. This can help to reduce the time spent resolving related technical issues.
For example, if you’re using venv on your development machine and your cluster is running Python 3.10, you must create a venv environment with that version. The following example command generates the scripts to activate a venv environment with Python 3.10, and then places those scripts within a hidden folder named .venv within the current working directory:

# Linux and macOS
python3.10 -m venv ./.venv

# Windows
python3.10 -m venv .\.venv
To use these scripts to activate this venv environment, see How venvs work.

The Databricks Connect major and minor package version should match your Databricks Runtime version. Databricks recommends that you always use the most recent package of Databricks Connect that matches your Databricks Runtime version. For example, when you use a Databricks Runtime 13.3 LTS cluster, you should also use the databricks-connect==13.3.* package.

Note
See the Databricks Connect release notes for a list of available Databricks Connect releases and maintenance updates.
Using the most recent package of Databricks Connect that matches your Databricks Runtime version is not a requirement. For Databricks Runtime 13.0 and above, you can use the Databricks Connect package against all versions of Databricks Runtime at or above the version of the Databricks Connect package. However, if you want to use features that are available in later versions of the Databricks Runtime, you must upgrade the Databricks Connect package accordingly.
Skip ahead to Set up the client.
Scala requirements
A Databricks workspace and its corresponding account that are enabled for Unity Catalog. See Get started using Unity Catalog and Enable a workspace for Unity Catalog.
A cluster with Databricks Runtime 13.3 LTS or above installed. The cluster must use a cluster access mode of Single User or Shared. See Access modes.
Note
Scala is not supported on Databricks Connect for Databricks Runtime 13.2 and below.
The Java Development Kit (JDK) installed on your development machine. Databricks recommends that the version of your JDK installation matches the JDK version on your Databricks cluster. To find the JDK version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. For instance, Zulu 8.70.0.23-CA-linux64 corresponds to JDK 8. See Databricks Runtime release notes versions and compatibility.

Scala installed on your development machine. Databricks recommends that the version of your Scala installation matches the Scala version on your Databricks cluster. To find the Scala version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.

A Scala build tool on your development machine, such as sbt.
Set up the client
Complete the following steps to set up the local client for Databricks Connect.
Python client setup
Note
Before you begin to set up the local Databricks Connect client, you must meet the requirements for Databricks Connect.
Tip
If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions.
The Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.0 and above. Skip ahead to Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.
Step 1: Install the Databricks Connect client
With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

# Is PySpark already installed?
pip3 show pyspark

# Uninstall PySpark
pip3 uninstall pyspark
With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

pip3 install --upgrade "databricks-connect==13.3.*"  # Or X.Y.* to match your cluster version.

Note

Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Step 2: Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your remote Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.
For Databricks Connect for Databricks Runtime 13.1 and above, Databricks Connect includes the Databricks SDK for Python. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes.
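As an illustrative sketch of that reuse (not from the original article, and assuming a configuration profile named DEFAULT that also contains a cluster_id field), the same configuration can drive both the Databricks SDK for Python and Databricks Connect:

from databricks.sdk import WorkspaceClient
from databricks.connect import DatabricksSession

# Both clients resolve credentials through Databricks client unified
# authentication, here from the assumed DEFAULT profile.
w = WorkspaceClient(profile="DEFAULT")
print(w.current_user.me().user_name)  # REST API call through the SDK

spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()
spark.range(3).show()  # Spark Connect call to the cluster identified by cluster_id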
Note
Databricks Connect for Databricks Runtime 13.0 supports only Databricks personal access token authentication.
Collect the following configuration properties.
The Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for a cluster.
The ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
Any other properties that are necessary for the supported Databricks authentication type that you want to use. These properties are described throughout this section.
Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options:
For Databricks personal access token authentication only, direct configuration of connection properties, specified through the DatabricksSession class

For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.

The following code examples demonstrate how to initialize the DatabricksSession class for Databricks personal access token authentication.

Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions yourself to get the necessary properties from the user or from some other configuration store, such as AWS Systems Manager Parameter Store.

# By setting fields in builder.remote:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
  host       = f"https://{retrieve_workspace_instance_name()}",
  token      = retrieve_token(),
  cluster_id = retrieve_cluster_id()
).getOrCreate()

# Or, by using the Databricks SDK's Config class:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  host       = f"https://{retrieve_workspace_instance_name()}",
  token      = retrieve_token(),
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Or, specify a Databricks configuration profile and
# the cluster_id field separately:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  profile    = "<profile-name>",
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Or, by setting the Spark Connect connection string in builder.remote:
from databricks.connect import DatabricksSession

workspace_instance_name = retrieve_workspace_instance_name()
token                   = retrieve_token()
cluster_id              = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
  f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()
For all Databricks authentication types, a Databricks configuration profile name, specified using profile()

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Then set the name of this configuration profile through the Config class.

Note

You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.

Instead of specifying cluster_id in your configuration profile, you can specify the cluster ID in your code, separately from the configuration profile. To do so, the second code example in the following code block assumes that you provide some implementation of the proposed retrieve_cluster_id function yourself to get the cluster ID from the user or from some other configuration store, such as AWS Systems Manager Parameter Store.

For example:

# Specify a Databricks configuration profile that contains the
# cluster_id field:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()

# Or, specify the cluster ID separate from the configuration profile:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
  profile    = "<profile-name>",
  cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
For Databricks personal access token authentication only, the SPARK_REMOTE environment variable

For this option, which applies to Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
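For a quick local experiment, you can also set the variable from Python itself before building the session. This is only a sketch, and it assumes the variable is read from the process environment at session-build time; for real projects, set the variable in your shell or operating system as described above.

import os
from databricks.connect import DatabricksSession

# Set SPARK_REMOTE in-process; replace the placeholders with your values.
os.environ["SPARK_REMOTE"] = (
    "sc://<workspace-instance-name>:443/"
    ";token=<access-token-value>"
    ";x-databricks-cluster-id=<cluster-id>"
)

spark = DatabricksSession.builder.getOrCreate()
spark.range(3).show()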
For all Databricks authentication types, the DATABRICKS_CONFIG_PROFILE environment variable

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Note

You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.

Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
For all Databricks authentication types, an environment variable for each connection property

For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:

For Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
For basic authentication: DATABRICKS_HOST, DATABRICKS_USERNAME, and DATABRICKS_PASSWORD.
For OAuth machine-to-machine (M2M) authentication: DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
For OAuth user-to-machine (U2M) authentication: DATABRICKS_HOST.

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
To set environment variables, see your operating system’s documentation.
For all Databricks authentication types, a Databricks configuration profile named DEFAULT

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Name this configuration profile DEFAULT.

Note

You can use the auth login command’s --configure-cluster option to automatically add the cluster_id field to the DEFAULT configuration profile. For more information, run the command databricks auth login -h.

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
If you choose to use Databricks personal access token authentication, you can use the included pyspark utility to test connectivity to your Databricks cluster as follows.

With your virtual environment still activated, run one of the following commands:

If you set the SPARK_REMOTE environment variable earlier, run the following command:

pyspark

If you did not set the SPARK_REMOTE environment variable earlier, run the following command instead:

pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"

The Spark shell appears, for example:

Python 3.10 ... [Clang ...] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 13.0
      /_/

Using Python version 3.10 ...
Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
SparkSession available as 'spark'.
>>>

At the >>> prompt, run a simple PySpark command, such as spark.range(1,10).show(). If there are no errors, you have successfully connected.

If you have successfully connected, to stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().
Skip ahead to Use Databricks Connect.
Scala client setup
Note
Before you begin to set up the local Databricks Connect client, you must meet the requirements for Databricks Connect.
Step 1: Add a reference to the Databricks Connect client
In your Scala project’s build file, such as build.sbt for sbt, pom.xml for Maven, or build.gradle for Gradle, add the following reference to the Databricks Connect client:

Sbt:

libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"

Maven:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect</artifactId>
  <version>13.3.0</version>
</dependency>

Gradle:

implementation 'com.databricks:databricks-connect:13.3.0'

Replace 13.3.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.
Step 2: Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your remote Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.
For Databricks Connect for Databricks Runtime 13.3 LTS and above, for Scala, Databricks Connect includes the Databricks SDK for Java. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes.
Collect the following configuration properties.
The Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for a cluster.
The ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
Any other properties that are necessary for the supported Databricks authentication type. These properties are described throughout this section.
Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options:
For Databricks personal access token authentication only, direct configuration of connection properties, specified through the DatabricksSession class

For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.

The following code examples demonstrate how to initialize the DatabricksSession class for Databricks personal access token authentication.

Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve* functions yourself to get the necessary properties from the user or from some other configuration store, such as AWS Systems Manager Parameter Store.

// By setting fields in builder():
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder()
  .host(retrieveWorkspaceInstanceName())
  .token(retrieveToken())
  .clusterId(retrieveClusterId())
  .getOrCreate()

// Or, by using the Databricks SDK's Config class:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setHost(retrieveWorkspaceInstanceName())
  .setToken(retrieveToken())

val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()

// Or, specify a Databricks configuration profile and
// the clusterId field separately:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")

val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
For all Databricks authentication types, a Databricks configuration profile name, specified using setProfile()

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Then set the name of this configuration profile through the DatabricksConfig class.

Note

You can use the auth login command’s --configure-cluster option in Databricks CLI versions 0.200.1 and above to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.

Alternatively, you can specify cluster_id separately from the configuration profile. Instead of directly specifying the cluster ID in your code, the following code example assumes that you provide some implementation of the proposed retrieveClusterId function yourself to get the cluster ID from the user or from some other configuration store, such as AWS Systems Manager Parameter Store.

For example:

// Specify a Databricks configuration profile that contains the
// cluster_id field:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")

val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .getOrCreate()

// Or, specify the cluster ID separate from the configuration profile:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")

val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
For Databricks personal access token authentication only, the SPARK_REMOTE environment variable

For this option, which applies to Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class as follows:

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
To set environment variables, see your operating system’s documentation.
For all Databricks authentication types, the DATABRICKS_CONFIG_PROFILE environment variable

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Note

You can use the auth login command’s --configure-cluster option in Databricks CLI versions 0.200.1 and above to automatically add the cluster_id field to a new or existing configuration profile. For more information, run the command databricks auth login -h.

Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
To set environment variables, see your operating system’s documentation.
For all Databricks authentication types, an environment variable for each connection property

For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the supported Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:

For Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
For basic authentication: DATABRICKS_HOST, DATABRICKS_USERNAME, and DATABRICKS_PASSWORD.
For OAuth machine-to-machine (M2M) authentication: DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
For OAuth user-to-machine (U2M) authentication: DATABRICKS_HOST.

Then initialize the DatabricksSession class as follows:

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
To set environment variables, see your operating system’s documentation.
For all Databricks authentication types, a Databricks configuration profile named DEFAULT

For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Databricks personal access token authentication: host and token.
For basic authentication: host, username, and password.
For OAuth machine-to-machine (M2M) authentication: host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication: host.

Name this configuration profile DEFAULT.

Note

You can use the auth login command’s --configure-cluster option in Databricks CLI versions 0.200.1 and above to automatically add the cluster_id field to the DEFAULT configuration profile. For more information, run the command databricks auth login -h.

Then initialize the DatabricksSession class as follows:

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
Use Databricks Connect
These sections describe how to configure many popular IDEs and notebook servers to use the Databricks Connect client. Or, for Python, you can use the built-in Spark shell.
JupyterLab with Python
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
To use Databricks Connect with JupyterLab and Python, follow these instructions.
To install JupyterLab, with your Python virtual environment activated, run the following command from your terminal or Command Prompt:
pip3 install jupyterlab
To start JupyterLab in your web browser, run the following command from your activated Python virtual environment:
jupyter lab
If JupyterLab does not appear in your web browser, copy the URL that starts with localhost or 127.0.0.1 from your virtual environment, and enter it in your web browser’s address bar.

Create a new notebook: in JupyterLab, click File > New > Notebook on the main menu, select Python 3 (ipykernel), and click Select.

In the notebook’s first cell, enter either the example code or your own code. If you use your own code, at minimum you must initialize DatabricksSession as shown in the example code.

To run the notebook, click Run > Run All Cells. All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.

To debug the notebook, click the bug (Enable Debugger) icon next to Python 3 (ipykernel) in the notebook’s toolbar. Set one or more breakpoints, and then click Run > Run All Cells. All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.

To shut down JupyterLab, click File > Shut Down. If the JupyterLab process is still running in your terminal or Command Prompt, stop this process by pressing Ctrl + c and then entering y to confirm.
For more specific debug instructions, see Debugger.
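For example, a first notebook cell might contain the following sketch (an illustration, assuming default authentication is configured and the samples.nyctaxi.trips table is available); the query runs on the remote cluster, and toPandas() brings the result rows back into the local notebook kernel:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Runs on the remote cluster.
df = spark.read.table("samples.nyctaxi.trips").limit(100)

# Collects the 100 rows into a local pandas DataFrame in the notebook kernel.
pdf = df.toPandas()
pdf.head()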
Classic Jupyter Notebook with Python
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
To use Databricks Connect with classic Jupyter Notebook and Python, follow these instructions.
To install classic Jupyter Notebook, with your Python virtual environment activated, run the following command from your terminal or Command Prompt:
pip3 install notebook
To start classic Jupyter Notebook in your web browser, run the following command from your activated Python virtual environment:
jupyter notebook
If classic Jupyter Notebook does not appear in your web browser, copy the URL that starts with localhost or 127.0.0.1 from your virtual environment, and enter it in your web browser’s address bar.

Create a new notebook: in classic Jupyter Notebook, on the Files tab, click New > Python 3 (ipykernel).

In the notebook’s first cell, enter either the example code or your own code. If you use your own code, at minimum you must initialize DatabricksSession as shown in the example code.

To run the notebook, click Cell > Run All. All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.

To debug the notebook, add the following line of code at the beginning of your notebook:

from IPython.core.debugger import set_trace

And then call set_trace() to enter debug statements at that point of notebook execution. All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.

To shut down classic Jupyter Notebook, click File > Close and Halt. If the classic Jupyter Notebook process is still running in your terminal or Command Prompt, stop this process by pressing Ctrl + c and then entering y to confirm.
Visual Studio Code with Python
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
Tip
The Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.0 and above. See Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.
To use Databricks Connect with Visual Studio Code and Python, follow these instructions.
Start Visual Studio Code.
Open the folder that contains your Python virtual environment (File > Open Folder).
In the Visual Studio Code Terminal (View > Terminal), activate the virtual environment.
Set the current Python interpreter to be the one that is referenced from the virtual environment:
On the Command Palette (View > Command Palette), type Python: Select Interpreter, and then press Enter.

Select the path to the Python interpreter that is referenced from the virtual environment.
Add to the folder a Python code (.py) file that contains either the example code or your own code. If you use your own code, at minimum you must initialize DatabricksSession as shown in the example code.

To run the code, click Run > Run Without Debugging on the main menu. All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
To debug the code:
With the Python code file open, set any breakpoints where you want your code to pause while running.
Click the Run and Debug icon on the sidebar, or click View > Run on the main menu.
In the Run and Debug view, click the Run and Debug button.
Follow the on-screen instructions to start running and debugging the code.
All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
For more specific run and debug instructions, see Configure and run the debugger and Python debugging in VS Code.
Visual Studio Code with Scala
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
To use Databricks Connect and Visual Studio Code with the Scala (Metals) extension to create, run, and debug a sample Scala sbt
project, follow these instructions. You can also adapt this sample to your existing Scala projects.
Make sure that the Java Development Kit (JDK) and Scala are installed locally. Databricks recommends that your local JDK and Scala version match the version of the JDK and Scala on your Databricks cluster.
Make sure that the latest version of sbt is installed locally.

Install the Scala (Metals) extension for Visual Studio Code.
In Visual Studio Code, create a Scala project: In the Command Palette (View > Command Palette), run the command >Metals: New Scala Project.
In the Command Palette, choose the template named scala/hello-world.g8, and complete the on-screen instructions to finish creating the Scala project.
Add project build settings: In the Explorer view (View > Explorer), open the build.sbt file from the project’s root, replace the file’s contents with the following, and save the file:

scalaVersion := "2.12.15"

libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"

Replace 2.12.15 with your installed version of Scala, which should match the version that is included with the Databricks Runtime version on your cluster.

Replace 13.3.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.

Add Scala code: Open the src/main/scala/Main.scala file relative to the project’s root, replace the file’s contents with the following, and save the file:

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

object Main extends App {
  val spark = DatabricksSession.builder().remote().getOrCreate()
  val df = spark.read.table("samples.nyctaxi.trips")
  df.limit(5).show()
}
Build the project: Run the command >Metals: Import build from the Command Palette.
Add project run settings: In the Run & Debug view (View > Run), click the gear (Open ‘launch.json’) icon.
Add the following run configuration to the launch.json file, and then save the file:

{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "scala",
      "request": "launch",
      "name": "Scala: Run main class",
      "mainClass": "Main",
      "args": [],
      "jvmOptions": [],
      "env": {
        "DATABRICKS_HOST": "<workspace-instance-name>",
        "DATABRICKS_TOKEN": "<personal-access-token>",
        "DATABRICKS_CLUSTER_ID": "<cluster-id>"
      }
    }
  ]
}
Replace the following placeholders:

Replace <workspace-instance-name> with your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

Replace <personal-access-token> with the value of the Databricks personal access token for your Databricks workspace user. To create a personal access token for your workspace user, see Databricks personal access token authentication.

Replace <cluster-id> with the value of your cluster’s ID. To get a cluster’s ID, see Cluster URL and ID.
Note
This example uses Databricks personal access token authentication. For other supported Databricks authentication types that you can use, see Step 2: Configure connection properties.
Run the project: Click the play (Start Debugging) icon next to Scala: Run main class. In the Debug Console view (View > Debug Console), the first 5 rows of the samples.nyctaxi.trips table appear. All Scala code runs locally, while all Scala code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.

Debug the project: Set breakpoints in your code, and then click the play icon again. All Scala code is debugged locally, while all Scala code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
PyCharm with Python
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
IntelliJ IDEA Ultimate also provides plugin support for PyCharm with Python. For details, see Python plug-in for IntelliJ IDEA Ultimate.
To use Databricks Connect with PyCharm and Python, follow these instructions.
Start PyCharm.
Create a project: click File > New Project.
For Location, click the folder icon, and then select the path to your Python virtual environment.
Select Previously configured interpreter.
For Interpreter, click the ellipses.
Click System Interpreter.
For Interpreter, click the ellipses, and select the full path to the Python interpreter that is referenced from the virtual environment. Then click OK.
Click OK again.
Click Create.
Click Create from Existing Sources.
Add to the project a Python code (.py) file that contains either the example code or your own code. If you use your own code, at minimum you must initialize DatabricksSession as shown in the example code.

With the Python code file open, set any breakpoints where you want your code to pause while running.
To run the code, click Run > Run. All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
To debug the code, click Run > Debug. All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
Follow the on-screen instructions to start running or debugging the code.
For more specific run and debug instructions, see Run without any previous configuring and Debug.
IntelliJ IDEA with Scala
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
To use Databricks Connect and IntelliJ IDEA with the Scala plugin to create, run, and debug a sample Scala sbt
project, follow these instructions.
Make sure that the Java Development Kit (JDK) is installed locally. Databricks recommends that your local JDK version match the version of the JDK on your Databricks cluster.
Start IntelliJ IDEA.
Click File > New > Project.
Give your project some meaningful Name.
For Location, click the folder icon, and complete the on-screen directions to specify the path to your new Scala project.
For Language, click Scala.
For Build system, click sbt.
In the JDK drop-down list, select an existing installation of the JDK on your development machine that matches the JDK version on your cluster, or select Download JDK and follow the on-screen instructions to download a JDK that matches the JDK version on your cluster.
Note
Choosing a JDK install that is above or below the JDK version on your cluster might produce unexpected results, or your code might not run at all.
In the sbt drop-down list, select the latest version.
In the Scala drop-down list, select the version of Scala that matches the Scala version on your cluster.
Note
Choosing a Scala version that is below or above the Scala version on your cluster might produce unexpected results, or your code might not run at all.
For Package prefix, enter some package prefix value for your project’s sources, for example org.example.application.

Make sure the Add sample code box is checked.
Click Create.
Add the Databricks Connect package: with your new Scala project open, in your Project tool window (View > Tool Windows > Project), open the file named build.sbt, in project-name > target.

Add the following code to the end of the build.sbt file, which declares your project’s dependency on a specific version of the Databricks Connect library for Scala:

libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"

Replace 13.3.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.

Click the Load sbt changes notification icon to update your Scala project with the new library location and dependency.
Wait until the sbt progress indicator at the bottom of the IDE disappears. The sbt load process might take a few minutes to complete.

Add code: in your Project tool window, open the file named Main.scala, in project-name > src > main > scala.
package org.example.application

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = DatabricksSession.builder().remote().getOrCreate()
    val df = spark.read.table("samples.nyctaxi.trips")
    df.limit(5).show()
  }
}
Add environment variables: in the Project tool window, right-click the Main.scala file and click Modify Run Configuration.

For Environment variables, enter the following string:
DATABRICKS_HOST=<workspace-instance-name>;DATABRICKS_TOKEN=<personal-access-token>;DATABRICKS_CLUSTER_ID=<cluster-id>
In the preceding string, replace the following placeholders:
Replace <workspace-instance-name> with your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

Replace <personal-access-token> with the value of the Databricks personal access token for your Databricks workspace user. To create a personal access token for your workspace user, see Databricks personal access token authentication.

Replace <cluster-id> with the value of your cluster’s ID. To get a cluster’s ID, see Cluster URL and ID.
Note
This example uses Databricks personal access token authentication. For other supported Databricks authentication types that you can use, see Step 2: Configure connection properties.
Click OK.
Run the code: start the target cluster in your remote Databricks workspace.
After the cluster has started, on the main menu, click Run > Run ‘Main’.
In the Run tool window (View > Tool Windows > Run), in the Main tab, the first 5 rows of the samples.nyctaxi.trips table appear. All Scala code runs locally, while all Scala code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.

Debug the code: start the target cluster in your remote Databricks workspace, if it is not already running.
In the preceding code, click the gutter next to df.limit(5).show() to set a breakpoint.

After the cluster has started, on the main menu, click Run > Debug ‘Main’.
In the Debug tool window (View > Tool Windows > Debug), in the Console tab, click the calculator (Evaluate Expression) icon.
Enter the expression df.schema and click Evaluate to show the DataFrame’s schema.

In the Debug tool window’s sidebar, click the green arrow (Resume Program) icon.

In the Console pane, the first 5 rows of the samples.nyctaxi.trips table appear. All Scala code runs locally, while all Scala code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
Eclipse with PyDev
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
To use Databricks Connect and Eclipse with PyDev, follow these instructions.
Start Eclipse.
Create a project: click File > New > Project > PyDev > PyDev Project, and then click Next.
Specify a Project name.
For Project contents, specify the path to your Python virtual environment.
Click Please configure an interpreter before proceeding.
Click Manual config.
Click New > Browse for python/pypy exe.
Browse to and select the full path to the Python interpreter that is referenced from the virtual environment, and then click Open.
In the Select interpreter dialog, click OK.
In the Selection needed dialog, click OK.
In the Preferences dialog, click Apply and Close.
In the PyDev Project dialog, click Finish.
Click Open Perspective.
Add to the project a Python code (.py) file that contains either the example code or your own code. If you use your own code, at minimum you must initialize DatabricksSession as shown in the example code.

With the Python code file open, set any breakpoints where you want your code to pause while running.
To run the code, click Run > Run. All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
To debug the code, click Run > Debug. All Python code is debugged locally, while all PySpark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
For more specific run and debug instructions, see Running a Program.
Spark shell with Python
Note
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks Connect.
The Spark shell works with Databricks personal access token authentication only.
To use Databricks Connect with the Spark shell and Python, follow these instructions.
To start the Spark shell and to connect it to your running cluster, run one of the following commands from your activated Python virtual environment:
If you set the SPARK_REMOTE environment variable earlier, run the following command:

pyspark

If you did not set the SPARK_REMOTE environment variable earlier, run the following command instead:

pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"

The Spark shell appears, for example:

Python 3.10 ... [Clang ...] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 13.x.dev0
      /_/

Using Python version 3.10 ...
Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
SparkSession available as 'spark'.
>>>
Refer to Interactive Analysis with the Spark Shell for information about how to use the Spark shell with Python to run commands on your cluster.
Use the built-in spark variable to represent the SparkSession on your running cluster, for example:

>>> df = spark.read.table("samples.nyctaxi.trips")
>>> df.show(5)
+--------------------+---------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+---------------------+-------------+-----------+----------+-----------+
| 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
| 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
| 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
| 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
| 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
+--------------------+---------------------+-------------+-----------+----------+-----------+
only showing top 5 rows
All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.
To stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().
Code examples
Databricks provides several example applications that show how to use Databricks Connect. See the databricks-demos/dbconnect-examples repository in GitHub.
You can also use the following simpler code examples to experiment with Databricks Connect. These examples assume that you are using default authentication for Databricks Connect client setup.
This simple code example queries the specified table and then shows the specified table’s first 5 rows. To use a different table, adjust the call to spark.read.table.
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]): Unit = {
val spark = DatabricksSession.builder().getOrCreate()
val df = spark.read.table("samples.nyctaxi.trips")
df.limit(5).show()
}
}
This longer code example does the following:
Creates an in-memory DataFrame.
Creates a table with the name zzz_demo_temps_table within the default schema. If the table with this name already exists, the table is deleted first. To use a different schema or table, adjust the calls to spark.sql, temps.write.saveAsTable, or both.

Saves the DataFrame’s contents to the table.

Runs a SELECT query on the table’s contents.

Shows the query’s result.
Deletes the table.
from databricks.connect import DatabricksSession
from pyspark.sql.types import *
from datetime import date
spark = DatabricksSession.builder.getOrCreate()
# Create a Spark DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
StructField('AirportCode', StringType(), False),
StructField('Date', DateType(), False),
StructField('TempHighF', IntegerType(), False),
StructField('TempLowF', IntegerType(), False)
])
data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]
temps = spark.createDataFrame(data, schema)
# Create a table on the Databricks cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS zzz_demo_temps_table')
temps.write.saveAsTable('zzz_demo_temps_table')
# Query the table on the Databricks cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM zzz_demo_temps_table " \
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
"GROUP BY AirportCode, Date, TempHighF, TempLowF " \
"ORDER BY TempHighF DESC")
df_temps.show()
# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+
# Clean up by deleting the table from the Databricks cluster.
spark.sql('DROP TABLE zzz_demo_temps_table')
import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import java.time.LocalDate
object Main {
def main(args: Array[String]): Unit = {
val spark = DatabricksSession.builder().getOrCreate()
// Create a Spark DataFrame consisting of high and low temperatures
// by airport code and date.
val schema = StructType(
Seq(
StructField("AirportCode", StringType, false),
StructField("Date", DateType, false),
StructField("TempHighF", IntegerType, false),
StructField("TempLowF", IntegerType, false)
)
)
val data = Seq(
( "BLI", LocalDate.of(2021, 4, 3), 52, 43 ),
( "BLI", LocalDate.of(2021, 4, 2), 50, 38),
( "BLI", LocalDate.of(2021, 4, 1), 52, 41),
( "PDX", LocalDate.of(2021, 4, 3), 64, 45),
( "PDX", LocalDate.of(2021, 4, 2), 61, 41),
( "PDX", LocalDate.of(2021, 4, 1), 66, 39),
( "SEA", LocalDate.of(2021, 4, 3), 57, 43),
( "SEA", LocalDate.of(2021, 4, 2), 54, 39),
( "SEA", LocalDate.of(2021, 4, 1), 56, 41)
)
val temps = spark.createDataFrame(data).toDF(schema.fieldNames: _*)
// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default")
spark.sql("DROP TABLE IF EXISTS zzz_demo_temps_table")
temps.write.saveAsTable("zzz_demo_temps_table")
// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
val df_temps = spark.sql("SELECT * FROM zzz_demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC")
df_temps.show()
// Results:
// +-----------+----------+---------+--------+
// |AirportCode|      Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// |        PDX|2021-04-03|       64|      45|
// |        PDX|2021-04-02|       61|      41|
// |        SEA|2021-04-03|       57|      43|
// |        SEA|2021-04-02|       54|      39|
// +-----------+----------+---------+--------+
// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE zzz_demo_temps_table")
}
}
Migrate to the latest Databricks Connect
Follow these guidelines to migrate your existing Python code project or coding environment from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.0 and above.
Python migration to the latest Databricks Connect
Install the correct version of Python as listed in the requirements to match your Databricks cluster, if it is not already installed locally.
Upgrade your Python virtual environment to use the correct version of Python to match your cluster, if needed. For instructions, see your virtual environment provider’s documentation.
With your virtual environment activated, uninstall PySpark from your virtual environment:
pip3 uninstall pyspark
With your virtual environment still activated, uninstall Databricks Connect for Databricks Runtime 12.2 LTS and below:
pip3 uninstall databricks-connect
With your virtual environment still activated, install Databricks Connect for Databricks Runtime 13.0 and above:
pip3 install --upgrade "databricks-connect==13.1.*" # Or X.Y.* to match your cluster version.
Note
Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Update your Python code to initialize the spark variable (which represents an instantiation of the DatabricksSession class, similar to SparkSession in PySpark). For code examples, see Step 2: Configure connection properties and the minimal sketch below.
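For instance, a minimal sketch of the updated initialization, mirroring the simple code example earlier in this article:
from databricks.connect import DatabricksSession

# DatabricksSession replaces the local SparkSession; connection details are
# picked up from your Databricks configuration profile or environment variables.
spark = DatabricksSession.builder.getOrCreate()

# Existing PySpark DataFrame code then runs against the remote cluster.
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)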
Scala migration to the latest Databricks Connect
Install the correct version of the Java Development Kit (JDK) and Scala as listed in the requirements to match your Databricks cluster, if it is not already installed locally.
In your Scala project’s build file, such as build.sbt for sbt, pom.xml for Maven, or build.gradle for Gradle, update the following reference to the Databricks Connect client.

sbt:
libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"

Maven:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect</artifactId>
  <version>13.3.0</version>
</dependency>

Gradle:
implementation 'com.databricks:databricks-connect:13.3.0'
Replace 13.3.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.
Update your Scala code to initialize the spark variable (which represents an instantiation of the DatabricksSession class, similar to SparkSession in Spark). For code examples, see Code examples.
Access Databricks Utilities
The following sections describe how to use Databricks Connect to access Databricks Utilities.
Access Databricks Utilities for Python
This section describes how to use Databricks Connect for Python to access Databricks Utilities.
Use the WorkspaceClient class’s dbfs variable to access the Databricks File System (DBFS) utility through Databricks Utilities. This approach is similar to calling Databricks Utilities through the dbfs variable from a notebook within a workspace. The WorkspaceClient class belongs to the Databricks SDK for Python, which is included in Databricks Connect.
Use WorkspaceClient.secrets to access the Databricks Utilities secrets utility (see the example after this list).
Use WorkspaceClient.jobs to access the jobs utility.
Use WorkspaceClient.libraries to access the libraries utility.
No Databricks Utilities functionality other than the preceding utilities is available for Python projects.
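For example, a minimal sketch that uses WorkspaceClient.secrets to list the names of the workspace’s secret scopes (this assumes the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are already set, as described later in this section):
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Print the name of each secret scope defined in the workspace.
for scope in w.secrets.list_scopes():
    print(scope.name)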
Tip
You can also use the included Databricks SDK for Python to access any available Databricks REST API, not just the preceding Databricks Utilities APIs. See databricks-sdk on PyPI.
To initialize WorkspaceClient, you must provide enough information to authenticate the Databricks SDK with your workspace. For example, you can:
Hard-code the workspace URL and your access token directly within your code, and then initialize WorkspaceClient as follows. Although this option is supported, Databricks does not recommend it, as it can expose sensitive information, such as access tokens, if your code is checked into version control or otherwise shared:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host = "https://<workspace-instance-name>", token = "<access-token-value>")
Create or specify a configuration profile that contains the fields host and token, and then initialize WorkspaceClient as follows:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(profile = "<profile-name>")
Set the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN in the same way you set them for Databricks Connect, and then initialize WorkspaceClient as follows:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
The Databricks SDK for Python does not recognize the SPARK_REMOTE
environment variable for Databricks Connect.
For additional Databricks authentication options for the Databricks SDK for Python, as well as how to initialize AccountClient
within the Databricks SDKs to access available Databricks REST APIs at the account level instead of at the workspace level, see databricks-sdk on PyPI.
The following example shows how to use the Databricks SDK for Python to automate DBFS. This example creates a file named zzz_hello.txt
in the DBFS root within the workspace, writes data into the file, closes the file, reads the data from the file, and then deletes the file. This example assumes that the environment variables DATABRICKS_HOST
and DATABRICKS_TOKEN
have already been set:
from databricks.sdk import WorkspaceClient
import base64
w = WorkspaceClient()
file_path = "/zzz_hello.txt"
file_data = "Hello, Databricks!"
# The data must be base64-encoded before being written.
file_data_base64 = base64.b64encode(file_data.encode())
# Create the file.
file_handle = w.dbfs.create(
path = file_path,
overwrite = True
).handle
# Add the base64-encoded version of the data.
w.dbfs.add_block(
handle = file_handle,
data = file_data_base64.decode()
)
# Close the file after writing.
w.dbfs.close(handle = file_handle)
# Read the file's contents and then decode and print it.
response = w.dbfs.read(path = file_path)
print(base64.b64decode(response.data).decode())
# Delete the file.
w.dbfs.delete(path = file_path)
Access Databricks Utilities for Scala
This section describes how to use Databricks Connect for Scala to access Databricks Utilities.
Use DBUtils.getDBUtils to access the Databricks File System (DBFS) and secrets through Databricks Utilities. DBUtils.getDBUtils belongs to the Databricks Utilities for Scala library. The Databricks Utilities for Scala library must be included in your Scala project, separate from the Databricks Connect library for Scala, and it works only with Databricks Connect for Databricks Runtime 13.3 LTS and above.
No Databricks Utilities functionality other than the preceding utilities is available for Scala projects.
Authentication for the Databricks Utilities for Scala library is determined by initializing the DatabricksSession class in your Databricks Connect project for Scala.
In your Scala project’s build file, such as build.sbt for sbt, pom.xml for Maven, or build.gradle for Gradle, add the following reference to the Databricks Utilities for Scala library.

sbt:
libraryDependencies += "com.databricks" % "dbutils-scala" % "0.0.1"
Maven:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>dbutils-scala</artifactId>
  <version>0.0.1</version>
</dependency>
Gradle:
implementation 'com.databricks:dbutils-scala:0.0.1'
Replace 0.0.1
with the version of the Databricks Utilities for Scala library that corresponds to the Databricks Runtime version on your cluster. You can find the list of Databricks Utilities for Scala library version numbers and their corresponding Databricks Runtime versions in the Maven central repository.
Tip
You can also use the Databricks SDK for Java from Scala to access any available Databricks REST API, not just the preceding Databricks Utilities APIs. See the databricks/databricks-sdk-java repository in GitHub and also Use Scala with the Databricks SDK for Java.
Disabling Databricks Connect
Databricks Connect (and the underlying Spark Connect) services can be disabled on any given cluster. To disable the Databricks Connect service, set the following Spark configuration on the cluster.
spark.databricks.service.server.enabled false
Once disabled, any Databricks Connect queries reaching the cluster are rejected with an appropriate error message.
Asynchronous queries and interruptions
For Databricks Connect for Databricks Runtime 14.0 and above, query execution is more resilient to network and other interruptions when executing long-running queries. If the client program is interrupted, or the process is paused by the operating system for up to 5 minutes (for example, when a laptop lid is shut), the client reconnects to the still-running query. This also allows queries to run for longer than the previous limit of 1 hour.
Databricks Connect also provides the ability to interrupt running queries on demand, for example to save costs.
The following Python program interrupts a long-running query by using the interruptTag() API.
from databricks.connect import DatabricksSession
from time import sleep
import threading
session = DatabricksSession.builder.getOrCreate()
def thread_fn():
sleep(5)
session.interruptTag("interrupt-me")
# All subsequent DataFrame queries that use session will have this tag.
session.addTag("interrupt-me")
t = threading.Thread(target=thread_fn)
t.start()
df = <a long running DataFrame query>
df.show()
t.join()
The interruptAll()
API can also be used to interrupt all running queries in a given session.
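For example, a minimal sketch:
from databricks.connect import DatabricksSession

session = DatabricksSession.builder.getOrCreate()

# Interrupt every query currently running in this session, regardless of tags.
session.interruptAll()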
Set Hadoop configurations
On the client you can set Hadoop configurations using the spark.conf.set
API, which applies to SQL and DataFrame operations. Hadoop configurations set on the sparkContext
must be set in the cluster configuration or using a notebook. This is because configurations set on sparkContext
are not tied to user sessions but apply to the entire cluster.
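As a minimal sketch (the Hadoop property shown here is only a hypothetical placeholder; substitute the configuration your storage or file system actually requires):
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Session-scoped Hadoop configuration, applied to SQL and DataFrame operations
# in this Databricks Connect session. The key below is a placeholder example.
spark.conf.set("fs.s3a.requester.pays.enabled", "true")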
Troubleshooting
This section describes some common issues that you might encounter with Databricks Connect and how to resolve them.
Python version mismatch
Check that the Python version you are using locally has at least the same minor release as the version on the cluster (for example, 3.10.11 versus 3.10.10 is OK, but 3.10 versus 3.9 is not).
If you have multiple Python versions installed locally, ensure that Databricks Connect is using the right one by setting the PYSPARK_PYTHON
environment variable (for example, PYSPARK_PYTHON=python3
).
Conflicting PySpark installations
The databricks-connect
package conflicts with PySpark. Having both installed will cause errors when initializing the Spark context in Python. This can manifest in several ways, including “stream corrupted” or “class not found” errors. If you have PySpark installed in your Python environment, ensure it is uninstalled before installing databricks-connect. After uninstalling PySpark, make sure to fully re-install the Databricks Connect package:
pip3 uninstall pyspark
pip3 uninstall databricks-connect
pip3 install --upgrade "databricks-connect==13.1.*" # or X.Y.* to match your specific cluster version.
Conflicting or missing PATH entry for binaries
It is possible your PATH is configured so that commands like spark-shell run some other previously installed binary instead of the one provided with Databricks Connect. Make sure either that the Databricks Connect binaries take precedence or that you remove the previously installed ones.
If you can’t run commands like spark-shell
, it is also possible your PATH was not automatically set up by pip3 install
and you’ll need to add the installation bin
dir to your PATH manually. It’s possible to use Databricks Connect with IDEs even if this isn’t set up.
The filename, directory name, or volume label syntax is incorrect on Windows
If you are using Databricks Connect on Windows and see:
The filename, directory name, or volume label syntax is incorrect.
Databricks Connect was installed into a directory with a space in your path. You can work around this by either installing into a directory path without spaces, or configuring your path using the short name form.
Limitations
Databricks Connect does not support the following Databricks features and third-party platforms.
Python limitations
The following features are not supported for Databricks Connect for Databricks Runtime 13.0 and above unless otherwise specified.
DataSet objects
Pandas UDF: 13.0 only
Structured Streaming (except for forEachBatch): 13.0 only
Databricks Utilities: 13.0 only
Databricks authentication types except for Databricks personal access tokens: 13.0 only
SparkContext
RDDs
MLflow model inference with mlflow.pyfunc.spark_udf(spark...) (you can load the model locally with mlflow.pyfunc.load_model(<model>), or you can wrap it as a custom Pandas UDF)
Mosaic geospatials
CREATE TABLE <table-name> AS SELECT (instead, use spark.sql("SELECT ...").write.saveAsTable("table"); see the sketch after this list)
applyInPandas() and cogroup() running on single user clusters: 13.0 only
applyInPandas() and cogroup() running on shared clusters
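For instance, a minimal sketch of the saveAsTable workaround noted in the list above (the target table name zzz_trips_copy and the LIMIT clause are illustrative only):
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Instead of CREATE TABLE zzz_trips_copy AS SELECT ..., run the SELECT and
# save the result through the DataFrame writer.
spark.sql("SELECT * FROM samples.nyctaxi.trips LIMIT 10") \
    .write.saveAsTable("zzz_trips_copy")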
Scala limitations
The following features are not supported for Databricks Connect for Databricks Runtime 13.3 LTS and above unless otherwise specified. Scala is not supported for Databricks Connect for Databricks Runtime 13.2 and below.
UDFs
SparkContext
RDDs
CREATE TABLE <table-name> AS SELECT (instead, use spark.sql("SELECT ...").write.saveAsTable("table"))
Additionally:
The Scala typed APIs reduce(), groupByKey(), filter(), map(), mapPartitions(), flatMap(), and foreach() require the user to install the JAR containing the function as a library to the cluster.