What is Databricks Connect?

note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

For information about the legacy version of Databricks Connect, see Databricks Connect for Databricks Runtime 12.2 LTS and below.

Databricks Connect is a client library for the Databricks Runtime that allows you to connect popular IDEs such as Visual Studio Code, PyCharm, RStudio Desktop, IntelliJ IDEA, notebook servers, and other custom applications to Databricks compute.

For Databricks Runtime 13.3 LTS and above, Databricks Connect is built on open-source Spark Connect, which has a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol.
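As a minimal illustration of that architecture, the open-source PySpark client (version 3.4 and above, installed with Spark Connect support) can attach to any Spark Connect server by URL. The host and port below are assumptions for a locally running server on the default Spark Connect port:

    from pyspark.sql import SparkSession

    # Connect to a Spark Connect server; "sc://localhost:15002" assumes a
    # local server on the default Spark Connect port.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # The client builds an unresolved logical plan for this query and sends
    # it to the server; execution happens remotely.
    spark.range(5).selectExpr("id * 2 AS doubled").show()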

Databricks Connect is available for the following languages:

  • Python
  • R
  • Scala

note

The Databricks extension for Visual Studio Code includes Databricks Connect, so you do not need to install Databricks Connect if you have installed the Databricks extension for Visual Studio Code. See Debug code using Databricks Connect for the Databricks extension for Visual Studio Code.

What can I do with Databricks Connect?

Using Databricks Connect, you can write code that uses Spark APIs and run it remotely on Databricks compute instead of in a local Spark session.

For example, when you run the DataFrame command spark.read.format(...).load(...).groupBy(...).agg(...).show() using Databricks Connect, the logical representation of the command is sent to the Spark server running in Databricks for execution on the remote compute.
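A runnable sketch of that flow, assuming the databricks-connect package is installed, connection details are available from a Databricks config profile or environment variables such as DATABRICKS_HOST, and the sample dataset path below exists in your workspace:

    from databricks.connect import DatabricksSession

    # Resolves host, token, and cluster from your Databricks configuration.
    spark = DatabricksSession.builder.getOrCreate()

    # Only the logical plan is sent to Databricks; the aggregation runs on
    # the remote cluster and the resulting rows are returned for display.
    (spark.read.format("csv")
        .option("header", "true")
        .load("/databricks-datasets/airlines/part-00000")  # assumed sample path
        .groupBy("Year")
        .agg({"ArrDelay": "avg"})
        .show())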

Databricks Connect enables you to:

  • Run large-scale Spark code from any Python, R, or Scala application. Anywhere you can import pyspark for Python, library(sparklyr) for R, or import org.apache.spark for Scala, you can now run Spark code directly from your application, without needing to install any IDE plugins or use Spark submission scripts. (See the sketch after this list.)

    note

    Python applications are supported by every version of Databricks Connect covered in this article. R and Scala are supported only in Databricks Connect for Databricks Runtime 13.3 LTS and above.

  • Step through and debug code in your IDE even when working with a remote cluster.

  • Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Scala library dependencies in Databricks Connect, because client sessions are isolated from each other on the cluster.

  • Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs, and DataFrame objects defined in a notebook.
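As referenced in the first bullet, a standalone application needs nothing beyond the client library. This sketch passes the connection details explicitly; the host, token, and cluster ID shown are placeholders:

    from databricks.connect import DatabricksSession

    # Explicit connection settings; all three values below are placeholders.
    spark = (DatabricksSession.builder
             .remote(host="https://<workspace-url>",
                     token="<personal-access-token>",
                     cluster_id="<cluster-id>")
             .getOrCreate())

    print(spark.range(100).count())  # runs on the remote cluster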

Where does code run?

Databricks Connect determines where your code runs and where it is debugged, as shown in the following figure.

Figure: where Databricks Connect code runs and where it is debugged

For running code: All code runs locally, except code involving DataFrame operations, which runs on the cluster in the remote Databricks workspace; run responses are sent back to the local caller.

For debugging code: All code is debugged locally, while all Spark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
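A sketch that makes the split concrete, assuming the same DatabricksSession setup as above: the plain Python function can be stepped through in a local debugger, while the DataFrame work executes remotely.

    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()

    def describe(count):          # plain Python: runs and is debuggable locally
        return f"The remote table has {count} rows"

    df = spark.range(1_000)       # DataFrame operations: the plan runs on the cluster
    print(describe(df.count()))   # the count is computed remotely, printed locally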

Next steps