This article describes how to migrate from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.0 and above for Python. Databricks Connect enables you to connect popular IDEs, notebook servers, and custom applications to Databricks clusters. See What is Databricks Connect?. For the Scala version of this article, see Migrate to Databricks Connect for Scala.
Before you begin to use Databricks Connect, you must set up the Databricks Connect client.
Follow these guidelines to migrate your existing Python code project or coding environment from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.0 and above.
Install the correct version of Python as listed in the installation requirements to match your Databricks cluster, if it is not already installed locally.
Upgrade your Python virtual environment to use the correct version of Python to match your cluster, if needed. For instructions, see your virtual environment provider’s documentation.
With your virtual environment activated, uninstall PySpark from your virtual environment:
pip3 uninstall pyspark
With your virtual environment still activated, uninstall Databricks Connect for Databricks Runtime 12.2 LTS and below:
pip3 uninstall databricks-connect
With your virtual environment still activated, install Databricks Connect for Databricks Runtime 13.0 and above:
pip3 install --upgrade "databricks-connect==14.0.*" # Or X.Y.* to match your cluster version.
Databricks recommends that you append the "dot-asterisk" notation to specify databricks-connect==X.Y.* instead of databricks-connect=X.Y, to make sure that the most recent patch release for that version is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Update your Python code to initialize the spark variable (which represents an instantiation of the DatabricksSession class, similar to SparkSession in PySpark). For code examples, see Install Databricks Connect for Python.
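As a minimal sketch of this change, assuming you have already configured authentication locally (for example, through a Databricks configuration profile or environment variables); the table name below is illustrative:

```python
# Before (Databricks Connect for Databricks Runtime 12.2 LTS and below):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()

# After (Databricks Connect for Databricks Runtime 13.0 and above):
from databricks.connect import DatabricksSession

# Picks up connection and authentication details from your local
# Databricks configuration profile or environment variables.
spark = DatabricksSession.builder.getOrCreate()

# DataFrame operations work as before; "samples.nyctaxi.trips" is an
# example table name and may not exist in your workspace.
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```

Because DatabricksSession mirrors the SparkSession API surface for SQL and DataFrame operations, most downstream code that only uses the spark variable needs no further changes.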
Migrate your RDD APIs to use DataFrame APIs, and migrate your SparkContext to use alternatives.
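One hedged illustration of such a migration, assuming a simple parallelize-and-transform workload (the data here is made up):

```python
from databricks.connect import DatabricksSession
from pyspark.sql import functions as F

spark = DatabricksSession.builder.getOrCreate()

# Before: RDD APIs via SparkContext, which are not supported in
# Databricks Connect for Databricks Runtime 13.0 and above:
# rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
# doubled = rdd.map(lambda x: (x[0], x[1] * 2)).collect()

# After: the equivalent logic expressed with DataFrame APIs:
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
doubled = df.withColumn("value", F.col("value") * 2).collect()
```

The DataFrame version also lets the server optimize the query plan, which the opaque lambda in the RDD version prevented.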
On the client, you can set Hadoop configurations using the spark.conf.set API, which applies to SQL and DataFrame operations. Hadoop configurations set on the sparkContext must be set in the cluster configuration or using a notebook. This is because configurations set on sparkContext are not tied to user sessions but apply to the entire cluster.
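For example, a session-scoped configuration can be set from the client as sketched below; the property names are illustrative, and the sparkContext-based call is shown only as the pattern to avoid:

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Works from the client: session-scoped configuration that applies to
# this user's SQL and DataFrame operations.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Does not work from the client: configurations set through the
# sparkContext apply to the entire cluster, so set them in the
# cluster's Spark configuration or from a notebook instead.
# spark.sparkContext._jsc.hadoopConfiguration().set("fs.example.key", "...")
```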