Migrate single node workloads to Databricks

This article answers typical questions that come up when you migrate single node workloads to Databricks.

I just created a 20-node Spark cluster, but my pandas code doesn’t run any faster. What is going wrong?

Single-node libraries such as pandas do not automatically become distributed when you switch to Databricks; that code still runs only on the driver. To take advantage of the cluster, rewrite your code using PySpark, the Apache Spark Python API.
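For example, here is a minimal sketch of the same aggregation written with pandas and with PySpark (the file path and column names are hypothetical, and spark is the SparkSession provided in Databricks notebooks):

    import pandas as pd

    # Single-node pandas: runs entirely on the driver, no matter how many workers you have
    pdf = pd.read_csv("/dbfs/data/sales.csv")
    result_pd = pdf.groupby("region")["amount"].sum()

    # Distributed PySpark: the same aggregation is executed across the cluster
    sdf = spark.read.csv("dbfs:/data/sales.csv", header=True, inferSchema=True)
    result_spark = sdf.groupBy("region").sum("amount")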

Alternatively, you can consider installing Koalas, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
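A minimal sketch of the same aggregation with Koalas (again using the hypothetical file above):

    # Koalas exposes the pandas DataFrame API on top of Spark DataFrames
    import databricks.koalas as ks

    kdf = ks.read_csv("dbfs:/data/sales.csv")        # distributed read
    totals = kdf.groupby("region")["amount"].sum()   # pandas-style code, executed by Spark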

I see there are both MLlib and Spark ML. What is the difference?

MLlib is the RDD-based API (pyspark.mllib), while Spark ML is the DataFrame-based API (pyspark.ml). We recommend Spark ML, because all active development is focused on the DataFrame-based API. However, the term MLlib is also sometimes used more generically to refer to all of Spark’s distributed ML libraries.
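As a small sketch of the DataFrame-based API (training_df and its columns are hypothetical):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # training_df is a hypothetical Spark DataFrame with columns f1, f2, and label
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(training_df)

    # pyspark.ml estimators work directly on DataFrames
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)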

There is an algorithm in scikit-learn that I love (for example, DBSCAN), but Spark ML doesn’t support it. What are my alternatives?

Consider spark-sklearn, which integrates scikit-learn with Spark. See the spark_sklearn documentation.

What are my deployment options for SparkML?

Why aren’t my matplotlib images displaying?

In most cases, you must pass the figure to the display() function. If you are running Databricks Runtime 6.3 or above, you can configure the cluster to display figures inline. See Matplotlib.
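For example, a minimal sketch (display() is the helper available in Databricks notebooks):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

    # Pass the figure object to display() so the notebook renders it
    display(fig)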

How can I install or upgrade pandas or another library?

There are a few options:

  • Use the Databricks library UI or API. Both install the library on every node in the cluster.

  • Use Library utilities (a minimal example follows this list).

  • Run %sh /databricks/python/bin/pip install <library-name> to install the library.

    This command installs the library only on the Apache Spark driver, not on the workers, and the library is removed when the cluster restarts. To install a library on all nodes and have it reinstalled on cluster restart using a shell command, use an init script.
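A minimal sketch of the Library utilities option (the package name and version are illustrative):

    # Install a specific version of a package into the notebook's Python environment
    dbutils.library.installPyPI("pandas", version="1.0.5")

    # Restart the Python process so the newly installed version is picked up
    dbutils.library.restartPython()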

Why does %sh pip install <library-name> install the Python 2 version, even if I’m running on a Python 3 cluster?

The default pip points to Python 2, so use %sh /databricks/python/bin/pip install <library-name> to install into the cluster’s Python 3 environment.

How can I view data on DBFS with just the driver?

Add /dbfs/ to the beginning of the file path. See Local file APIs.
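For example (the path is hypothetical), you can read a file stored in DBFS with ordinary Python file APIs on the driver:

    # The /dbfs/ prefix exposes DBFS through the driver's local file system
    with open("/dbfs/mnt/data/sample.json") as f:
        print(f.read())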

How can I get data into Databricks?

  • Mounting. See Mount object storage to DBFS.

  • Data tab. See Data overview.

  • %sh wget

    If you have a data file at a URL, you can use %sh wget <url>/<filename> to download the data to the Spark driver node.

    Note

    The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.

    For example, if you download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:

    %sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
    

    To load this data into pandas on the driver, run:

    import pandas as pd

    pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')
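    If you also want the data to be readable by Spark on all nodes, one option (a sketch; the DBFS target path is illustrative) is to copy the file from the driver’s local disk into DBFS:

    # Copy the downloaded file from the driver's local file system into DBFS
    dbutils.fs.cp("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", "dbfs:/tmp/rows.csv")

    # Any node in the cluster can now read it as a Spark DataFrame
    spark_df = spark.read.csv("dbfs:/tmp/rows.csv", header=True, inferSchema=True)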