Migrate single node workloads to Databricks

This article answers typical questions that come up when you migrate single node workloads to Databricks.

I just created a 20-node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?

Single-node libraries such as pandas do not automatically become distributed when you switch to Databricks; the code still runs only on the driver node, no matter how many workers the cluster has. To take advantage of the cluster, you need to rewrite your code using PySpark, the Apache Spark Python API.
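
For example, a group-and-count can be expressed with the PySpark DataFrame API so the work is spread across the cluster's executors instead of running only on the driver. A minimal sketch, where the file path and column name are placeholders for your own data and spark is the SparkSession that Databricks notebooks provide:

    # Read a CSV into a Spark DataFrame; the path and column name are placeholders.
    df = spark.read.csv("dbfs:/path/to/data.csv", header=True, inferSchema=True)

    # This aggregation runs in parallel on the executors.
    df.groupBy("some_column").count().show()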

Alternatively, you can use Koalas, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
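
A minimal Koalas sketch, again with a placeholder path and column name; the syntax mirrors pandas while the execution is backed by Spark:

    import databricks.koalas as ks

    # pandas-style API on top of Spark DataFrames
    kdf = ks.read_csv("dbfs:/path/to/data.csv")
    kdf.groupby("some_column").count()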

There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN). How can I use this algorithm and still take advantage of Spark?
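
One common pattern, when the data splits into independent groups, is to run the single-node algorithm on each group in parallel with applyInPandas (available in Spark 3.0 and above). The sketch below is only an illustration: the DataFrame df, its columns device_id, x, and y, and the DBSCAN parameters are all hypothetical.

    import pandas as pd
    from sklearn.cluster import DBSCAN

    def cluster_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Runs single-node DBSCAN on one group's rows at a time.
        pdf["cluster"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(pdf[["x", "y"]])
        return pdf

    # df is an existing Spark DataFrame with columns device_id, x, y (hypothetical).
    # Each device_id group is clustered independently on an executor.
    result = df.groupBy("device_id").applyInPandas(
        cluster_group, schema="device_id string, x double, y double, cluster long"
    )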

What are my deployment options for Spark ML?

The best deployment option depends on the latency requirement of the application.
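
For example, if latency requirements are relaxed, a scheduled batch-scoring job is often sufficient. A minimal sketch, with hypothetical model path and table names:

    from pyspark.ml import PipelineModel

    # Load a previously saved Spark ML pipeline and score new data in batch.
    model = PipelineModel.load("dbfs:/models/my_pipeline")
    predictions = model.transform(spark.read.table("new_records"))
    predictions.write.mode("overwrite").saveAsTable("scored_records")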

How can I install or update pandas or another library?

There are several ways to install or update a library: you can install it on the cluster, so that it is available to every notebook attached to that cluster, or install it for your own notebook session only (notebook-scoped).
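
For example, you can update pandas for your notebook session with the %pip magic command:

    %pip install -U pandas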

How can I view data on DBFS with just the driver?

Add /dbfs/ to the beginning of the file path. See Local file APIs.
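
For example, with the /dbfs/ prefix the driver can use ordinary Python file APIs and pandas to inspect DBFS files (the paths below are placeholders):

    import os
    import pandas as pd

    # /dbfs/ exposes DBFS through the driver's local filesystem.
    print(os.listdir("/dbfs/tmp"))
    pdf = pd.read_csv("/dbfs/tmp/my_file.csv")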

How can I get data into Databricks?

  • Mounting. See Mount object storage to DBFS.

  • Data tab. See Data overview.

  • %sh wget

    If you have a data file at a URL, you can use %sh wget <url>/<filename> to download the data to the Spark driver node.

    Note

    The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.

    For example, if you download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:

    %sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
    

    To load this data, run:

    import pandas as pd

    # wget saved the file on the driver and kept the query string in the filename (see the note above).
    pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')