Migrate Single Node Workloads to Databricks

This article answers typical questions that come up when you migrate single-node workloads to Databricks.

I just created a 20-node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
If you are working with single-node libraries such as pandas, they do not inherently become distributed when you switch to Databricks; the code still runs only on the driver. You need to rewrite your code using PySpark, and consider installing Koalas, which provides a pandas-like API on top of Spark DataFrames.
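
A minimal sketch of the difference (assuming Koalas is installed on the cluster; spark is the SparkSession that Databricks notebooks provide):

    import pandas as pd
    import databricks.koalas as ks

    # A pandas DataFrame lives only on the driver, so operations on it are not distributed.
    pdf = pd.DataFrame({"x": range(10)})

    # Converting it to a Spark DataFrame or a Koalas DataFrame distributes
    # subsequent work across the cluster's worker nodes.
    sdf = spark.createDataFrame(pdf)   # PySpark DataFrame
    kdf = ks.from_pandas(pdf)          # Koalas: pandas-like API backed by Spark
    print(kdf.x.mean())
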
I see there are MLlib and SparkML. What is the difference?
MLlib is the RDD-based API, while SparkML is the DataFrame-based API. We recommend SparkML, because all active development is focused on it. However, the term MLlib is sometimes used more generically to refer to all of Spark's distributed ML libraries.
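
The split is visible in the package names. An illustrative comparison (imports only):

    # RDD-based API (the original MLlib)
    from pyspark.mllib.clustering import KMeans as RDDKMeans

    # DataFrame-based API (SparkML), where active development happens
    from pyspark.ml.clustering import KMeans as DataFrameKMeans
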
There is an algorithm in sklearn that I love, but SparkML doesn’t support it (such as DBSCAN). What are my alternatives?
Use spark-sklearn. See the spark_sklearn documentation.
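
As a rough sketch of what spark-sklearn provides, its GridSearchCV distributes a scikit-learn hyperparameter search across the cluster (this assumes the spark_sklearn package is installed and that sc is the notebook's SparkContext; it parallelizes model selection rather than the algorithm itself):

    from sklearn import datasets, svm
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

    # Each parameter combination is fit as a separate Spark task.
    clf = GridSearchCV(sc, svm.SVC(gamma="auto"), param_grid)
    clf.fit(iris.data, iris.target)
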
What are my deployment options for SparkML?
Why aren’t my matplotlib images displaying?
You must wrap any figures inside the display() function. See Matplotlib and ggplot in Python Notebooks.
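
For example (a minimal sketch; display() is the built-in Databricks notebook function):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])
    display(fig)   # renders the figure in the cell output
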
How can I install or upgrade my pandas or <library-name> version?

There are a few options:

  • Use the Databricks library UI or API. Either approach installs the library on every node in the cluster.

  • Use Library utilities (see the example after this list).

  • Run %sh /databricks/python/bin/pip install <library-name> to install the library.

    This command installs the library on the driver only, not on the workers, and the library is removed when the cluster restarts.
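
For the Library utilities option, a minimal sketch (dbutils.library is available on Databricks runtimes that include library utilities; the pandas version shown is only an example):

    # Install a specific library version scoped to this notebook session.
    dbutils.library.installPyPI("pandas", version="0.25.3")
    dbutils.library.restartPython()   # restart Python so the new version is picked up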

Why does %sh pip install <library-name> install the Python 2 version, even if I’m running on a Python 3 cluster?
The default pip belongs to Python 2, so you must use %sh /databricks/python/bin/pip to install into the cluster's Python 3 environment.
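
For example, to install a library into the cluster's Python 3 environment from a notebook cell:

    %sh /databricks/python/bin/pip install <library-name>
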
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
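
For example, a file stored at dbfs:/tmp/example.csv (a hypothetical path) can be read on the driver through the local file API:

    import pandas as pd

    # /dbfs/ exposes DBFS as a local filesystem path on the driver node.
    df = pd.read_csv("/dbfs/tmp/example.csv")
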
How can I get data into Databricks?
  • Mounting. See Mount object storage to DBFS.

  • Data tab. See Data Overview.

  • %sh wget

    If you have a data file at a URL, you can use %sh wget <url>/<filename> to download the data to the Spark driver node.

    Note

    The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.

    For example, you can download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:

    %sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
    

    To load this data, run:

    import pandas as pd

    # wget saved the file to the driver's local disk as rows.csv?accessType=DOWNLOAD
    pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')