This article answers typical questions that come up when you migrate single node workloads to Databricks.
I just created a 20 node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
Single-node libraries such as pandas do not automatically become distributed when you run them on Databricks. To distribute your workload, you need to rewrite your code using PySpark, the Apache Spark Python API.
Alternatively, you can use Pandas API on Spark, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN). How can I use this algorithm and still take advantage of Spark?
Use joblib-spark, an Apache Spark backend for joblib, to distribute tasks on a Spark cluster.
Use a pandas user-defined function.
For hyperparameter tuning, use Hyperopt.
What are my deployment options for Spark ML?
The best deployment option depends on the latency requirement of the application.
For batch predictions, see Deploy models for inference and prediction.
For streaming applications, see What is Apache Spark Structured Streaming?.
Serverless Real-Time Inference offers 1-click deployment and endpoints that automatically scale based on the volume of scoring requests.
How can I install or update pandas or another library?
There are several ways to install or update a library.
To install or update a library for all users on a cluster, see Cluster libraries.
To make a Python library or a library version available only for a specific notebook, see Notebook-scoped Python libraries.
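For example, a notebook-scoped install can be done with the %pip magic at the top of a notebook cell (the package here is illustrative):

```
%pip install --upgrade pandas
```

Libraries installed this way are available only to the current notebook session and do not affect other users on the cluster.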
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See What is the Databricks File System (DBFS)?.
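As a sketch of the path mapping (the file path is hypothetical), a dbfs:/ path corresponds to a driver-local path under the /dbfs/ mount:

```python
# A DBFS URI and its driver-local equivalent under the /dbfs/ mount.
dbfs_path = "dbfs:/tmp/my_data.csv"  # hypothetical file, for illustration
local_path = "/dbfs/" + dbfs_path[len("dbfs:/"):]
print(local_path)  # /dbfs/tmp/my_data.csv
```

Any tool that reads from the local filesystem on the driver, such as pandas, can then open the file at that /dbfs/ path.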
How can I get data into Databricks?
Mount cloud object storage. See Mounting cloud object storage on Databricks.
Use the Data tab. See Explore and create tables in DBFS.
If you have a data file at a URL, you can use %sh wget <url>/<filename> to download the data to the Spark driver node. The cell output prints Saving to: '<filename>', but the file is actually saved to /databricks/driver/<filename>.
For example, if you download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:
%sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
To load this data into pandas, run:
import pandas as pd
pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')