This article answers typical questions that come up when you migrate single-node workloads to Databricks.
I just created a 20-node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
Single-node libraries do not automatically become distributed when you move to Databricks. To spread the work across the cluster, you need to rewrite your code using PySpark, the Apache Spark Python API.
Alternatively, you can use Pandas API on Spark, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
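As a minimal sketch of the second option: the same pandas-style expression can run against a pandas-on-Spark DataFrame, assuming Spark 3.2 or later, where `pyspark.pandas` is available. The function names, column names, and sample values below are hypothetical and only illustrate the idea.

```python
def mean_by_key(df, key, value):
    """Group by `key` and average `value`.

    Uses only the pandas DataFrame API, so `df` may be a pandas DataFrame
    or a pandas-on-Spark DataFrame.
    """
    return df.groupby(key)[value].mean()


def run_on_spark():
    """Build a small pandas-on-Spark DataFrame and aggregate it on the cluster."""
    # Imported here so mean_by_key stays usable where Spark is not installed.
    import pyspark.pandas as ps

    # Hypothetical sample data; in practice this would come from a table or files.
    psdf = ps.DataFrame({"city": ["NYC", "NYC", "SF"], "fare": [10.0, 14.0, 9.0]})
    # to_pandas() collects the (small) result back to the driver.
    return mean_by_key(psdf, "city", "fare").to_pandas()
```

Calling `run_on_spark()` in a notebook attached to a cluster executes the aggregation through Spark, while `mean_by_key` itself needs no changes compared with plain pandas code.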
There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN). How can I use this algorithm and still take advantage of Spark?
- Use joblib-spark, an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
- Use a pandas user-defined function.
- For hyperparameter tuning, use Hyperopt.
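One way to picture the pandas-based option is to run the scikit-learn algorithm independently on each group of a Spark DataFrame. The sketch below uses `applyInPandas` (the grouped-map pandas function API, a close relative of pandas UDFs) rather than a scalar `pandas_udf`. It assumes scikit-learn is installed on the cluster and that the input has exactly the columns `region`, `x`, and `y`. Those column names, the function names, and the DBSCAN parameters are hypothetical.

```python
def cluster_group(pdf):
    """Run scikit-learn DBSCAN on one group's rows (a plain pandas DataFrame)."""
    # Imported inside the function so it resolves on the Spark workers.
    from sklearn.cluster import DBSCAN

    # eps and min_samples are illustrative values, not recommendations.
    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(pdf[["x", "y"]].to_numpy())
    # Append the cluster label (-1 marks noise points) as a new column.
    return pdf.assign(cluster=labels)


def cluster_per_region(sdf):
    """Apply cluster_group to each 'region' group of a Spark DataFrame."""
    return sdf.groupBy("region").applyInPandas(
        cluster_group,
        schema="region string, x double, y double, cluster long",
    )
```

This approach parallelizes across groups, not within a single DBSCAN run, so it helps only when the data splits naturally into independent partitions that each fit in a worker's memory.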
What are my deployment options for Spark ML?
The best deployment option depends on the latency requirement of the application.
- For batch predictions, see Deploy and serve models and Model inference.
- For streaming applications, see Structured Streaming.
- For low-latency model inference, consider MLflow Model Serving or a cloud provider-based solution such as Amazon SageMaker.
How can I install or update pandas or another library?
There are several ways to install or update a library.
- To install or update a library for all users on a cluster, see Cluster libraries.
- To make a Python library or a library version available only for a specific notebook, see Notebook-scoped Python libraries.
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
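The /dbfs/ prefix rule can be captured in a small helper. This is a sketch only: the function name `to_local_path` and the example path are hypothetical, and it assumes DBFS paths are written either as `dbfs:/...` or as an absolute path such as `/mnt/...`.

```python
def to_local_path(dbfs_path):
    """Convert a DBFS path into the /dbfs/... form used by local file APIs."""
    # Drop the optional "dbfs:" scheme, e.g. "dbfs:/mnt/x" -> "/mnt/x".
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    # Leave paths that already carry the /dbfs/ prefix untouched.
    if dbfs_path.startswith("/dbfs/"):
        return dbfs_path
    # Prepend /dbfs/ while avoiding a doubled slash.
    return "/dbfs/" + dbfs_path.lstrip("/")
```

On the driver, the result can then be passed to ordinary Python file functions, for example `open(to_local_path("dbfs:/mnt/data/sales.csv"))`.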
How can I get data into Databricks?
- Mounting. See Mount object storage to DBFS.
- Data tab. See Introduction to importing, reading, and modifying data.
If you have a data file at a URL, you can use %sh wget <url>/<filename> to import data to a Spark driver node.
The cell output prints
Saving to: '<filename>', but the file is actually saved to
For example, suppose you download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:
%sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
To load this data, run:
import pandas as pd
pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')
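Note that the file name in the load path keeps the query string from the URL. As a rough sketch, the path can be derived from the URL as follows, assuming wget names the file after everything following the last slash (as in the example above) and saves it under file:/databricks/driver. The helper name `driver_path_for` is hypothetical.

```python
def driver_path_for(url, driver_dir="file:/databricks/driver"):
    """Guess the driver-local path of a file downloaded with `%sh wget <url>`."""
    # Everything after the last "/" becomes the file name, including any
    # query string such as "?accessType=DOWNLOAD".
    filename = url.rsplit("/", 1)[-1]
    return f"{driver_dir}/{filename}"
```

For the URL in the example, this returns the same path that is passed to `pd.read_csv` above.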