This article answers typical questions that come up when you migrate single-node workloads to Databricks.
I just created a 20 node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
Single-node libraries do not become distributed just because you run them on Databricks. pandas code runs only on the driver, so adding worker nodes does not make it faster. To distribute the work, you need to rewrite your code using PySpark, the Apache Spark Python API.
Alternatively, you can consider installing Koalas, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
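To make the rewrite concrete, here is a minimal sketch of one pandas aggregation next to a PySpark counterpart. The function names are hypothetical and chosen only for illustration. The sketch assumes `pdf` is a pandas DataFrame and `sdf` is a Spark DataFrame.

```python
def mean_by_key_pandas(pdf, key, value):
    """Single-node: runs in the driver's memory, however many workers exist."""
    return pdf.groupby(key)[value].mean()


def mean_by_key_spark(sdf, key, value):
    """Distributed: the same aggregation expressed on a Spark DataFrame."""
    # GroupedData.avg names the result column "avg(<value>)".
    return sdf.groupBy(key).avg(value)
```

In a notebook, a pandas DataFrame can be converted with `spark.createDataFrame(pandas_df)` before calling the Spark version. Only the Spark version spreads the aggregation across the cluster.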
I see there is MLlib and SparkML. What is the difference?
MLlib is the RDD-based API, while SparkML is the DataFrame-based API. We recommend that you use SparkML, because all active development is focused on it. However, people sometimes use the term MLlib more generically to refer to the distributed ML libraries for Spark.
There is an algorithm in sklearn that I love, but SparkML doesn’t support it (such as DBSCAN). What are my alternatives?
Consider spark-sklearn. See the spark_sklearn documentation.
What are my deployment options for SparkML?
- Batch pre-compute
- Structured Streaming. See Structured Streaming.
- Real-time inference with MLeap. See MLeap ML Model Export.
Why aren’t my matplotlib images displaying?
In most cases you must wrap any figures inside the display() function. If you are running Databricks Runtime 6.3 and above, you can configure the cluster to display images inline. See Matplotlib.
How can I install or upgrade pandas or another library?
There are a few options:
- Use the Databricks library UI or API. These install the library on every node in the cluster.
- Use Library utilities.
- Use %sh /databricks/python/bin/pip install <library-name> to install the library. This command installs the library only on the Apache Spark driver, not on the workers, and the library is removed if you restart the cluster. To install a library on all nodes and keep it across cluster restarts using a shell command, use an init script.
Why does %sh pip install <library-name> install the Python 2 version, even if I’m running on a Python 3 cluster?
The default pip is for Python 2, so you must use %sh /databricks/python/bin/pip to use Python 3.
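An alternative sketch, not taken from this article: one way to avoid a mismatch between pip and the interpreter is to invoke pip as a module of the Python that is actually running the code. The helper name `pip_install_command` is hypothetical.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


# Show which Python major version (and therefore which pip) this code would use.
print(sys.version_info.major, pip_install_command("<library-name>"))
```

The returned list can be passed to `subprocess.check_call`. Like the %sh approach, this runs on the driver only, so it does not install anything on the workers.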
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
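As a minimal sketch of that path rewrite, the hypothetical helper below prefixes a DBFS path with /dbfs/ so that driver-local file APIs such as `open()` can read it. It assumes the input is either a `dbfs:/...` URI or an absolute DBFS path. The example paths are made up.

```python
def to_local_path(dbfs_path):
    """Map a DBFS path to the /dbfs/ form used by local file APIs."""
    if dbfs_path.startswith("/dbfs/"):
        return dbfs_path  # already in local form
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]  # drop the URI scheme
    return "/dbfs/" + dbfs_path.lstrip("/")


print(to_local_path("dbfs:/mnt/data/file.csv"))
```

On a cluster, the result could then be used as `open(to_local_path("dbfs:/mnt/data/file.csv"))`, which reads the file on the driver without involving the workers.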
How can I get data into Databricks?
Mounting. See Mount object storage to DBFS.
Data tab. See Data overview.
If you have a data file at a URL, you can use %sh wget <url>/<filename> to import data to a Spark driver node.
The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.
For example, if you download the file https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:
%sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD
To load this data, run:
import pandas as pd
pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')
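The file name in the load path above keeps the URL's query string, because wget by default names the output after the last path component plus any query. The sketch below reproduces that naming so you can predict where a download lands. The helper `wget_saved_path` is hypothetical, the driver directory is taken from the load path above, and wget's escaping of unusual characters is not modelled.

```python
import posixpath
from urllib.parse import urlsplit

# Directory taken from the load path used in this article.
DRIVER_DIR = "/databricks/driver"


def wget_saved_path(url):
    """Approximate where `%sh wget <url>` saves its output on the driver."""
    parts = urlsplit(url)
    # wget falls back to index.html when the URL path has no final component.
    name = posixpath.basename(parts.path) or "index.html"
    if parts.query:
        name += "?" + parts.query  # the query string becomes part of the file name
    return posixpath.join(DRIVER_DIR, name)


print(wget_saved_path(
    "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD"
))
```

Passing `-O <filename>` to wget is a way to choose a simpler file name, at the cost of having to change the path in the load command to match.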