Can you use pandas on Databricks?

Databricks Runtime includes pandas as one of the standard Python packages, allowing you to create and leverage pandas DataFrames in Databricks notebooks and jobs.

In Databricks Runtime 10.0 and above, Pandas API on Spark provides familiar pandas commands on top of PySpark DataFrames. You can also convert DataFrames between pandas and PySpark.

Apache Spark includes Arrow-optimized execution of Python logic in the form of pandas function APIs, which allow users to apply pandas transformations directly to PySpark DataFrames. Apache Spark also supports pandas UDFs, which use similar Arrow optimizations for arbitrary user-defined functions written in Python.

Where does pandas store data on Databricks?

You can use pandas to store data in many different locations on Databricks. Your ability to store and load data from some locations depends on configurations set by workspace administrators.

Note

Databricks recommends storing production data on cloud object storage. See Working with data in Amazon S3.

If you’re in a Unity Catalog-enabled workspace, you can access cloud storage with external locations. See Manage external locations and storage credentials.

For quick exploration, and for data that doesn't contain sensitive information, you can safely save data using either relative paths or DBFS, as in the following examples:

import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]])

df.to_csv("./relative_path_test.csv")
df.to_csv("/dbfs/dbfs_test.csv")

You can explore files written to DBFS with the %fs magic command, as in the following example. Note that these commands resolve paths relative to the DBFS root, which is the same location mounted at /dbfs on the driver.

%fs ls

When you save to a relative path, the location of your file depends on where you execute your code. If you're using a Databricks notebook, your data file is saved to the volume storage attached to the driver of your cluster. Data stored in this location is permanently deleted when the cluster terminates. If you're using Databricks Repos with arbitrary file support enabled, your data is saved to the root of your current repo. In either case, you can explore the files written using the %sh magic command, which allows simple bash operations relative to your current root directory, as in the following example:

%sh ls

For more information on how Databricks stores various files, see How to work with files on Databricks.

How do you load data with pandas on Databricks?

Databricks provides a number of options to facilitate uploading data to the workspace for exploration. The preferred method to load data with pandas varies depending on how you load your data to the workspace.

If you have small data files stored alongside notebooks on your local machine, you can upload your data and code together with Repos. You can then use relative paths to load data files.

Databricks provides extensive UI-based options for data loading. Most of these options store your data as Delta tables. You can read a Delta table to a Spark DataFrame, and then convert that to a pandas DataFrame.

If you saved data files using DBFS or relative paths, you can reload them from those same paths. The following code provides an example:

import pandas as pd

df = pd.read_csv("./relative_path_test.csv")
df = pd.read_csv("/dbfs/dbfs_test.csv")


You can load data directly from S3 using pandas and a fully qualified URL. You must provide cloud credentials to access cloud data; pandas forwards the storage_options shown below to the s3fs library, which must be available in your environment.

import pandas as pd

# bucket_name, file_path, and the credential variables are placeholders
# that you must define before running this code.
df = pd.read_csv(
  f"s3://{bucket_name}/{file_path}",
  storage_options={
    "key": aws_access_key_id,
    "secret": aws_secret_access_key,
    "token": aws_session_token
  }
)