Databricks Runtime includes pandas as one of the standard Python packages, allowing you to create and leverage pandas DataFrames in Databricks notebooks and jobs.
Apache Spark includes Arrow-optimized execution of Python logic in the form of pandas function APIs, which let you apply pandas transformations directly to PySpark DataFrames. Apache Spark also supports pandas UDFs, which use similar Arrow optimizations for arbitrary user-defined Python functions.
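As a sketch of a pandas UDF (runnable in a Databricks notebook or any PySpark session; the function name and return type here are illustrative, not from the original text):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# A Series-to-Series pandas UDF: Spark passes column batches to the
# function as pandas Series over Arrow, and the function returns a
# Series of the same length.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Applied to a Spark DataFrame column, for example:
# spark.range(3).select(plus_one("id")).show()
```

Because the data moves between the JVM and Python as Arrow batches rather than row by row, pandas UDFs typically outperform ordinary Python UDFs.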
You can use pandas to store data in many different locations on Databricks. Your ability to store and load data from some locations depends on configurations set by workspace administrators.
Databricks recommends storing production data on cloud object storage. See Connect to Google Cloud Storage.
For quick exploration, and for data that does not contain sensitive information, you can safely save data using either relative paths or the DBFS, as in the following examples:
import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]])

df.to_csv("./relative_path_test.csv")
df.to_csv("/dbfs/dbfs_test.csv")
You can explore files written to the DBFS with the %fs magic command, as in the following example. Note that the /dbfs directory is the root path for these commands.
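A minimal illustration of the %fs magic command, listing the contents of the DBFS root from a notebook cell:

```
%fs ls
```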
When you save to a relative path, the location of your file depends on where you execute your code. If you're using a Databricks notebook, your data file saves to the volume storage attached to the driver of your cluster. Data stored in this location is permanently deleted when the cluster terminates. If you're using Databricks Repos with arbitrary file support enabled, your data saves to the root of your current project. In either case, you can explore the files written using the %sh magic command, which allows simple bash operations relative to your current root directory, as in the following example:
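For instance, listing the files in the current working directory from a notebook cell:

```
%sh ls
```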
For more information on how Databricks stores various files, see Work with files on Databricks.
Databricks provides a number of options to facilitate uploading data to the workspace for exploration. The preferred method to load data with pandas varies depending on how you load your data to the workspace.
If you have small data files stored alongside notebooks on your local machine, you can upload your data and code together with Repos. You can then use relative paths to load data files.
Databricks provides extensive UI-based options for data loading. Most of these options store your data as Delta tables. You can read a Delta table to a Spark DataFrame, and then convert that to a pandas DataFrame.
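The conversion described above can be sketched as follows in a Databricks notebook, where spark is the SparkSession provided by the notebook environment and my_table is a hypothetical table name:

```python
# Read a Delta table into a Spark DataFrame, then convert it to pandas.
# "my_table" is a placeholder for your own table name. Note that
# toPandas() collects all rows to the driver, so reserve it for data
# that fits comfortably in driver memory.
df_spark = spark.read.table("my_table")
df_pandas = df_spark.toPandas()
```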
If you have saved data files using DBFS or relative paths, you can use DBFS or relative paths to reload those data files. The following code provides an example:
import pandas as pd
df = pd.read_csv("./relative_path_test.csv")
df = pd.read_csv("/dbfs/dbfs_test.csv")
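Because the read example above mirrors the earlier save paths, a quick local sanity check is a round trip through a relative path (the column names here are illustrative):

```python
import pandas as pd

# Write a small DataFrame to a relative path, read it back, and compare.
# index=False keeps the CSV free of the pandas index column.
df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]], columns=["col1", "col2"])
df.to_csv("./relative_path_test.csv", index=False)

reloaded = pd.read_csv("./relative_path_test.csv")
print(reloaded.equals(df))  # → True
```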
Databricks recommends storing production data on cloud object storage. See Connect to Amazon S3.
If you’re in a Unity Catalog-enabled workspace, you can access cloud storage with external locations. See Create an external location to connect cloud storage to Databricks.
You can load data directly from S3 using pandas and a fully qualified URL. You need to provide cloud credentials to access cloud data.
# bucket_name, file_path, and the credential variables below are
# placeholders that you define for your own environment.
df = pd.read_csv(
    f"s3://{bucket_name}/{file_path}",
    storage_options={
        "key": aws_access_key_id,
        "secret": aws_secret_access_key,
        "token": aws_session_token,
    },
)