Where does Databricks write data?

This article details the locations where Databricks writes data for common operations and configurations. Because Databricks provides a suite of tools that span many technologies and interact with cloud resources in a shared-responsibility model, the default locations used to store data vary based on the execution environment, configurations, and libraries.

The information in this article is meant to help you understand default paths for various operations and how configurations might alter these defaults. Data stewards and administrators looking for guidance on configuring and controlling access to data should see Data governance guide.

Where does Unity Catalog store data files?

Unity Catalog relies on administrators to configure relationships between cloud storage and relational objects. The exact location where data resides depends on how administrators have configured these relationships.

Data written or uploaded to objects governed by Unity Catalog is stored in one of the following locations:

  • A managed storage location associated with a metastore, catalog, or schema. Data written or uploaded to managed tables and managed volumes uses managed storage. See Managed storage.

  • An external location configured with storage credentials. Data written or uploaded to external tables and external volumes uses external storage. See Manage external locations and storage credentials.
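
For example, a minimal Python sketch of the two cases might look like the following. The catalog, schema, table, and path names are placeholders, and spark refers to the SparkSession predefined in Databricks notebooks; use an external location URI appropriate to your cloud provider.

    df = spark.range(10)

    # Managed table: Unity Catalog places the files under the managed storage
    # location of the metastore, catalog, or schema.
    df.write.saveAsTable("main.default.managed_example")

    # External table: the files land under the external location path you
    # specify (placeholder URI shown here).
    (df.write
       .option("path", "s3://my-bucket/tables/external_example")
       .saveAsTable("main.default.external_example"))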

Where does Databricks SQL store data backing tables?

When you run a CREATE TABLE statement in Databricks SQL with Unity Catalog, the default behavior is to store data files in a managed storage location governed by Unity Catalog. See Where does Unity Catalog store data files?.
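
As a sketch, you can confirm where a new table landed by inspecting its table detail. The table name below is a placeholder; you could run the same CREATE TABLE statement from the Databricks SQL editor, but it is shown here from a notebook via spark.sql so the example stays in Python.

    # Create a table without specifying a LOCATION; Unity Catalog assigns a
    # path under the relevant managed storage location.
    spark.sql("CREATE TABLE IF NOT EXISTS main.default.sales (id INT, amount DOUBLE)")

    # DESCRIBE DETAIL reports the storage location chosen for the table.
    print(spark.sql("DESCRIBE DETAIL main.default.sales").select("location").first()[0])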

The legacy hive_metastore catalog follows different rules. See Work with Unity Catalog and the legacy Hive metastore.

Where does Delta Live Tables store data files?

Databricks recommends using Unity Catalog when creating Delta Live Tables (DLT) pipelines. Data is stored in directories within the managed storage location associated with the target schema.

You can optionally configure DLT pipelines using the Hive metastore. When configured with the Hive metastore, you can specify a storage location on DBFS or in cloud object storage. If you do not specify a location, a location on the DBFS root is assigned to your pipeline.
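
A minimal Python sketch of a DLT table definition is shown below. The table and source names are placeholders; where the resulting files are written is decided entirely by the pipeline configuration (the Unity Catalog target schema, a configured Hive metastore storage location, or the DBFS root).

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Files for this table land wherever the pipeline configuration dictates.")
    def orders_cleaned():
        # Read a placeholder source table and keep only positive amounts.
        return spark.read.table("main.default.orders_raw").where(F.col("amount") > 0)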

Where does Apache Spark write data files?

Databricks recommends reading and writing data using object names with Unity Catalog. You can also write files to Unity Catalog volumes using the following pattern: /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>. You must have sufficient privileges to upload, create, update, or insert data into Unity Catalog-governed objects.
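
For example, a minimal sketch of both approaches; the catalog, schema, volume, and table names are placeholders, and you need the relevant privileges on the objects (such as write access to the volume).

    df = spark.range(100)

    # Recommended: write to a Unity Catalog table by object name.
    df.write.mode("append").saveAsTable("main.default.events")

    # Alternatively, write data files to a path inside a Unity Catalog volume.
    df.write.mode("overwrite").format("parquet").save("/Volumes/main/default/landing/events_parquet")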

You can optionally use uniform resource identifiers (URIs) to specify paths to data files. URIs vary depending on the cloud provider. You must also have write permissions configured for your current compute to write to cloud object storage.
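
A sketch of a direct URI write follows. The bucket name and scheme are placeholders (for example, s3:// on AWS, abfss:// on Azure, gs:// on Google Cloud), and your compute must already hold credentials that permit writing to that location.

    df = spark.range(100)

    # Write Delta files directly to a cloud object storage URI (placeholder path).
    df.write.mode("overwrite").format("delta").save("s3://my-bucket/path/to/events")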

Databricks uses the Databricks File System (DBFS) to map Apache Spark read and write commands back to cloud object storage. Each Databricks workspace comes with a DBFS root storage location configured in the cloud account allocated for the workspace, which all users can access for reading and writing data. Databricks does not recommend using the DBFS root for storing any production data. See What is the Databricks File System (DBFS)? and Recommendations for working with DBFS root.
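
As a sketch, an unqualified dbfs:/ path resolves to the DBFS root of the workspace. The path below is a placeholder, and the pattern is shown only to illustrate the default behavior, not as a recommendation for production data.

    # Writing to a dbfs:/ path stores the files in the workspace's DBFS root.
    spark.range(10).write.mode("overwrite").format("delta").save("dbfs:/tmp/example_delta")

    # dbutils is predefined in Databricks notebooks and can list what was written.
    display(dbutils.fs.ls("dbfs:/tmp/example_delta"))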

Where does pandas write data files on Databricks?

In Databricks Runtime 14.0 and above, the default current working directory (CWD) for all local Python read and write operations is the directory containing the notebook. If you provide only a filename when saving a data file, pandas saves that data file as a workspace file parallel to your currently running notebook.
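
For example, a minimal sketch assuming Databricks Runtime 14.0 or above; the filename is a placeholder.

    import os
    import pandas as pd

    # The CWD is the workspace directory containing the running notebook.
    print(os.getcwd())

    # A bare filename therefore saves a workspace file next to the notebook.
    pd.DataFrame({"x": [1, 2, 3]}).to_csv("results.csv", index=False)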

Not all Databricks Runtime versions support workspace files, and some Databricks Runtime versions have differing behavior depending on whether you use notebooks or Repos. See What is the default current working directory in Databricks Runtime 14.0 and above?.