Work with files on Databricks

Databricks provides multiple utilities and APIs for interacting with files in the following locations:

  • Unity Catalog volumes.

  • Workspace files.

  • Cloud object storage.

  • DBFS mounts and DBFS root.

  • Ephemeral storage attached to the driver node of the cluster.

This article provides examples for interacting with files in these locations for the following tools:

  • Apache Spark.

  • Spark SQL and Databricks SQL.

  • Databricks file system utilities (dbutils.fs or %fs).

  • Databricks CLI.

  • Databricks REST API.

  • Bash shell commands (%sh).

  • Notebook-scoped library installs using %pip.

  • Pandas.

  • OSS Python file management and processing utilities.

Important

File operations that require FUSE access to data cannot directly access cloud object storage using URIs. Databricks recommends using Unity Catalog volumes to configure access to these locations for FUSE.

Scala does not support FUSE for Unity Catalog volumes or workspace files on compute configured with single user access mode or clusters without Unity Catalog. Scala supports FUSE for Unity Catalog volumes and workspace files on compute configured with Unity Catalog and shared access mode.

Do I need to provide a URI scheme to access data?

Data access paths in Databricks follow one of the following standards:

  • URI-style paths include a URI scheme. For Databricks-native data access solutions, URI schemes are optional for most use cases. When you directly access data in cloud object storage, you must provide the correct URI scheme for the storage type.

    URI paths diagram
  • POSIX-style paths provide data access relative to the driver root (/). POSIX-style paths never require a scheme. You can use Unity Catalog volumes or DBFS mounts to provide POSIX-style access to data in cloud object storage. Many ML frameworks and other OSS Python modules require FUSE and can only use POSIX-style paths.

    POSIX paths diagram
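The distinction can be sketched in Python: URI-style paths carry a scheme that a parser can extract, while POSIX-style paths do not. This is an illustrative sketch only; the bucket, catalog, and mount names below are placeholders.

```python
# Sketch of the two path styles; all paths below are placeholders.
from urllib.parse import urlparse

paths = [
    "s3://my-bucket/path/file.json",                      # URI-style: scheme required for cloud storage
    "/Volumes/my_catalog/my_schema/my_volume/file.json",  # POSIX-style: no scheme, FUSE-compatible
    "/dbfs/mnt/my_mount/file.json",                       # POSIX-style access via a DBFS mount
]

for path in paths:
    scheme = urlparse(path).scheme
    style = f"URI-style (scheme {scheme!r})" if scheme else "POSIX-style (no scheme)"
    print(f"{path} -> {style}")
```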

Work with files in Unity Catalog volumes

Databricks recommends using Unity Catalog volumes to configure access to non-tabular data files stored in cloud object storage. See Create and work with volumes.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("/Volumes/my_catalog/my_schema/my_volume/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM csv.`/Volumes/my_catalog/my_schema/my_volume/data.csv`;`` ``LIST '/Volumes/my_catalog/my_schema/my_volume/';`` |
| Databricks file system utilities | `dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/")` `%fs ls /Volumes/my_catalog/my_schema/my_volume/` |
| Databricks CLI | `databricks fs cp /path/to/local/file dbfs:/Volumes/my_catalog/my_schema/my_volume/` |
| Databricks REST API | `POST https://<databricks-instance>/api/2.1/jobs/create` with body `{"name": "A multitask job", "tasks": [{..., "libraries": [{"jar": "/Volumes/dev/environment/libraries/logging/Logging.jar"}], ...}, ...]}` |
| Bash shell commands | `%sh curl http://<address>/text.zip -o /Volumes/my_catalog/my_schema/my_volume/tmp/text.zip` |
| Library installs | `%pip install /Volumes/my_catalog/my_schema/my_volume/my_library.whl` |
| Pandas | `df = pd.read_csv('/Volumes/my_catalog/my_schema/my_volume/data.csv')` |
| OSS Python | `os.listdir('/Volumes/my_catalog/my_schema/my_volume/path/to/directory')` |

Note

The dbfs:/ scheme is required when working with the Databricks CLI.

Work with workspace files

You can use workspace files to store and access data and other files saved alongside notebooks and other workspace assets. Because workspace files have size restrictions, Databricks recommends storing only small data files here, primarily for development and testing.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("file:/Workspace/Users/<user-folder>/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`file:/Workspace/Users/<user-folder>/file.json`;`` |
| Databricks file system utilities | `dbutils.fs.ls("file:/Workspace/Users/<user-folder>/")` `%fs ls file:/Workspace/Users/<user-folder>/` |
| Databricks CLI | `databricks workspace list` |
| Databricks REST API | `POST https://<databricks-instance>/api/2.0/workspace/delete` with body `{"path": "/Workspace/Shared/code.py", "recursive": "false"}` |
| Bash shell commands | `%sh curl http://<address>/text.zip -o /Workspace/Users/<user-folder>/text.zip` |
| Library installs | `%pip install /Workspace/Users/<user-folder>/my_library.whl` |
| Pandas | `df = pd.read_csv('/Workspace/Users/<user-folder>/data.csv')` |
| OSS Python | `os.listdir('/Workspace/Users/<user-folder>/path/to/directory')` |

Note

The file:/ scheme is required when working with Databricks Utilities, Apache Spark, or SQL.

You cannot use Apache Spark to read or write to workspace files on clusters configured with shared access mode.

Work with files in cloud object storage

Databricks recommends using Unity Catalog volumes to configure secure access to files in cloud object storage. If you choose to directly access data in cloud object storage using URIs, you must configure permissions. See Manage external locations, external tables, and external volumes.

The following examples use URIs to access data in cloud object storage:

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("s3://<bucket>/path/file.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`s3://<bucket>/path/file.json`;`` ``LIST 's3://<bucket>/path';`` |
| Databricks file system utilities | `dbutils.fs.ls("s3://<bucket>/path/")` `%fs ls s3://<bucket>/path/` |
| Databricks CLI | Not supported |
| Databricks REST API | Not supported |
| Bash shell commands | Not supported |
| Library installs | `%pip install s3://bucket-name/path/to/library.whl` |
| Pandas | Not supported |
| OSS Python | Not supported |

Work with files in DBFS mounts and DBFS root

DBFS mounts are not securable using Unity Catalog and are no longer recommended by Databricks. Data stored in the DBFS root is accessible by all users in the workspace. Databricks recommends against storing any sensitive or production code or data in the DBFS root. See What is the Databricks File System (DBFS)?.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("/mnt/path/to/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`/mnt/path/to/data.json`;`` |
| Databricks file system utilities | `dbutils.fs.ls("/mnt/path")` `%fs ls /mnt/path` |
| Databricks CLI | `databricks fs cp dbfs:/mnt/path/to/remote/file /path/to/local/file` |
| Databricks REST API | `POST https://<host>/api/2.0/dbfs/delete --data '{ "path": "/tmp/HelloWorld.txt" }'` |
| Bash shell commands | `%sh curl http://<address>/text.zip > /dbfs/mnt/tmp/text.zip` |
| Library installs | `%pip install /dbfs/mnt/path/to/my_library.whl` |
| Pandas | `df = pd.read_csv('/dbfs/mnt/path/to/data.csv')` |
| OSS Python | `os.listdir('/dbfs/mnt/path/to/directory')` |

Note

The dbfs:/ scheme is required when working with the Databricks CLI.

Work with files in ephemeral storage attached to the driver node

The ephemeral storage attached to the driver node is block storage with native POSIX-based path access. Any data stored in this location disappears when a cluster terminates or restarts.

| Tool | Example |
| --- | --- |
| Apache Spark | Not supported |
| Spark SQL and Databricks SQL | Not supported |
| Databricks file system utilities | `dbutils.fs.ls("file:/path")` `%fs ls file:/path` |
| Databricks CLI | Not supported |
| Databricks REST API | Not supported |
| Bash shell commands | `%sh curl http://<address>/text.zip > /tmp/text.zip` |
| Library installs | Not supported |
| Pandas | `df = pd.read_csv('/path/to/data.csv')` |
| OSS Python | `os.listdir('/path/to/directory')` |

Note

The file:/ scheme is required when working with Databricks Utilities.

Move data from ephemeral storage to volumes

You might want to use Apache Spark to access data downloaded or saved to ephemeral storage. Because ephemeral storage is attached to the driver and Spark is a distributed processing engine, not all operations can directly access data there. If you need to move data from the driver filesystem to Unity Catalog volumes, you can copy files using magic commands or the Databricks utilities, as in the following examples:

dbutils.fs.cp("file:/<path>", "/Volumes/<catalog>/<schema>/<volume>/<path>")
%sh cp /<path> /Volumes/<catalog>/<schema>/<volume>/<path>
%fs cp file:/<path> /Volumes/<catalog>/<schema>/<volume>/<path>

Where do deleted files go?

Deleting a workspace file sends it to the trash. You can either recover or permanently delete files from the trash using the UI.

See Delete an object.

Local file API limitations

The following lists the limitations on local file API usage with cloud object storage in Databricks Runtime.

  • Does not support Amazon S3 mounts with client-side encryption enabled.

  • Databricks has limited support for workspace file operations on serverless compute and on clusters with Unity Catalog access modes (Assigned and Shared). Databricks is working on enabling the currently unsupported combinations. The following table summarizes support by language, access mode, and whether the code runs on the driver or in a UDF.

| Language | Cluster type | Driver or UDF? | Supported? |
| --- | --- | --- | --- |
| Python | Assigned | Driver | Yes |
| Python | Assigned | UDF | Yes |
| Python | Shared | Driver | Yes |
| Python | Shared | UDF | No |
| Scala | Assigned | Driver | No |
| Scala | Assigned | UDF | No |
| Scala | Shared | Driver | Yes |
| Scala | Shared | UDF | No |

  • If your workflow uses source code located in a remote Git repository, you cannot write to the current directory or write using a relative path. Write data to the other location options outlined in Work with files in cloud object storage.

  • No direct append operations.

    Because the underlying object storage does not support appends, Databricks would have to download the data, perform the append, and re-upload the data to support the command. This works for small files but quickly becomes an issue as file sizes increase.
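    One workaround for appends, sketched below under the assumption that the file is small enough to stage locally, is to copy the file to ephemeral driver storage, append there, and copy the whole file back. The helper name and paths are hypothetical examples, not a Databricks API.

    ```python
    # Illustrative append workaround: stage the file on local disk (which
    # supports append), then re-upload the whole file. The helper name and
    # paths are hypothetical, not part of any Databricks API.
    import shutil

    def append_via_local_copy(remote_path: str, local_path: str, text: str) -> None:
        """Download, append on local block storage, and re-upload the full file."""
        shutil.copyfile(remote_path, local_path)   # download current contents
        with open(local_path, "a") as f:           # append locally (supported)
            f.write(text)
        shutil.copyfile(local_path, remote_path)   # write the whole file back

    # Example call with placeholder paths:
    # append_via_local_copy(
    #     "/Volumes/my_catalog/my_schema/my_volume/log.txt",
    #     "/local_disk0/tmp/log.txt",
    #     "new entry\n")
    ```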

  • No non-sequential (random) writes, such as writing Zip and Excel files.

    For direct-append or random-write workloads, perform the operations on a local disk first and then copy the results to Unity Catalog volumes. For example:

    import xlsxwriter
    from shutil import copyfile
    
    workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "Key")
    worksheet.write(0, 1, "Value")
    workbook.close()
    
    copyfile('/local_disk0/tmp/excel.xlsx', '/Volumes/my_catalog/my_schema/my_volume/excel.xlsx')
    
  • No sparse files. To copy sparse files, use cp --sparse=never:

    $ cp sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
    error writing '/dbfs/sparse.file': Operation not supported
    $ cp --sparse=never sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
    
  • Executors cannot write to workspace files.

  • Workspace file size is limited to 200 MB. Operations that attempt to download or create files larger than this limit fail.

  • You cannot use git commands when you save to workspace files. The creation of .git directories is not allowed in workspace files.

  • No symlinks.

Enable workspace files

Databricks workspace files are the set of files in a workspace that are not notebooks. Workspace files are enabled by default.

To check whether support for non-notebook files is enabled in your Databricks workspace, call the /api/2.0/workspace-conf REST API from a notebook or other environment with access to your workspace, and read the value of the enableWorkspaceFilesystem key. If it is set to true, non-notebook files are already enabled for your workspace.

The following example demonstrates how you can call this API from a notebook to check whether workspace files are disabled and, if so, re-enable them. To disable workspace files, set enableWorkspaceFilesystem to false with the /api/2.0/workspace-conf API.
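As a minimal sketch of this flow (not the linked notebook itself), the check-and-enable calls can be built with Python's standard library. The host and token values are placeholders you must replace with your workspace URL and a personal access token.

```python
# Hedged sketch of the workspace-conf flow; <databricks-instance> and <token>
# are placeholders. A GET reads enableWorkspaceFilesystem; a PATCH sets it.
import json
import urllib.request

def workspace_conf_request(host, token, method="GET", body=None):
    """Build an authenticated request against /api/2.0/workspace-conf."""
    url = f"{host}/api/2.0/workspace-conf"
    if method == "GET":
        url += "?keys=enableWorkspaceFilesystem"
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Check the current value (send with urllib.request.urlopen(get_req); the
# response body is JSON such as {"enableWorkspaceFilesystem": "true"}):
get_req = workspace_conf_request("https://<databricks-instance>", "<token>")

# Re-enable workspace files if the check returned "false":
patch_req = workspace_conf_request(
    "https://<databricks-instance>", "<token>",
    method="PATCH", body={"enableWorkspaceFilesystem": "true"})
```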

Example: Notebook for re-enabling Databricks workspace file support
