How to work with files on Databricks

You can work with files on DBFS, on the local driver node of the cluster, in cloud object storage, in external locations, and in Databricks Repos. You can integrate Databricks with other systems, but many of them do not provide direct file access to Databricks.

This article focuses on understanding the differences between interacting with files stored in the ephemeral volume storage attached to a running cluster and files stored in the DBFS root. You can directly apply the concepts shown for the DBFS root to mounted cloud object storage, because the /mnt directory is under the DBFS root. Most examples can also be applied to direct interactions with cloud object storage and external locations if you have the required privileges.

What is the root path for Databricks?

The root path on Databricks depends on the code executed.

The DBFS root is the root path for Spark and DBFS commands. These include:

  • Spark SQL

  • DataFrames

  • dbutils.fs

  • %fs

The block storage volume attached to the driver is the root path for code executed locally. This includes:

  • %sh

  • Most Python code (not PySpark)

  • Most Scala code (not Spark)

Note

If you are working in Databricks Repos, the root path for %sh is your current repo directory. For more details, see Create and edit files and directories programmatically.

Access files on the DBFS root

When using commands that default to the DBFS root, you can use a relative path or include the dbfs:/ prefix.

SELECT * FROM parquet.`<path>`;
SELECT * FROM parquet.`dbfs:/<path>`;
df = spark.read.load("<path>")
df.write.save("<path>")
dbutils.fs.<command>("<path>")
%fs <command> /<path>

When using commands that default to the driver volume, you must use /dbfs before the path.

%sh <command> /dbfs/<path>/
import os
os.<command>('/dbfs/<path>')

Access files on the driver filesystem

When using commands that default to the driver storage, you can provide a relative or absolute path.

%sh <command> /<path>
import os
os.<command>('/<path>')

When using commands that default to the DBFS root, you must use file:/.

dbutils.fs.<command>("file:/<path>")
%fs <command> file:/<path>

Because these files live on the attached driver volumes and Spark is a distributed processing engine, not all operations can directly access data here. If you need to move data from the driver filesystem to DBFS, you can copy files using magic commands or the Databricks utilities.

dbutils.fs.cp("file:/<path>", "dbfs:/<path>")
%sh cp /<path> /dbfs/<path>
%fs cp file:/<path> /<path>
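The three commands above are interchangeable ways to perform the same copy. As a minimal sketch of the pattern (write on the driver, then copy to DBFS) that runs outside Databricks, the snippet below uses temporary directories as stand-ins for the driver disk and the /dbfs FUSE mount; the directory names are assumptions for illustration only:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for the driver's local disk and the /dbfs FUSE mount
# (hypothetical; on a real cluster you would use paths such as
# /local_disk0/tmp and /dbfs/tmp instead).
driver_dir = Path(tempfile.mkdtemp(prefix="driver_"))
dbfs_dir = Path(tempfile.mkdtemp(prefix="dbfs_"))

# 1. Write the file on the driver's local filesystem.
local_file = driver_dir / "report.txt"
local_file.write_text("results computed on the driver\n")

# 2. Copy it to DBFS so distributed Spark tasks can read it.
#    On Databricks, dbutils.fs.cp("file:/<path>", "dbfs:/<path>")
#    or %sh cp /<path> /dbfs/<path> perform the same step.
copied = shutil.copy(local_file, dbfs_dir / "report.txt")

print(Path(copied).read_text())  # → results computed on the driver
```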

Understand default locations with examples

The table and diagram summarize and illustrate the commands described in this section and when to use each syntax.

| Command | Default location | To read from DBFS root | To read from local filesystem |
| --- | --- | --- | --- |
| %fs | DBFS root | (default) | Add file:/ to path |
| %sh | Local driver node | Add /dbfs to path | (default) |
| dbutils.fs | DBFS root | (default) | Add file:/ to path |
| os.<command> or other local code | Local driver node | Add /dbfs to path | (default) |
| spark.[read/write] | DBFS root | (default) | Not supported |
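The rules in the table can also be expressed as a small helper. The functions below are hypothetical (not part of dbutils or any Databricks API); they simply encode which prefix each command family expects:

```python
# Hypothetical helpers (not a Databricks API) that encode the table's rules:
# given a path relative to the DBFS root or to the driver's local root,
# produce the spelling each kind of command expects.

def dbfs_path_for(command: str, path: str) -> str:
    """Spell a DBFS-root path for the given command family."""
    path = path.lstrip("/")
    if command in ("%fs", "dbutils.fs", "spark"):
        return f"dbfs:/{path}"   # native DBFS commands: dbfs:/ or a bare path
    if command in ("%sh", "os"):
        return f"/dbfs/{path}"   # local commands reach DBFS via the FUSE mount
    raise ValueError(f"unknown command family: {command}")

def local_path_for(command: str, path: str) -> str:
    """Spell a driver-local path for the given command family."""
    path = path.lstrip("/")
    if command in ("%fs", "dbutils.fs"):
        return f"file:/{path}"   # DBFS commands need the file:/ scheme
    if command in ("%sh", "os"):
        return f"/{path}"        # local commands use the path directly
    if command == "spark":
        raise ValueError("spark.read/write cannot target the driver filesystem")
    raise ValueError(f"unknown command family: {command}")

print(dbfs_path_for("os", "tmp/data.csv"))  # → /dbfs/tmp/data.csv
print(local_path_for("dbutils.fs", "tmp"))  # → file:/tmp
```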

File paths diagram
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt

# Default location for dbutils.fs is root
dbutils.fs.ls("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")

# Default location for %sh is the local filesystem
%sh ls /dbfs/tmp/

# Default location for os commands is the local filesystem
import os
os.listdir('/dbfs/tmp')

# With %fs and dbutils.fs, you must use file:/ to read from the local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")

# %sh reads from the local filesystem by default
%sh ls /tmp

Access files on mounted object storage

Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.

dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mnt/mymount/my_file.txt")

Local file API limitations

The following list covers the limitations of local file API usage with FUSE in Databricks Runtime.

  • Does not support Amazon S3 mounts with client-side encryption enabled.

  • Does not support random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs. For example:

import xlsxwriter
from shutil import copyfile

# Build the workbook on the driver's local disk, where random writes are supported.
workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()

# Copy the finished file to DBFS in a single sequential write.
copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
  • No sparse files. To copy sparse files, use cp --sparse=never:

$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file
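The random-write workaround boils down to: perform seek-based writes on the driver's local disk, then copy the finished file to /dbfs as one sequential stream. A minimal sketch of that pattern, using temporary directories as stand-ins for /local_disk0 and the /dbfs mount (an assumption so the snippet runs outside Databricks):

```python
import os
import shutil
import tempfile

local_dir = tempfile.mkdtemp()  # stand-in for /local_disk0/tmp (hypothetical)
dbfs_dir = tempfile.mkdtemp()   # stand-in for /dbfs/tmp (hypothetical)

path = os.path.join(local_dir, "data.bin")
# Random writes (seek + write) are fine on local disk...
with open(path, "wb") as f:
    f.write(b"\x00" * 16)
    f.seek(4)            # jump back into the file: a random write,
    f.write(b"DATA")     # which the /dbfs FUSE layer would reject
# ...then copy the finished file to DBFS as one sequential write.
shutil.copyfile(path, os.path.join(dbfs_dir, "data.bin"))
```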