Databricks File System

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:

  • Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
  • Allows you to interact with object storage using directory and file semantics instead of storage URLs.
  • Persists files to object storage, so you won’t lose data after you terminate a cluster.

DBFS root

The default storage location in DBFS is known as the DBFS root. Several types of data are stored in the following DBFS root locations:

  • /FileStore: Imported data files, generated plots, and uploaded libraries. See FileStore.
  • /databricks-datasets: Sample public datasets.
  • /databricks-results: Files generated by downloading the full results of a query.
  • /databricks/init: Global and cluster-named (deprecated) init scripts.
  • /user/hive/warehouse: Data and metadata for non-external Hive tables.
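
For example, you can browse any of these locations from a notebook with dbutils.fs (described later in this article):

# List the sample datasets stored in the DBFS root
display(dbutils.fs.ls("/databricks-datasets"))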

In a new workspace, the DBFS root has the following default folders:

[Image: default folders in the DBFS root]

The DBFS root also contains data that is not visible and cannot be directly accessed, including mount point metadata, credentials, and certain types of logs.

Important

Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, we recommend storing data in mounted object storage rather than in the DBFS root.

Note

Historically, DBFS used an S3 bucket in the Databricks account to store data that is not stored on a DBFS mount point. If your Databricks workspace still uses this S3 bucket, we recommend that you contact Databricks support to have the data moved to an S3 bucket in your own account.

Mount object storage to DBFS

Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.

Important

All users have read and write access to the objects in object storage mounted to DBFS.

For information on how to mount and unmount AWS S3 buckets, see Mount S3 Buckets with DBFS. For information on encrypting data when writing to S3 through DBFS, see Encrypt data in S3 buckets.

For information on how to mount and unmount Azure Blob storage containers and Azure Data Lake Storage accounts, see Mount Azure Blob storage containers to DBFS, Mount Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0, and Mount an Azure Data Lake Storage Gen2 account using a service principal and OAuth 2.0.
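
As a minimal sketch, mounting and unmounting with dbutils.fs looks like the following. It assumes the cluster can already authenticate to the bucket (for example, through an IAM role), and my-bucket and /mnt/my-bucket are placeholder names; see the linked articles for the credential options each storage service requires.

# Mount an S3 bucket at /mnt/my-bucket (placeholder names; assumes the cluster
# can already authenticate to the bucket)
dbutils.fs.mount("s3a://my-bucket", "/mnt/my-bucket")

# List the current mounts, then unmount when no longer needed
display(dbutils.fs.mounts())
dbutils.fs.unmount("/mnt/my-bucket")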

Access DBFS

You can access DBFS objects using the DBFS CLI, the DBFS API, Databricks file system utilities (dbutils.fs), Spark APIs, and local file APIs. On a Databricks cluster, you access DBFS objects using dbutils.fs, Spark APIs, or local file APIs. From a local computer, you access DBFS objects using the DBFS CLI or the DBFS API.
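
The sections below cover the CLI, dbutils.fs, Spark APIs, and local file APIs. As a rough sketch of calling the DBFS REST API directly, the following lists the DBFS root over HTTP; it assumes your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.

import os
import requests

# List the DBFS root using the DBFS REST API (2.0/dbfs/list)
response = requests.get(
    os.environ["DATABRICKS_HOST"] + "/api/2.0/dbfs/list",
    headers={"Authorization": "Bearer " + os.environ["DATABRICKS_TOKEN"]},
    params={"path": "/"},
)
response.raise_for_status()
for entry in response.json().get("files", []):
    print(entry["path"], entry["is_dir"])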

Databricks CLI

The DBFS command-line interface (CLI) uses the DBFS API to provide an easy-to-use command-line interface to DBFS. With this client, you can interact with DBFS using commands similar to those you use on a Unix command line. For example:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

For more information about the DBFS command-line interface, see Databricks CLI.

dbutils

dbutils.fs provides file-system-like commands to access files in DBFS. This section has several examples of how to write files to and read files from DBFS using dbutils.fs commands.

Tip

To access the help menu for DBFS, use the dbutils.fs.help() command.

  • Write files to and read files from the DBFS root as if it were a local filesystem.

    dbutils.fs.mkdirs("/foobar/")
    
    dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
    
    dbutils.fs.head("/foobar/baz.txt")
    
    dbutils.fs.rm("/foobar/baz.txt")
    
  • Use dbfs:/ to access a DBFS path.

    display(dbutils.fs.ls("dbfs:/foobar"))
    
  • Notebooks support a shorthand—%fs magic commands—for accessing the dbutils filesystem module. Most dbutils.fs commands are available using %fs magic commands.

    # List the DBFS root
    
    %fs ls
    
    # Recursively remove the files under foobar
    
    %fs rm -r foobar
    
    # Overwrite the file "/mnt/my-file" with the string "Hello world!"
    
    %fs put -f "/mnt/my-file" "Hello world!"
    

Spark APIs

When you use Spark APIs, you reference files with "/mnt/training/file.csv" or "dbfs:/mnt/training/file.csv". The following example writes a DataFrame as text to /tmp/foo.txt in DBFS.

df.write.text("/tmp/foo.txt")
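
To read the data back, either path form works. For example:

df = spark.read.text("/tmp/foo.txt")        # implicit dbfs:/ prefix
df = spark.read.text("dbfs:/tmp/foo.txt")   # equivalent explicit form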

Local file APIs

You can use local file APIs to read and write to DBFS paths. Databricks configures each cluster node with a FUSE mount /dbfs that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs. When using local file APIs, you must provide the path under /dbfs. For example:

Python
# Write a file to DBFS using Python I/O APIs
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)
Scala
import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

Local file API limitations

The following list enumerates the limitations in local file API usage that apply to each Databricks Runtime version.

  • All - Does not support AWS S3 mounts with client-side encryption enabled.

  • 6.0

    • Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs. For example:

      # python
      import xlsxwriter
      from shutil import copyfile
      
      workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
      worksheet = workbook.add_worksheet()
      worksheet.write(0, 0, "Key")
      worksheet.write(0, 1, "Value")
      workbook.close()
      
      copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
      
    • Does not support sparse files. To copy sparse files, use cp --sparse=never:

      $ cp sparse.file /dbfs/sparse.file
      error writing '/dbfs/sparse.file': Operation not supported
      $ cp --sparse=never sparse.file /dbfs/sparse.file
      
  • 5.5 and below

    • Supports only files smaller than 2 GB. If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.

    • If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync. For example:

      // scala
      import scala.sys.process._
      
      // Write a file using the local file API (over the FUSE mount).
      dbutils.fs.put("file:/dbfs/tmp/test", "test-contents")
      
      // Flush to persistent storage.
      "sync /dbfs/tmp/test" !
      
      // Read the file using "dbfs:/" instead of the FUSE mount.
      dbutils.fs.head("dbfs:/tmp/test")
      

Local file APIs for deep learning

For distributed deep learning applications, which require DBFS access for loading, checkpointing, and logging data, Databricks Runtime 6.0 and above provide a high-performance /dbfs mount that’s optimized for deep learning workloads.

In Databricks Runtime 5.4 and Databricks Runtime 5.5, only /dbfs/ml is optimized. In these versions Databricks recommends saving data under /dbfs/ml, which maps to dbfs:/ml.
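
For example, on Databricks Runtime 5.4 or 5.5 a training job might write checkpoints through this optimized path with ordinary local file APIs. This is a minimal sketch; the directory and file names are placeholders.

import os

# /dbfs/ml maps to dbfs:/ml; the paths below are placeholders
checkpoint_dir = "/dbfs/ml/my-model/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

with open(os.path.join(checkpoint_dir, "epoch_1.ckpt"), "w") as f:
    f.write("checkpoint contents placeholder")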

For Databricks Runtime 5.3 and lower, see the recommendation in Prepare Storage for Data Loading and Model Checkpointing.