Databricks File System - DBFS

The Databricks File System (DBFS) is a distributed file system that comes installed on Spark clusters in Databricks. It is a layer over S3 that allows you to:

  1. Mount S3 buckets to make them available to users in your workspace
  2. Cache S3 data on the solid-state disks (SSDs) of your worker nodes to speed up access.

The Databricks File System is available in both Python and Scala. By default, DBFS uses an S3 bucket created in the Databricks account to store data that is not stored on a DBFS mount point. Databricks can switch this over to an S3 bucket in your own account at your request. Mounting other S3 buckets in DBFS gives your Databricks users access to specific data without requiring them to have your S3 keys. In addition,

  • DBFS can cache data from S3 (including any bucket you mount) onto the SSDs of the Spark clusters you launch.
  • Files in DBFS persist to S3, so you won’t lose data even after you terminate the clusters.
  • dbutils makes it easy for you to use DBFS and is automatically available (no import necessary) in every Databricks notebook.

You can access DBFS through Databricks Utilities (dbutils) on a Spark cluster, or through the DBFS command line interface on your local computer.

DBFS Command Line Interface

The DBFS command line interface leverages the DBFS API to expose an easy-to-use command line interface to DBFS. Using this client, interacting with DBFS is as easy as running:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

Installing is as simple as running:

pip install --upgrade databricks-cli

For more instructions, visit our GitHub project.
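Before running dbfs commands, the CLI needs to know your Databricks host and credentials. A minimal sketch using a personal access token; the host URL is a placeholder, and the configuration is written to ~/.databrickscfg, which the dbfs commands read as well:

# Configure the CLI with a personal access token; it prompts for your
# Databricks host (e.g. https://<your-deployment>.cloud.databricks.com) and token
databricks configure --token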

Note

The Databricks Command Line interface is under active development and is released as an experimental client.

Saving Files to DBFS with dbutils

  • Read and write files to DBFS as if it were a local filesystem.
# Create a directory, write a small file, preview its contents, then remove it
dbutils.fs.mkdirs("/foobar/")
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
dbutils.fs.head("/foobar/baz.txt")
dbutils.fs.rm("/foobar/baz.txt")

Use Spark to write to DBFS

# python
sc.parallelize(range(0, 100)).saveAsTextFile("/tmp/foo.txt")
// scala
sc.parallelize(0 until 100).saveAsTextFile("/tmp/bar.txt")
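To check the output, you can read the same paths back; a short sketch in Python (saveAsTextFile writes a directory of part files, which sc.textFile reads as one dataset):

# python
# Read the text files written above back into an RDD and look at a few records
rdd = sc.textFile("/tmp/foo.txt")
rdd.take(5)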

Use no prefix, or the dbfs:/ prefix, to access a DBFS path.

display(dbutils.fs.ls("dbfs:/foobar"))

Use the file:/ prefix to access the local disk.

dbutils.fs.ls("file:/foobar")

Filesystem cells provide a shorthand for accessing the dbutils filesystem module. Most dbutils.fs commands are available via the %fs magic command as well.

%fs rm -r foobar
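A couple more %fs examples, mirroring the dbutils.fs.ls and dbutils.fs.mkdirs calls shown earlier (each command goes in its own cell; the paths are the ones used above):

%fs ls /tmp
%fs mkdirs /foobar/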

Using Local File I/O APIs

  • You can use local file I/O APIs to read and write to DBFS paths. Databricks configures each node with a FUSE mount at /dbfs that allows processes to read and write to the underlying distributed storage layer.
# python
# Write a file to DBFS using Python file I/O APIs
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file back
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

// scala
import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
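Because of the FUSE mount, ordinary filesystem utilities work against /dbfs as well. A minimal sketch using Python's os module on the file written above:

# python
import os

# List the directory and check the size of the file through the local /dbfs path
print(os.listdir("/dbfs/tmp"))
print(os.path.getsize("/dbfs/tmp/test_dbfs.txt"))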

Mounting an S3 Bucket

  • Mounting an S3 bucket directly to DBFS allows you to access files in S3 as if they were on the local file system.

Note

A common issue is choosing bucket names that are not valid URIs. See the S3 bucket name limitations: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html

We recommend Secure Access to S3 Buckets using IAM Roles for mounting your buckets. IAM roles allow you to mount a bucket as a path (see the sketch below). You can also mount a bucket using keys, although we do not recommend doing so.
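A minimal sketch of an IAM-role-based mount (the bucket and mount names are placeholders, and the cluster is assumed to have been launched with an IAM role that can access the bucket):

# python
# With an IAM role, no keys appear in the mount URI
AWS_BUCKET_NAME = "MY_BUCKET"
MOUNT_NAME = "MOUNT_NAME"
dbutils.fs.mount("s3a://%s" % AWS_BUCKET_NAME, "/mnt/%s" % MOUNT_NAME)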

To mount with keys instead, replace the values in the following cells with your S3 credentials.

# python
ACCESS_KEY = "YOUR_ACCESS_KEY"
# Encode the Secret Key as that can contain "/"
SECRET_KEY = "YOUR_SECRET_KEY".replace("/", "%2F")
AWS_BUCKET_NAME = "MY_BUCKET"
MOUNT_NAME = "MOUNT_NAME"

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
// scala
// Replace with your values
val AccessKey = "YOUR_ACCESS_KEY"
// Encode the Secret Key as that can contain "/"
val SecretKey = "YOUR_SECRET_KEY".replace("/", "%2F")
val AwsBucketName = "MY_BUCKET"
val MountName = "MOUNT_NAME"

dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))

Now you can access files in your S3 bucket as if they were local files, for example:

# python
rdd = sc.textFile("/mnt/%s/...." % MOUNT_NAME)
rdd = sc.textFile("dbfs:/mnt/%s/...." % MOUNT_NAME)
// scala
val rdd = sc.textFile(s"/mnt/$MountName/....")
val rdd = sc.textFile(s"dbfs:/mnt/$MountName/....")

Note: You can also use the FUSE mount to access mounted S3 buckets locally by referring to /dbfs/mnt/myMount/.
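When you no longer need a mount, you can detach it with dbutils.fs.unmount; a short sketch using the mount point created above (the files in S3 are not affected, only the mount point is removed):

# python
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)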

Getting Help

  • Use the dbutils.fs.help() command anytime to access the help menu for DBFS.
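dbutils.fs.help() also accepts a method name if you only want the documentation for a single call; a short sketch (the method-name form of help is an assumption based on the dbutils help conventions):

# python
# Print the full DBFS help menu
dbutils.fs.help()
# Print help for a single method, e.g. mount
dbutils.fs.help("mount")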