Amazon S3 with Apache Spark

S3 is Amazon’s cloud storage service, which Databricks recommends for storing your large data files. We recommend using the Databricks File System - DBFS to read from and write to Amazon S3 on Databricks. Mounting an S3 bucket through DBFS lets you access the data as if it were on the local disk. The mount is a pointer to an S3 location, so the data is never synced locally. Once mounted, any user can read from that directory without needing explicit keys.

Tip

We support the s3a protocol and recommend it over the native S3 block-based file system. See the AmazonS3 Wiki for more details on the differences between the two.

Mounting your S3 Bucket with DBFS

First, you need to obtain access credentials that allow you to read from your Amazon S3 bucket. We recommend that you follow the best practices for hiding secrets in Databricks; see Best Practices for Hiding Secrets in Databricks.

In Python

# Replace with your values
#
# NOTE: Set the access to this notebook appropriately to protect the security of your keys.
# Or you can delete this cell after you run the mount command below once successfully.
ACCESS_KEY = "REPLACE_WITH_YOUR_ACCESS_KEY"
SECRET_KEY = "REPLACE_WITH_YOUR_SECRET_KEY"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "REPLACE_WITH_YOUR_S3_BUCKET"
MOUNT_NAME = "REPLACE_WITH_YOUR_MOUNT_NAME"

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)

In Scala

The code is nearly identical in Scala.

// Replace with your values
//
// NOTE: Set the access to this notebook appropriately to protect the security of your keys.
// Or you can delete this cell after you run the mount command below once successfully.

val AccessKey = "REPLACE_WITH_YOUR_ACCESS_KEY"
val SecretKey = "REPLACE_WITH_YOUR_SECRET_KEY"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "REPLACE_WITH_YOUR_S3_BUCKET"
val MountName = "REPLACE_WITH_YOUR_MOUNT_NAME"

dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")

Accessing S3 Data

After mounting, you can read files through the mount point like any other path:

myRDD = sc.textFile("/mnt/%s/...path_to_your_file..." % MOUNT_NAME)
myRDD.count()

You can also use dbutils to view your bucket and access your files. See Databricks File System - DBFS for details.

display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
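
You can also load the mounted data into a DataFrame rather than an RDD. The snippet below is a minimal sketch: the file path placeholder is the same as above, and the CSV format with a header row is only an assumption, so adjust both to match your data.

# A minimal sketch: the path placeholder and CSV format are assumptions; adjust to your data.
df = spark.read.csv("/mnt/%s/...path_to_your_file..." % MOUNT_NAME, header=True)
df.count()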

Alternative 1: Prefix with s3a://

Use any Spark command for creating RDDs/DataFrames/Datasets from data on a file system. The secret key needs to be URL escaped.

myRDD = sc.textFile("s3a://%s:%s@%s/.../..." % ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME)
myRDD.count()

Alternative 2: Set AWS Keys in the Spark Context

This allows the Apache Spark workers to access your S3 bucket without requiring the credentials in the path. You do not need to escape your secret key when it is set this way.

Warning

This method is not recommended.

In Python

# Set the S3A credentials on the Hadoop configuration so the workers can access the bucket directly.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)

myRDD = sc.textFile("s3a://%s/.../..." % AWS_BUCKET_NAME)
myRDD.count()

Alternative 3: Using Boto

Lastly, you can use the Boto Python library to programmatically read and write data from S3; however, these reads and writes are not done in parallel, so this approach is not recommended.
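
If you do use this approach, the snippet below sketches what it looks like with boto3 (the current Boto library); the object key is a placeholder, and the credentials are the same ones defined above.

# A minimal sketch using boto3; the object key below is a placeholder.
# These reads and writes run on the driver only and are not parallelized.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
)

# Write a small object and read it back.
s3.put_object(Bucket=AWS_BUCKET_NAME, Key="example/hello.txt", Body=b"hello from boto3")
obj = s3.get_object(Bucket=AWS_BUCKET_NAME, Key="example/hello.txt")
print(obj["Body"].read())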

Amazon S3 Server-Side Encryption (SSE-S3/KMS)

  • This section covers how to use server-side encryption when writing files to S3 through DBFS.
  • This is possible through dbutils in both Python and Scala.
  • Currently, Databricks supports SSE-S3 and SSE-KMS.

Writing Files using SSE-S3

  • Mount a source directory, passing in “sse-s3” as the encryption type.
  • Write files using that mount point.

Run the command below to mount your S3 bucket with SSE-S3.

dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")

Run the command below to write a file to the corresponding S3 bucket with SSE-S3 (the file name here is only an example).

dbutils.fs.put(s"/mnt/$MountName/example.txt", "file content")

Writing Files using SSE-KMS

  • Mount a source directory passing in “sse-kms” or “sse-kms:$KmsKey” as the encryption type.
  • Write files using that mount point.

Run the command below to mount your S3 bucket with SSE-KMS using the default KMS master key.

dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms")

Or run the command below to mount your S3 bucket with SSE-KMS using a specific KMS key.

dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms:$KmsKey")

Now, run the command below to write a file to the corresponding S3 bucket with SSE-KMS (again, the file name is only an example).

dbutils.fs.put(s"/mnt/$MountName/example.txt", "file content")

Client-Side S3 Encryption

Databricks provides an implementation of EncryptionMaterialsProvider support for the AWS S3AFileSystem. This is an advanced feature for customers who want to use client-side encryption of data on Databricks clusters and manage their own keys. It is similar to the feature provided by Amazon’s EMRFS.

The first step in using this feature is to attach a library (see Libraries) containing your EncryptionMaterialsProvider class. After attaching the library, set the following configurations:

sc.hadoopConfiguration.setBoolean("fs.s3.cse.enabled", true)
sc.hadoopConfiguration.setClass("fs.s3.cse.encryptionMaterialsProvider",
  classOf[YourEncryptionMaterialsProvider],
  classOf[com.amazonaws.services.s3.model.EncryptionMaterialsProvider])

Note

You must use the s3araw scheme, and you cannot use DBFS mount points or caching in tandem with this approach.

Read files with:

sc.textFile("s3araw://<YOUR KEY>@bucket/foo")

Checking the Encryption Type of Mount Points

You can see the encryption type associated with each mount point by running the command below; it works in both Python and Scala.

display(dbutils.fs.mounts())
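
To check a single mount programmatically, you can filter that list; the short Python sketch below assumes the returned entries expose mountPoint and encryptionType fields, as shown in the display output above, and reuses the MOUNT_NAME defined earlier.

# A short sketch: print the encryption type of the mount created above.
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/%s" % MOUNT_NAME:
        print(m.mountPoint, m.encryptionType)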