Amazon S3 with Apache Spark

Amazon S3 is a cloud object storage service. Databricks recommends storing your large data files in S3 and using Databricks File System - DBFS to read from and write to it.

Tip

Databricks supports the s3a protocol and recommends it over the native S3 block-based file system. See the Hadoop AmazonS3 Wiki for more details about the differences between the two.

Requirements

Get AWS access credentials. We recommend that you follow Best Practices for Hiding Secrets in Databricks.

Access S3 with DBFS

Mounting an S3 bucket with DBFS lets you access the data as if it were on the local file system. The mount is a pointer to an S3 location, so the data is never synced locally. Once the bucket is mounted, any user can read from the mount point without providing explicit keys.

Step 1: Mount S3 Bucket with DBFS

  • In Python
# Replace with your values
#
# Restrict access to this notebook to protect the security of your keys,
# or delete this cell after the mount command below has run successfully.
#
ACCESS_KEY = "<aws-access-key>"
SECRET_KEY = "<aws-secret-key>"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "<aws-bucket-name>"
MOUNT_NAME = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
  • In Scala
// Replace with your values
//
// Restrict access to this notebook to protect the security of your keys,
// or delete this cell after the mount command below has run successfully.

val AccessKey = "<aws-access-key>"
val SecretKey = "<aws-secret-key>"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "<aws-bucket-name>"
val MountName = "<mount-name>"

dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
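
When a mount is no longer needed, you can remove it. A minimal sketch using dbutils; the mount name below is a placeholder:

# Unmount the bucket; the mount point path must match the one used above.
dbutils.fs.unmount("/mnt/<mount-name>")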

Step 2: Access S3 data

Use Databricks Utilities (dbutils) to view your bucket and access your files. See Databricks File System - DBFS for details. For example:

myRDD = sc.textFile("/mnt/%s/...path_to_your_file..." % MOUNT_NAME)
myRDD.count()
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
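
Mounted paths also work with the DataFrame API. A minimal sketch, assuming a CSV file with a header row sits at a hypothetical path under the mount:

# Hypothetical file: replace <path-to-file> with a CSV file that exists under the mount.
df = spark.read.format("csv").option("header", "true").load("/mnt/%s/<path-to-file>.csv" % MOUNT_NAME)
display(df)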

Alternative 1: Prefix with s3a://

Use any Spark command for creating RDDs, DataFrames, and Datasets from data on a file system. You must URL-escape the secret key.

myRDD = sc.textFile("s3a://%s:%s@%s/.../..." % ACCESS_KEY, ENCODED_SECRET_KEY, BUCKET_NAME)
myRDD.count()
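
If your secret key contains reserved characters other than /, the simple replace shown in Step 1 may not be sufficient. A minimal sketch that percent-encodes the whole key with the Python standard library (assumes Python 3):

# Percent-encode every reserved character in the secret key before embedding it in the URI.
from urllib.parse import quote

ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")
myRDD = sc.textFile("s3a://%s:%s@%s/.../..." % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME))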

Alternative 2: Set AWS keys in the Spark context

This allows the Apache Spark workers to access your S3 bucket without requiring the credentials in the path. You do not need to escape your secret key.

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
myRDD = sc.textFile("s3a://%s/.../..." % MOUNT_NAME)
myRDD.count()

Important

Databricks does not recommend this method.

Alternative 3: Use Boto

You can use the Boto Python library to programmatically read and write data in S3. However, Boto runs only on the driver, so these reads and writes are not performed in parallel across the cluster.
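
A minimal sketch using boto3, the current generation of the Boto library; it assumes boto3 is installed on the cluster, the bucket and key names are placeholders, and the calls run only on the driver:

# Driver-only example: upload a small object and read it back.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
)
s3.put_object(Bucket="<aws-bucket-name>", Key="<file-key>", Body=b"file content")
obj = s3.get_object(Bucket="<aws-bucket-name>", Key="<file-key>")
print(obj["Body"].read())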

Important

Databricks does not recommend this method.

Encryption

Databricks supports server-side and client-side encryption.

Server-side S3 encryption

This section covers how to use server-side encryption when writing files in S3 through DBFS. Databricks supports SSE-S3 and SSE-KMS.

Write files using SSE-S3

  1. To mount your S3 bucket with SSE-S3, run:

    dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
    
  2. To write files to the corresponding S3 bucket with SSE-S3, run:

    dbutils.fs.put(s"/mnt/$MountName", "file content")
    

Write files using SSE-KMS

  1. Mount a source directory passing in sse-kms or sse-kms:$KmsKey as the encryption type.

    • To mount your S3 bucket with SSE-KMS using the default KMS master key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms")
      
    • To mount your S3 bucket with SSE-KMS using a specific KMS key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms:$KmsKey")
      
  2. To write files to the corresponding S3 bucket with SSE-KMS, run:

    dbutils.fs.put(s"/mnt/$MountName", "file content")
    

Client-side S3 encryption

Databricks provides an implementation of the EncryptionMaterialsProvider interface for the AWS S3AFileSystem. This is an advanced feature for customers who want to use client-side encryption of data on Databricks clusters and manage their own keys. It is similar to the feature provided by Amazon EMRFS.

  1. Attach a library containing your EncryptionMaterialsProvider class.

  2. Set the configuration:

    sc.hadoopConfiguration.setBoolean("fs.s3.cse.enabled", true)
    sc.hadoopConfiguration.setClass("fs.s3.cse.encryptionMaterialsProvider",
      classOf[<YourEncryptionMaterialsProvider>],
      classOf[com.amazonaws.services.s3.model.EncryptionMaterialsProvider])
    
  3. Read files with:

    sc.textFile("s3araw://<YOUR KEY>@bucket/foo")
    

    Note

    You must use the s3araw scheme, and you cannot use DBFS mount points or caching in tandem with this approach.

Check the encryption type of mount points

To see the encryption type associated with each mount point, run the command:

display(dbutils.fs.mounts())
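
To check a single mount point programmatically, filter the output instead. A minimal sketch, assuming the entries returned by dbutils.fs.mounts() expose mountPoint and encryptionType fields:

# Print the encryption type of one mount point (placeholder name shown).
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/<mount-name>":
        print(m.encryptionType)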