Amazon S3

Amazon S3 is a cloud storage service that Databricks recommends for storing your large data files. We recommend using the Databricks File System (DBFS) to read from and write to Amazon S3.

Tip

Databricks supports the s3a protocol and recommends it over the native S3 block-based file system. See the AmazonS3 Wiki for more details about the differences between the two.

This topic explains how to access AWS S3 buckets either by mounting them with DBFS or by accessing them directly using APIs.

Authentication

We recommend using IAM roles for authentication and authorization. You can also use AWS keys, although we do not recommend doing so. If you do use keys, follow the Secrets user guide to manage your credentials safely in Databricks.
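For example, if your keys are stored in a secret scope, you can read them at runtime instead of hard-coding them in a notebook. The following Python sketch assumes a secret scope named aws-keys containing secrets named aws-access-key and aws-secret-key; these names are placeholders.

# Read AWS credentials from a Databricks secret scope
# (the scope and key names are illustrative placeholders)
ACCESS_KEY = dbutils.secrets.get(scope = "aws-keys", key = "aws-access-key")
SECRET_KEY = dbutils.secrets.get(scope = "aws-keys", key = "aws-secret-key")
# URL encode the secret key because it can contain "/"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")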

Access S3 with DBFS

Mount an S3 bucket

Mounting an S3 bucket using DBFS allows you to access the data as if it were on the local disk. The mount is a pointer to an S3 location, so the data is never synced locally. Once mounted, any user can read from that directory.

Tip

A common issue is choosing a bucket name that is not a valid URI. For more information, see S3 bucket name limitations.

  1. Mount the bucket. In the following examples, replace the placeholder values with your S3 credentials, bucket name, and mount name.

    Python
    ACCESS_KEY = "<aws-access-key>"
    SECRET_KEY = "<aws-secret-key>"
    # URL encode the secret key because it can contain "/"
    ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
    AWS_BUCKET_NAME = "<aws-bucket-name>"
    MOUNT_NAME = "<mount-name>"
    
    dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
    display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
    
    Scala
    val AccessKey = "<aws-access-key>"
    val SecretKey = "<aws-secret-key>"
    // URL encode the secret key because it can contain "/"
    val EncodedSecretKey = SecretKey.replace("/", "%2F")
    val AwsBucketName = "<aws-bucket-name>"
    val MountName = "<mount-name>"
    
    dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
    display(dbutils.fs.ls(s"/mnt/$MountName"))
    
  2. Access files in your S3 bucket as if they were local files:

    Python
    df = spark.read.text("/mnt/%s/...." % MOUNT_NAME)

    or

    df = spark.read.text("dbfs:/mnt/%s/...." % MOUNT_NAME)

    Scala
    val df = spark.read.text(s"/mnt/$MountName/....")

    or

    val df = spark.read.text(s"dbfs:/mnt/$MountName/....")

Note

You can also access mounted S3 buckets with local file APIs through the DBFS FUSE mount by referring to paths such as /dbfs/mnt/myMount/.
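For example, a minimal Python sketch that reads a file through the FUSE path; the mount name and file name are placeholders:

# Read a file from a mounted bucket through the local /dbfs FUSE path
# (<mount-name> and example.txt are illustrative placeholders)
with open("/dbfs/mnt/<mount-name>/example.txt") as f:
    print(f.read())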

Unmount an S3 bucket

To unmount a mount point, use the following command:

Python
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)
Scala
dbutils.fs.unmount(s"/mnt/$MountName")

Access AWS S3 directly

Alternative 1: Set AWS keys in the Spark context

This allows the Apache Spark workers to access your S3 bucket without requiring the credentials in the path. You do not need to escape your secret key.

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
myRDD = sc.textFile("s3a://%s/.../..." % AWS_BUCKET_NAME)
myRDD.count()

Alternative 2: Encode keys in URI

Use any Spark command for creating RDDs, DataFrames, and Datasets from data on a file system. You must URL escape the secret key.

myRDD = sc.textFile("s3a://%s:%s@%s/.../..." % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME))
myRDD.count()

Important

Databricks does not recommend this method.

Alternative 3: Use Boto

You can use the Boto Python library to programmatically write and read data from S3. However, this does not read or write data in parallel.
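The following is a minimal sketch using boto3 (assuming the library is available on the cluster), reusing the ACCESS_KEY and SECRET_KEY values from above. The bucket and object key are placeholders.

import boto3

# Create an S3 client with explicit credentials (assumes boto3 is installed on the cluster)
s3 = boto3.client(
    "s3",
    aws_access_key_id = ACCESS_KEY,
    aws_secret_access_key = SECRET_KEY,
)

# Write a small object, then read it back; the bucket and key names are placeholders
s3.put_object(Bucket = "<aws-bucket-name>", Key = "<path/to/object.txt>", Body = b"<file content>")
obj = s3.get_object(Bucket = "<aws-bucket-name>", Key = "<path/to/object.txt>")
print(obj["Body"].read())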

Important

Databricks does not recommend this method.

Encryption

Databricks supports server-side and client-side encryption.

Server-side S3 encryption

This section covers how to use server-side encryption when writing files in S3 through DBFS. Databricks supports Amazon S3-managed encryption keys (SSE-S3) and AWS KMS–managed encryption keys (SSE-KMS).

Write files using SSE-S3

  1. To mount your S3 bucket with SSE-S3, run:

    dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
    
  2. To write files to the corresponding S3 bucket with SSE-S3, run:

    dbutils.fs.put(s"/mnt/$MountName/<file-path>", "<file content>")
    

Write files using SSE-KMS

  1. Mount a source directory passing in sse-kms or sse-kms:$KmsKey as the encryption type.

    • To mount your S3 bucket with SSE-KMS using the default KMS master key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms")
      
    • To mount your S3 bucket with SSE-KMS using a specific KMS key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms:$KmsKey")
      
  2. To write files to the S3 bucket with SSE-KMS, run:

    dbutils.fs.put(s"/mnt/$MountName/<file-path>", "<file content>")
    
Configure KMS encryption

If you want to use s3a:// paths in your code, you must set up the following global KMS encryption properties in a Spark configuration setting or using an init script. Set the spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id key to your own key ARN.

spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id arn:aws:kms:<region>:<aws-account-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>
spark.hadoop.fs.s3a.server-side-encryption-algorithm aws:kms
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
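After these properties are set (and the IAM role has been granted key user permission as described below), you can read and write data through s3a:// paths directly. A minimal Python sketch with placeholder bucket and paths:

# Reads and writes through s3a:// paths are encrypted with the configured KMS key
# (the bucket and paths are placeholders)
df = spark.read.text("s3a://<aws-bucket-name>/<path-to-data>")
df.write.text("s3a://<aws-bucket-name>/<output-path>")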

To use these configurations you must also configure an IAM role and add the IAM role as a key user for the KMS key provided in the configuration. To add key user permission to an IAM role:

  1. Go to the IAM service.
  2. Click Encryption Keys at the bottom of the sidebar.
  3. Click the key that you want to add permission to.
  4. In the Key Users section, click Add.
  5. Select the checkbox next to the IAM role.
  6. Click Attach.

Init script

You can test the global encryption setting by running the following code in a notebook cell and launching a cluster named test-kms. Once you verify that encryption is working, remove test-kms from the init script path and rerun the cell to enable encryption on all clusters.

%python
dbutils.fs.put("/databricks/init/test-kms/set-kms.sh", """
#!/bin/bash

cat >/databricks/driver/conf/aes-encrypt-custom-spark-conf.conf <<EOL
[driver] {
  "spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id" = "arn:aws:kms:<region>:<aws-acccount-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>"
  "spark.hadoop.fs.s3a.server-side-encryption-algorithm" = "aws:kms"
  "spark.hadoop.fs.s3a.impl" = "com.databricks.s3a.S3AFileSystem"
}
EOL
""", true)

Client-side S3 encryption

Databricks provides an implementation of EncryptionMaterialsProvider support for the AWS S3AFileSystem. This is an advanced feature for clients that want to use client-side encryption of data on Databricks clusters and manage their own keys. It is similar to the feature provided by Amazon EMRFS.

  1. Attach a library containing your EncryptionMaterialsProvider class.

  2. Set the configuration:

    sc.hadoopConfiguration.setBoolean("fs.s3.cse.enabled", true)
    sc.hadoopConfiguration.setClass("fs.s3.cse.encryptionMaterialsProvider",
      classOf[<YourEncryptionMaterialsProvider>],
      classOf[com.amazonaws.services.s3.model.EncryptionMaterialsProvider])
    
  3. Read files with:

    sc.textFile("s3araw://<YOUR KEY>@bucket/foo")
    

    Note

    You must use the s3araw scheme, and you cannot use DBFS mount points or caching in tandem with this approach.

Check the encryption type of mount points

To verify the encryption type associated with each mount point, run the command:

display(dbutils.fs.mounts())
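The output includes an encryption type for each mount. Assuming the returned MountInfo entries expose that field as an attribute, you can also check a single mount point programmatically; the mount name below is a placeholder:

# Print the encryption type of one mount point
# (assumes MountInfo exposes an encryptionType field; <mount-name> is a placeholder)
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/<mount-name>":
        print(m.encryptionType)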