Amazon S3 Select

Amazon S3 Select enables retrieving only the required data from an object. The Databricks S3 Select connector provides an Apache Spark data source that leverages S3 Select. When you use an S3 Select data source, filter and column selection on a DataFrame are pushed down to S3, saving data bandwidth.
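
For example, in a sketch like the following (the bucket name, object key, and column names are hypothetical), only the projected columns and the rows matching the filter are transferred from S3:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// The schema must be supplied explicitly; the connector does not infer it.
val salesSchema = StructType(Seq(
  StructField("order_id", StringType),
  StructField("country", StringType),
  StructField("amount", DoubleType)))

val sales = spark.read
  .format("s3select")
  .schema(salesSchema)
  .load("s3://my-bucket/sales.csv")

// The column projection and the filter below are pushed down to S3 Select,
// so only the matching data leaves S3.
sales.select("order_id", "amount").filter(col("country") === "US").show()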

Limitations

Amazon S3 Select supports the following file formats:

  • CSV and JSON files

  • UTF-8 encoding

  • GZIP or no compression

The Databricks S3 Select connector has the following limitations:

  • Complex types (arrays and objects) cannot be used in JSON

  • Schema inference is not supported

  • File splitting is not supported; however, multiline records are supported

  • DBFS mount points are not supported

Important

Databricks Runtime 7.0, which includes an AWS SDK upgrade to 1.11.655, does not support org.apache.hadoop.fs.s3native.NativeS3FileSystem and org.apache.hadoop.fs.s3.S3FileSystem for accessing S3.

Databricks strongly encourages you to use the S3AFileSystem provided by Databricks, which is the default for the s3a://, s3://, and s3n:// file system schemes in Databricks Runtime. If you need assistance with migrating to S3AFileSystem, contact Databricks support or your Databricks account team.

Usage

spark.read.format("s3select").schema(...).options(...).load("s3://bucket/filename")
CREATE TABLE name (...) USING S3SELECT LOCATION 's3://bucket/filename' [ OPTIONS (...) ]

If the filename extension is .csv or .json, the format is automatically detected; otherwise you must provide the FileFormat option.
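
For an object whose key lacks a recognizable extension, the format must be named explicitly. A minimal sketch, with a hypothetical bucket, key, and schema:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

val df = spark.read
  .format("s3select")
  .schema(schema)
  .option("FileFormat", "csv")  // required here because the key has no .csv or .json extension
  .load("s3://my-bucket/exports/part-00000")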

Options

This section describes options for all file types and options specific to CSV and JSON.

Generic options

Option name     | Default value | Description
FileFormat      | 'auto'        | Input file type ('auto', 'csv', or 'json')
CompressionType | 'none'        | Compression codec used by the input file ('none' or 'gzip')
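
Both generic options are passed through the reader's option() method (or the OPTIONS clause of CREATE TABLE). A sketch for a gzip-compressed JSON object, with hypothetical names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val eventSchema = StructType(Seq(
  StructField("event", StringType),
  StructField("user", StringType)))

val events = spark.read
  .format("s3select")
  .schema(eventSchema)
  .option("FileFormat", "json")        // set explicitly rather than relying on extension detection
  .option("CompressionType", "gzip")
  .load("s3://my-bucket/logs/events.json.gz")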

CSV specific options

Option name                | Default value | Description
NullValue                  | ''            | Character string representing null values in the input
Header                     | false         | Whether to skip the first line of the input (potential header contents are ignored)
Comment                    | '#'           | Lines starting with the value of this parameter are ignored
RecordDelimiter            | '\n'          | Character separating records in a file
Delimiter                  | ','           | Character separating fields within a record
Quote                      | '"'           | Character used to quote values containing reserved characters
Escape                     | '"'           | Character used to escape the quote character inside quoted values
AllowQuotedRecordDelimiter | false         | Whether values can contain quoted record delimiters
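
As an illustration, a tab-separated file with a header line and "NULL" markers could be read with a sketch like the following (bucket, key, and column names are hypothetical):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

val people = spark.read
  .format("s3select")
  .schema(peopleSchema)
  .option("Header", "true")      // skip the first line; its contents are ignored
  .option("Delimiter", "\t")
  .option("NullValue", "NULL")
  .load("s3://my-bucket/people.csv")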

JSON specific options

Option name | Default value | Description
Type        | 'document'    | Type of input ('document' or 'lines')
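
For newline-delimited JSON (one record per line), set Type to 'lines'; the default 'document' treats the object as a single JSON document. A sketch with hypothetical names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val logSchema = StructType(Seq(
  StructField("level", StringType),
  StructField("message", StringType)))

val logs = spark.read
  .format("s3select")
  .schema(logSchema)
  .option("Type", "lines")  // one JSON record per line instead of a single document
  .load("s3://my-bucket/logs/app.json")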

S3 authentication

You can use the S3 authentication methods (keys and instance profiles) available in Databricks; we recommend that you use instance profiles. There are three ways of providing the credentials:

  1. Default Credential Provider Chain (recommended option): AWS credentials are automatically retrieved through the DefaultAWSCredentialsProviderChain. If you use instance profiles to authenticate to S3, use this method. The other methods of providing credentials (methods 2 and 3) take precedence over this default.

  2. Set keys in Hadoop conf: Specify AWS keys in Hadoop configuration properties.

    Important

    • When using AWS keys to access S3, always set the configuration properties fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey as shown in the following example; the properties fs.s3a.access.key and fs.s3a.secret.key are not supported.

    • To reference the s3a:// filesystem, set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties in a Hadoop XML configuration file or call sc.hadoopConfiguration.set() to set Spark’s global Hadoop configuration.

      // Scala: set the keys on the SparkContext's Hadoop configuration
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "$AccessKey")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "$SecretKey")
      
      # Python: the Hadoop configuration is reached through the JVM gateway
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
      
  3. Encode keys in URI: For example, the URI s3a://$AccessKey:$SecretKey@bucket/path/to/dir encodes the key pair (AccessKey, SecretKey).