Amazon S3 Select
Amazon S3 Select enables retrieving only the data you need from an object. The Databricks S3 Select connector provides an Apache Spark data source that leverages S3 Select. When you use an S3 Select data source, filter and column selection on a DataFrame are pushed down, saving S3 data bandwidth.
Limitations
Amazon S3 Select supports the following file formats:
- CSV and JSON files
- UTF-8 encoding
- GZIP or no compression
The Databricks S3 Select connector has the following limitations:
- Complex types (arrays and objects) cannot be used in JSON
- Schema inference is not supported
- File splitting is not supported; however, multiline records are supported
- DBFS mount points are not supported
Important
Databricks strongly encourages you to use the `S3AFileSystem` provided by Databricks, which is the default for the `s3a://`, `s3://`, and `s3n://` file system schemes in Databricks Runtime. If you need assistance with migration to `S3AFileSystem`, contact Databricks support or your Databricks account team.
Usage
```scala
spark.read.format("s3select").schema(...).options(...).load("s3://bucket/filename")
```

```sql
CREATE TABLE name (...) USING S3SELECT LOCATION 's3://bucket/filename' [ OPTIONS (...) ]
```
If the filename extension is `.csv` or `.json`, the format is automatically detected; otherwise you must provide the `FileFormat` option.
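For example, because schema inference is not supported, a read supplies an explicit schema. The following is a minimal Scala sketch; the bucket, path, column layout, and option values are hypothetical placeholders:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical schema for a three-column CSV file; adjust to your data.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("amount", DoubleType)
))

val df = spark.read
  .format("s3select")
  .schema(schema)               // required: schema inference is not supported
  .option("Header", "true")     // the first line of the CSV is a header
  .load("s3://my-bucket/data/records.csv")

// Column selection and filters are pushed down to S3 Select,
// so only the matching rows and columns are transferred from S3.
df.select(col("id"), col("amount"))
  .filter(col("amount") > 100)
  .show()
```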
Options
This section describes options for all file types and options specific to CSV and JSON.
Generic options
| Option name | Default value | Description |
|---|---|---|
| `FileFormat` | 'auto' | Input file type ('auto', 'csv', or 'json') |
| `CompressionType` | 'none' | Compression codec used by the input file ('none' or 'gzip') |
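If an object's name does not end in `.csv` or `.json`, set `FileFormat` (and, for compressed input, `CompressionType`) explicitly. A short sketch, again with a hypothetical bucket and object key:

```scala
// Hypothetical gzipped CSV export whose key carries no .csv extension,
// so the format and compression codec are declared explicitly.
val exported = spark.read
  .format("s3select")
  .schema("id INT, name STRING, amount DOUBLE")  // DDL-style schema string
  .option("FileFormat", "csv")
  .option("CompressionType", "gzip")
  .load("s3://my-bucket/exports/part-00000.gz")
```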
CSV-specific options
| Option name | Default value | Description |
|---|---|---|
| `NullValue` | '' | Character string representing null values in the input |
| `Header` | false | Whether to skip the first line of the input (potential header contents are ignored) |
| `Comment` | '#' | Lines starting with the value of this parameter are ignored |
| `RecordDelimiter` | '\n' | Character separating records in a file |
| `Delimiter` | ',' | Character separating fields within a record |
| `Quote` | '"' | Character used to quote values containing reserved characters |
| `Escape` | '"' | Character used to escape the quote character inside quoted values |
| `AllowQuotedRecordDelimiter` | false | Whether values can contain quoted record delimiters |
S3 authentication
You can use the S3 authentication methods (keys and instance profiles) available in Databricks; we recommend that you use instance profiles. There are three ways of providing the credentials:
1. Default Credential Provider Chain (recommended option): AWS credentials are automatically retrieved through the DefaultAWSCredentialsProviderChain. If you use instance profiles to authenticate to S3, use this method. The other methods of providing credentials (methods 2 and 3) take precedence over this default.
2. Set keys in Hadoop conf: Specify AWS keys in Hadoop configuration properties.
Important
When using AWS keys to access S3, always set the configuration properties `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` as shown in the following example; the properties `fs.s3a.access.key` and `fs.s3a.secret.key` are not supported.

To reference the `s3a://` filesystem, set the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties in a Hadoop XML configuration file or call `sc.hadoopConfiguration.set()` to set Spark's global Hadoop configuration.

```scala
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "$AccessKey")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "$SecretKey")
```

```python
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
```
3. Encode keys in URI: For example, the URI `s3a://$AccessKey:$SecretKey@bucket/path/to/dir` encodes the key pair (`AccessKey`, `SecretKey`).
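A minimal Scala sketch of such a read, keeping the same `$AccessKey` and `$SecretKey` placeholders (instance profiles remain the recommended approach, since keys embedded in URIs are easy to leak):

```scala
// $AccessKey and $SecretKey are placeholders for the actual key pair;
// URL-encode the secret key if it contains characters such as '/' or '+'.
val df = spark.read
  .format("s3select")
  .schema("id INT, name STRING")
  .load("s3a://$AccessKey:$SecretKey@my-bucket/path/to/dir")
```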