Amazon S3 Select

Amazon S3 Select enables retrieving only the required data from an object. The Databricks S3 Select connector provides an Apache Spark data source that leverages S3 Select. When you use an S3 Select data source, filter and column selection on a DataFrame are pushed down to S3, saving data bandwidth.
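
For example, in a sketch like the following (the bucket name, object key, and column names are hypothetical), only the projected columns and the rows matching the filter are transferred from S3:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// The schema must be supplied explicitly; the connector does not infer it.
val salesSchema = StructType(Seq(
  StructField("order_id", StringType),
  StructField("country", StringType),
  StructField("amount", DoubleType)))

val sales = spark.read
  .format("s3select")
  .schema(salesSchema)
  .load("s3://my-bucket/sales.csv")

// The column projection and the filter below are pushed down to S3 Select,
// so only the matching data leaves S3.
sales.select("order_id", "amount").filter(col("country") === "US").show()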

Limitations

Amazon S3 Select supports the following file formats:

  • CSV and JSON files

  • UTF-8 encoding

  • GZIP or no compression

The Databricks S3 Select connector has the following limitations:

  • Complex types (arrays and objects) cannot be used in JSON

  • Schema inference is not supported

  • File splitting is not supported; however, multiline records are supported

  • DBFS mount points are not supported

Important

Databricks Runtime 7.0, which includes an AWS SDK upgrade to 1.11.655, does not support org.apache.hadoop.fs.s3native.NativeS3FileSystem and org.apache.hadoop.fs.s3.S3FileSystem for accessing S3.

Databricks strongly encourages you to use the S3AFileSystem provided by Databricks, which is the default for the s3a://, s3://, and s3n:// file system schemes in Databricks Runtime. If you need assistance with migrating to S3AFileSystem, contact Databricks support or your Databricks account team.

Usage

spark.read.format("s3select").schema(...).options(...).load("s3://bucket/filename")
CREATE TABLE name (...) USING S3SELECT LOCATION 's3://bucket/filename' [ OPTIONS (...) ]

If the filename extension is .csv or .json, the format is automatically detected; otherwise you must provide the FileFormat option.
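
For an object whose key lacks a recognizable extension, the format must be named explicitly. A minimal sketch, with a hypothetical bucket, key, and schema:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

val df = spark.read
  .format("s3select")
  .schema(schema)
  .option("FileFormat", "csv")  // required here because the key has no .csv or .json extension
  .load("s3://my-bucket/exports/part-00000")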

Options

This section describes options for all file types and options specific to CSV and JSON.

Generic options

Option name     | Default value | Description
FileFormat      | 'auto'        | Input file type ('auto', 'csv', or 'json')
CompressionType | 'none'        | Compression codec used by the input file ('none' or 'gzip')
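
Both generic options are passed through the reader's option() method (or the OPTIONS clause of CREATE TABLE). A sketch for a gzip-compressed JSON object, with hypothetical names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val eventSchema = StructType(Seq(
  StructField("event", StringType),
  StructField("user", StringType)))

val events = spark.read
  .format("s3select")
  .schema(eventSchema)
  .option("FileFormat", "json")        // set explicitly rather than relying on extension detection
  .option("CompressionType", "gzip")
  .load("s3://my-bucket/logs/events.json.gz")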

CSV specific options

Option name                | Default value | Description
NullValue                  | ''            | Character string representing null values in the input
Header                     | false         | Whether to skip the first line of the input (potential header contents are ignored)
Comment                    | '#'           | Lines starting with the value of this parameter are ignored
RecordDelimiter            | '\n'          | Character separating records in a file
Delimiter                  | ','           | Character separating fields within a record
Quote                      | '"'           | Character used to quote values containing reserved characters
Escape                     | '"'           | Character used to escape the quote character inside quoted values
AllowQuotedRecordDelimiter | false         | Whether values can contain quoted record delimiters
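
As an illustration, a tab-separated file with a header line and "NULL" markers could be read with a sketch like the following (bucket, key, and column names are hypothetical):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

val people = spark.read
  .format("s3select")
  .schema(peopleSchema)
  .option("Header", "true")      // skip the first line; its contents are ignored
  .option("Delimiter", "\t")
  .option("NullValue", "NULL")
  .load("s3://my-bucket/people.csv")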

JSON specific options

Option name | Default value | Description
Type        | 'document'    | Type of input ('document' or 'lines')
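
For newline-delimited JSON (one record per line), set Type to 'lines'; the default 'document' treats the object as a single JSON document. A sketch with hypothetical names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val logSchema = StructType(Seq(
  StructField("level", StringType),
  StructField("message", StringType)))

val logs = spark.read
  .format("s3select")
  .schema(logSchema)
  .option("Type", "lines")  // one JSON record per line instead of a single document
  .load("s3://my-bucket/logs/app.json")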

S3 authentication

You can use the S3 authentication methods (keys and instance profiles) available in Databricks; we recommend that you use instance profiles. There are three ways of providing the credentials:

  1. Default Credential Provider Chain (recommended option): AWS credentials are automatically retrieved through the DefaultAWSCredentialsProviderChain. If you use instance profiles to authenticate to S3, use this method. The other methods of providing credentials (methods 2 and 3) take precedence over this default.

  2. Set keys in Hadoop conf: Specify AWS keys in Hadoop configuration properties.

    Important

    • When using AWS keys to access S3, always set the configuration properties fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey as shown in the following example; the properties fs.s3a.access.key and fs.s3a.secret.key are not supported.

    • To reference the s3a:// filesystem, set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties in a Hadoop XML configuration file or call sc.hadoopConfiguration.set() to set Spark’s global Hadoop configuration.

      // Scala: set the keys on the SparkContext's Hadoop configuration
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "$AccessKey")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "$SecretKey")
      
      # Python: the Hadoop configuration is reached through the JVM gateway
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
      
  3. Encode keys in URI: For example, the URI s3a://$AccessKey:$SecretKey@bucket/path/to/dir encodes the key pair (AccessKey, SecretKey).