Auto Loader options
Configuration options specific to the `cloudFiles` source are prefixed with `cloudFiles` so that they are in a separate namespace from other Structured Streaming source options.
Common Auto Loader options
You can configure the following options for directory listing or file notification mode.
| Option | Description |
|---|---|
| `cloudFiles.allowOverwrites` | Type: `Boolean`. Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above. There are a few caveats regarding enabling this config; refer to the Auto Loader FAQ for details. Default value: `false` |
| `cloudFiles.backfillInterval` | Type: `Interval String`. Auto Loader can trigger asynchronous backfills at a given interval, for example `1 day` to backfill once a day, or `1 week` to backfill once a week. Default value: None |
| `cloudFiles.format` | Type: `String`. The data file format in the source path. Allowed values include `avro`, `binaryFile`, `csv`, `json`, `orc`, `parquet`, and `text`. Default value: None (required option) |
| `cloudFiles.includeExistingFiles` | Type: `Boolean`. Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect. Default value: `true` |
| `cloudFiles.inferColumnTypes` | Type: `Boolean`. Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See schema inference for more details. Default value: `false` |
| `cloudFiles.maxBytesPerTrigger` | Type: `Byte String`. The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as `10g` to limit each microbatch to 10 GB of data. Default value: None |
| `cloudFiles.maxFileAge` | Type: `Interval String`. How long a file event is tracked for deduplication purposes. Databricks does not recommend tuning this parameter unless you are ingesting data at the order of millions of files an hour. See the section on Event retention for more details. Default value: None |
| `cloudFiles.maxFilesPerTrigger` | Type: `Integer`. The maximum number of new files to be processed in every trigger. When used together with `cloudFiles.maxBytesPerTrigger`, Auto Loader consumes up to whichever limit is reached first. Default value: 1000 |
| `cloudFiles.partitionColumns` | Type: `String`. A comma-separated list of Hive style partition columns that you would like inferred from the directory structure of the files. Hive style partition columns are key-value pairs combined by an equality sign such as `<base-path>/a=x/b=1/c=y2`. Default value: None |
| `cloudFiles.schemaEvolutionMode` | Type: `String`. The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See schema evolution for more details. Default value: `addNewColumns` when a schema is not provided, `none` otherwise |
| `cloudFiles.schemaHints` | Type: `String`. Schema information that you provide to Auto Loader during schema inference. See schema hints for more details. Default value: None |
| `cloudFiles.schemaLocation` | Type: `String`. The location to store the inferred schema and subsequent changes. See schema inference for more details. Default value: None (required when inferring the schema) |
| `cloudFiles.useStrictGlobber` | Type: `Boolean`. Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See Common data loading patterns for more details. Available in Databricks Runtime 12.0 and above. Default value: `false` |
| `cloudFiles.validateOptions` | Type: `Boolean`. Whether to validate Auto Loader options and return an error for unknown or inconsistent options. Default value: `true` |
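These options are typically passed as a plain options map on the `cloudFiles` reader. The sketch below builds such a map in Python; the schema path and option values are hypothetical, and the `spark.readStream` call is shown only as a comment because it requires a running Spark session:

```python
# Hypothetical Auto Loader option map (all values are illustrative only).
common_options = {
    "cloudFiles.format": "json",                   # required: the source file format
    "cloudFiles.schemaLocation": "/tmp/_schemas",  # required when inferring the schema
    "cloudFiles.inferColumnTypes": "true",         # infer exact types instead of strings
    "cloudFiles.includeExistingFiles": "true",     # default: process existing files too
    "cloudFiles.maxFilesPerTrigger": "1000",       # default shown explicitly
}

# All cloudFiles-specific options share the "cloudFiles." prefix, keeping them
# in a separate namespace from other Structured Streaming source options.
assert all(key.startswith("cloudFiles.") for key in common_options)

# Usage sketch (requires a SparkSession):
# df = (spark.readStream.format("cloudFiles")
#         .options(**common_options)
#         .load("/input/path"))
```

Option values are passed as strings here because that is how they arrive when set via `.option(key, value)` calls; Spark coerces them to the documented types.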
Directory listing options
The following options are relevant to directory listing mode.
| Option | Description |
|---|---|
| `cloudFiles.useIncrementalListing` | Type: `String`. Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader makes a best effort to automatically detect whether a given directory is applicable for incremental listing. You can explicitly use the incremental listing or the full directory listing by setting it to `true` or `false` respectively. Works with Azure Data Lake Storage Gen2 (`abfss://`), S3 (`s3://`), and GCS (`gs://`). Available in Databricks Runtime 9.1 LTS and above. Default value: `auto`. Available values: `auto`, `true`, `false` |
File notification options
The following options are relevant to file notification mode.
| Option | Description |
|---|---|
| `cloudFiles.fetchParallelism` | Type: `Integer`. Number of threads to use when fetching messages from the queueing service. Default value: 1 |
| `cloudFiles.pathRewrites` | Type: A JSON string. Required only if you specify a `queueUrl` that receives file notifications from multiple S3 buckets and you want to leverage mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the `bucket/key` path with the mount point. Default value: None |
| `cloudFiles.resourceTag` | Type: `Map(String, String)`. A series of key-value tag pairs to help associate and identify related resources, for example: `cloudFiles.option("cloudFiles.resourceTag.myFirstKey", "myFirstValue")`. For more information on AWS, see Amazon SQS cost allocation tags and Configuring tags for an Amazon SNS topic. (1) For more information on Azure, see Naming Queues and Metadata and the coverage of `properties.labels` in Event Subscriptions. (1) For more information on GCP, see Reporting usage with labels. (1) Default value: None |
| `cloudFiles.useNotifications` | Type: `Boolean`. Whether to use file notification mode to determine when there are new files. If `false`, use directory listing mode. Default value: `false` |
(1) Auto Loader adds the following key-value tag pairs by default on a best-effort basis:

- `vendor`: Databricks
- `path`: The location from where the data is loaded. Unavailable in GCP due to labeling limitations.
- `checkpointLocation`: The location of the stream's checkpoint. Unavailable in GCP due to labeling limitations.
- `streamId`: A globally unique identifier for the stream.

These key names are reserved and you cannot overwrite their values.
File format options
With Auto Loader you can ingest `JSON`, `CSV`, `PARQUET`, `AVRO`, `TEXT`, `BINARYFILE`, and `ORC` files.
Generic options
The following options apply to all file formats.
| Option | Description |
|---|---|
| `ignoreCorruptFiles` | Type: `Boolean`. Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. Observable as `numSkippedCorruptFiles` in the `operationMetrics` column of the Delta Lake history. Default value: `false` |
| `ignoreMissingFiles` | Type: `Boolean`. Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.0 and above. Default value: `false` |
| `modifiedAfter` | Type: `Timestamp String`. An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None |
| `modifiedBefore` | Type: `Timestamp String`. An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None |
| `pathGlobFilter` or `fileNamePattern` | Type: `String`. A potential glob pattern to provide for choosing files. Equivalent to `PATTERN` in `COPY INTO`. Default value: None |
| `recursiveFileLookup` | Type: `Boolean`. Whether to load data recursively within the base directory and skip partition inference. Default value: `false` |
JSON options

| Option | Description |
|---|---|
| `allowBackslashEscapingAnyCharacter` | Type: `Boolean`. Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: `false` |
| `allowComments` | Type: `Boolean`. Whether to allow the use of Java, C, and C++ style comments (`'/'`, `'*'`, and `'//'` varieties) within parsed content or not. Default value: `false` |
| `allowNonNumericNumbers` | Type: `Boolean`. Whether to allow the set of not-a-number (`NaN`) tokens as legal floating number values. Default value: `true` |
| `allowNumericLeadingZeros` | Type: `Boolean`. Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001). Default value: `false` |
| `allowSingleQuotes` | Type: `Boolean`. Whether to allow use of single quotes (apostrophe, character `'`) for quoting strings. Default value: `true` |
| `allowUnquotedControlChars` | Type: `Boolean`. Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: `false` |
| `allowUnquotedFieldNames` | Type: `Boolean`. Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: `false` |
| `badRecordsPath` | Type: `String`. The path to store files for recording the information about bad JSON records. Default value: None |
| `columnNameOfCorruptRecord` | Type: `String`. The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set to `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record` |
| `dateFormat` | Type: `String`. The format for parsing date strings. Default value: `yyyy-MM-dd` |
| `dropFieldIfAllNull` | Type: `Boolean`. Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: `false` |
| `encoding` or `charset` | Type: `String`. The name of the encoding of the JSON files. See `java.nio.charset.Charset` for the list of options. Default value: `UTF-8` |
| `inferTimestamp` | Type: `Boolean`. Whether to try and infer timestamp strings as a `TimestampType`. Default value: `false` |
| `lineSep` | Type: `String`. A string between two consecutive JSON records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `locale` | Type: `String`. A `java.util.Locale` identifier. Default value: `US` |
| `mode` | Type: `String`. Parser mode around handling malformed records. One of `PERMISSIVE`, `DROPMALFORMED`, or `FAILFAST`. Default value: `PERMISSIVE` |
| `multiLine` | Type: `Boolean`. Whether the JSON records span multiple lines. Default value: `false` |
| `prefersDecimal` | Type: `Boolean`. Attempts to infer strings as `DecimalType` instead of float or double type when possible. Default value: `false` |
| `primitivesAsString` | Type: `Boolean`. Whether to infer primitive types like numbers and booleans as `StringType`. Default value: `false` |
| `rescuedDataColumn` | Type: `String`. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
| `timestampFormat` | Type: `String`. The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` |
| `timeZone` | Type: `String`. The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None |
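As a sketch, format-level JSON options are passed on the same reader as the `cloudFiles.*` options. The snippet below mixes both kinds in one map; the schema path and values are hypothetical:

```python
# Hypothetical JSON ingestion options: cloudFiles.* options and JSON
# format options sit side by side in the same reader configuration.
json_reader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "/tmp/_schemas/json",  # hypothetical path
    "multiLine": "false",  # default: one JSON record per line
    "mode": "PERMISSIVE",  # default parser mode for malformed records
    "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]",  # default format
}

# Split the map to see which options are Auto Loader specific and which
# belong to the JSON parser itself.
cloud_opts = {k: v for k, v in json_reader_options.items()
              if k.startswith("cloudFiles.")}
format_opts = {k: v for k, v in json_reader_options.items()
               if not k.startswith("cloudFiles.")}
assert set(format_opts) == {"multiLine", "mode", "timestampFormat"}
```

In a real pipeline the whole map would be applied with `spark.readStream.format("cloudFiles").options(**json_reader_options)`.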
CSV options

| Option | Description |
|---|---|
| `badRecordsPath` | Type: `String`. The path to store files for recording the information about bad CSV records. Default value: None |
| `charToEscapeQuoteEscaping` | Type: `Char`. The character used to escape the character used for escaping quotes. For example, for the record `[ " a\\", b ]`: if the character to escape the `'\'` is undefined, the record won't be parsed. Default value: `'\0'` |
| `columnNameOfCorruptRecord` | Type: `String`. A column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set to `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record` |
| `comment` | Type: `Char`. Defines the character that represents a line comment when found in the beginning of a line of text. Use `'\0'` to disable comment skipping. Default value: `'\u0000'` |
| `dateFormat` | Type: `String`. The format for parsing date strings. Default value: `yyyy-MM-dd` |
| `emptyValue` | Type: `String`. String representation of an empty value. Default value: `""` |
| `encoding` or `charset` | Type: `String`. The name of the encoding of the CSV files. See `java.nio.charset.Charset` for the list of options. Default value: `UTF-8` |
| `enforceSchema` | Type: `Boolean`. Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: `true` |
| `escape` | Type: `Char`. The escape character to use when parsing the data. Default value: `'\'` |
| `header` | Type: `Boolean`. Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. Default value: `false` |
| `ignoreLeadingWhiteSpace` | Type: `Boolean`. Whether to ignore leading whitespaces for each parsed value. Default value: `false` |
| `ignoreTrailingWhiteSpace` | Type: `Boolean`. Whether to ignore trailing whitespaces for each parsed value. Default value: `false` |
| `inferSchema` | Type: `Boolean`. Whether to infer the data types of the parsed CSV records or to assume all columns are of `StringType`. Requires an additional pass over the data if set to `true`. Default value: `false` |
| `lineSep` | Type: `String`. A string between two consecutive CSV records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `locale` | Type: `String`. A `java.util.Locale` identifier. Default value: `US` |
| `maxCharsPerColumn` | Type: `Int`. Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to `-1`, which means unlimited. Default value: `-1` |
| `maxColumns` | Type: `Int`. The hard limit of how many columns a record can have. Default value: `20480` |
| `mergeSchema` | Type: `Boolean`. Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: `false` |
| `mode` | Type: `String`. Parser mode around handling malformed records. One of `PERMISSIVE`, `DROPMALFORMED`, or `FAILFAST`. Default value: `PERMISSIVE` |
| `multiLine` | Type: `Boolean`. Whether the CSV records span multiple lines. Default value: `false` |
| `nanValue` | Type: `String`. The string representation of a not-a-number value when parsing `FloatType` and `DoubleType` columns. Default value: `"NaN"` |
| `negativeInf` | Type: `String`. The string representation of negative infinity when parsing `FloatType` and `DoubleType` columns. Default value: `"-Inf"` |
| `nullValue` | Type: `String`. String representation of a null value. Default value: `""` |
| `parserCaseSensitive` (deprecated) | Type: `Boolean`. While reading files, whether to align columns declared in the header with the schema case sensitively. This is `true` by default for Auto Loader. This option is deprecated in favor of `readerCaseSensitive`. Default value: `false` |
| `positiveInf` | Type: `String`. The string representation of positive infinity when parsing `FloatType` and `DoubleType` columns. Default value: `"Inf"` |
| `preferDate` | Type: `Boolean`. Attempts to infer strings as dates instead of timestamp when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader. Default value: `true` |
| `quote` | Type: `Char`. The character used for escaping values where the field delimiter is part of the value. Default value: `"` |
| `readerCaseSensitive` | Type: `Boolean`. Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. Default value: `true` |
| `rescuedDataColumn` | Type: `String`. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
| `sep` or `delimiter` | Type: `String`. The separator string between columns. Default value: `,` |
| `skipRows` | Type: `Int`. The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If `header` is true, the header will be the first unskipped and uncommented row. Default value: `0` |
| `timestampFormat` | Type: `String`. The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` |
| `timeZone` | Type: `String`. The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None |
| `unescapedQuoteHandling` | Type: `String`. The strategy for handling unescaped quotes. Allowed options: `STOP_AT_CLOSING_QUOTE`, `BACK_TO_DELIMITER`, `STOP_AT_DELIMITER`, `SKIP_VALUE`, `RAISE_ERROR`. Default value: `STOP_AT_DELIMITER` |
PARQUET options

| Option | Description |
|---|---|
| `datetimeRebaseMode` | Type: `String`. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |
| `int96RebaseMode` | Type: `String`. Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |
| `mergeSchema` | Type: `Boolean`. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false` |
| `readerCaseSensitive` | Type: `Boolean`. Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. Default value: `true` |
| `rescuedDataColumn` | Type: `String`. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
AVRO options

| Option | Description |
|---|---|
| `avroSchema` | Type: `String`. Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible with, but different from, the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None |
| `datetimeRebaseMode` | Type: `String`. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |
| `mergeSchema` | Type: `Boolean`. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false` |
| `readerCaseSensitive` | Type: `Boolean`. Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. Default value: `true` |
| `rescuedDataColumn` | Type: `String`. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options

| Option | Description |
|---|---|
| `encoding` | Type: `String`. The name of the encoding of the TEXT files. See `java.nio.charset.Charset` for the list of options. Default value: `UTF-8` |
| `lineSep` | Type: `String`. A string between two consecutive TEXT records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `wholeText` | Type: `Boolean`. Whether to read a file as a single record. Default value: `false` |
ORC options

| Option | Description |
|---|---|
| `mergeSchema` | Type: `Boolean`. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false` |
Cloud specific options
Auto Loader provides a number of options for configuring cloud infrastructure.
AWS specific options
Provide the following option only if you choose `cloudFiles.useNotifications = true` and you want Auto Loader to set up the notification services for you:

| Option | Description |
|---|---|
| `cloudFiles.region` | Type: `String`. The region where the source S3 bucket resides and where the AWS SNS and SQS services will be created. Default value: In Databricks Runtime 9.0 and above, the region of the EC2 instance. In Databricks Runtime 8.4 and below you must specify the region. |
Provide the following option only if you choose `cloudFiles.useNotifications = true` and you want Auto Loader to use a queue that you have already set up:

| Option | Description |
|---|---|
| `cloudFiles.queueUrl` | Type: `String`. The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own AWS SNS and SQS services. Default value: None |
You can use the following options to provide credentials to access AWS SNS and SQS when IAM roles are not available or when you’re ingesting data from different clouds.
| Option | Description |
|---|---|
| `cloudFiles.awsAccessKey` | Type: `String`. The AWS access key ID for the user. Must be provided with `cloudFiles.awsSecretKey`. Default value: None |
| `cloudFiles.awsSecretKey` | Type: `String`. The AWS secret access key for the user. Must be provided with `cloudFiles.awsAccessKey`. Default value: None |
| `cloudFiles.roleArn` | Type: `String`. The ARN of an IAM role to assume. The role can be assumed from your cluster's instance profile or by providing credentials with `cloudFiles.awsAccessKey` and `cloudFiles.awsSecretKey`. Default value: None |
| `cloudFiles.roleExternalId` | Type: `String`. An identifier to provide while assuming a role using `cloudFiles.roleArn`. Default value: None |
| `cloudFiles.roleSessionName` | Type: `String`. An optional session name to use while assuming a role using `cloudFiles.roleArn`. Default value: None |
| `cloudFiles.stsEndpoint` | Type: `String`. An optional endpoint to provide for accessing AWS STS when assuming a role using `cloudFiles.roleArn`. Default value: None |
Azure specific options
You must provide values for all of the following options if you specify `cloudFiles.useNotifications = true` and you want Auto Loader to set up the notification services for you:

| Option | Description |
|---|---|
| `cloudFiles.clientId` | Type: `String`. The client ID or application ID of the service principal. Default value: None |
| `cloudFiles.clientSecret` | Type: `String`. The client secret of the service principal. Default value: None |
| `cloudFiles.connectionString` | Type: `String`. The connection string for the storage account, based on either account access key or shared access signature (SAS). Default value: None |
| `cloudFiles.resourceGroup` | Type: `String`. The Azure Resource Group under which the storage account is created. Default value: None |
| `cloudFiles.subscriptionId` | Type: `String`. The Azure Subscription ID under which the resource group is created. Default value: None |
| `cloudFiles.tenantId` | Type: `String`. The Azure Tenant ID under which the service principal is created. Default value: None |
Important

Automated notification setup is available in Azure China and Government regions with Databricks Runtime 9.1 and later. For older Databricks Runtime versions in these regions, you must provide a `queueName` to use Auto Loader with file notifications.
Provide the following option only if you choose `cloudFiles.useNotifications = true` and you want Auto Loader to use a queue that you have already set up:

| Option | Description |
|---|---|
| `cloudFiles.queueName` | Type: `String`. The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting up its own Azure Event Grid and Queue Storage services. In that case, your `cloudFiles.connectionString` requires only read permissions on the queue. Default value: None |
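A sketch of the full Azure option set for automated notification setup. Every value below is a placeholder:

```python
# Hypothetical Azure file-notification setup: the service-principal and
# storage options are all required when Auto Loader creates the Event Grid
# subscription and queue for you. All values are placeholders.
azure_notification_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.clientId": "<service-principal-client-id>",
    "cloudFiles.clientSecret": "<service-principal-secret>",
    "cloudFiles.connectionString": "<storage-account-connection-string>",
    "cloudFiles.resourceGroup": "<resource-group>",
    "cloudFiles.subscriptionId": "<subscription-id>",
    "cloudFiles.tenantId": "<tenant-id>",
}

required = {
    "cloudFiles.clientId", "cloudFiles.clientSecret",
    "cloudFiles.connectionString", "cloudFiles.resourceGroup",
    "cloudFiles.subscriptionId", "cloudFiles.tenantId",
}
# All required Azure options are present when useNotifications is true.
assert required <= set(azure_notification_options)
```

If you instead point Auto Loader at an existing queue via `cloudFiles.queueName`, only `cloudFiles.connectionString` (with read permissions on the queue) is needed.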
Google specific options
Auto Loader can automatically set up notification services for you by leveraging Google Service Accounts. You can configure your cluster to assume a service account by following Google service setup. The permissions that your service account needs are specified in What is Auto Loader file notification mode?. Otherwise, you can provide the following options for authentication if you want Auto Loader to set up the notification services for you.
| Option | Description |
|---|---|
| `cloudFiles.client` | Type: `String`. The client ID of the Google Service Account. Default value: None |
| `cloudFiles.clientEmail` | Type: `String`. The email of the Google Service Account. Default value: None |
| `cloudFiles.privateKey` | Type: `String`. The private key that's generated for the Google Service Account. Default value: None |
| `cloudFiles.privateKeyId` | Type: `String`. The ID of the private key that's generated for the Google Service Account. Default value: None |
| `cloudFiles.projectId` | Type: `String`. The ID of the project that the GCS bucket is in. The Google Cloud Pub/Sub subscription will also be created within this project. Default value: None |
Provide the following option only if you choose `cloudFiles.useNotifications = true` and you want Auto Loader to use a queue that you have already set up:

| Option | Description |
|---|---|
| `cloudFiles.subscription` | Type: `String`. The name of the Google Cloud Pub/Sub subscription. If provided, the cloud files source consumes events from this queue instead of setting up its own GCS Notification and Google Cloud Pub/Sub services. Default value: None |
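A sketch pointing Auto Loader at an existing Pub/Sub subscription rather than letting it create the notification services. Every value below is a placeholder:

```python
# Hypothetical GCP configuration consuming from an existing Pub/Sub
# subscription. All values are placeholders.
gcp_queue_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.projectId": "<gcp-project-id>",
    "cloudFiles.subscription": "<pubsub-subscription-name>",
}

# When a subscription is provided, Auto Loader consumes from it directly
# rather than setting up GCS Notifications and Pub/Sub itself.
assert "cloudFiles.subscription" in gcp_queue_options
```

If the cluster already assumes a service account with the required permissions, the `cloudFiles.client*`/`privateKey*` authentication options above can be omitted.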