Auto Loader options

Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they are in a separate namespace from other Structured Streaming source options.

Common Auto Loader options

You can configure the following options for directory listing or file notification mode.

Option

cloudFiles.allowOverwrites

Type: Boolean

Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above.

Default value: false

cloudFiles.backfillInterval

Type: Interval String

Auto Loader can trigger asynchronous backfills at a given interval, for example 1 day to backfill once a day, or 1 week to backfill once a week. File event notification systems do not guarantee 100% delivery of all uploaded files, so you can use backfills to guarantee that all files are eventually processed. This use is available in Databricks Runtime 8.4 (Unsupported) and above. If you use incremental listing, you can also use regular backfills to guarantee eventual completeness. This use is available in Databricks Runtime 9.1 LTS and above.

Default value: None

cloudFiles.format

Type: String

The data file format in the source path. Allowed values include avro, binaryFile, csv, json, orc, parquet, and text.

Default value: None (required option)

cloudFiles.includeExistingFiles

Type: Boolean

Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect.

Default value: true

cloudFiles.inferColumnTypes

Type: Boolean

Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See schema inference for more details.

Default value: false

cloudFiles.maxBytesPerTrigger

Type: Byte String

The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as 10g to limit each microbatch to 10 GB of data. This is a soft maximum: with a 10g limit and files that are 3 GB each, Databricks processes 12 GB in a microbatch. When used together with cloudFiles.maxFilesPerTrigger, Databricks consumes up to the lower limit of cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger, whichever is reached first. This option has no effect when used with Trigger.Once().

Default value: None
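The soft maximum described above can be illustrated with a small model. This is only a sketch under an assumption: files are admitted one at a time until the running total meets or exceeds the byte limit, or the file count limit is reached. The function plan_microbatch is hypothetical and is not part of Auto Loader; it only reproduces the 3 GB / 10g arithmetic from the description.

```python
GB = 1024 ** 3  # assumption: "10g" is treated as 10 * 1024^3 bytes for this illustration


def plan_microbatch(file_sizes, max_bytes=None, max_files=None):
    """Pick files for one micro-batch under a soft byte limit and a file-count limit."""
    batch, total = [], 0
    for size in file_sizes:
        # Stop once the file-count limit is reached.
        if max_files is not None and len(batch) >= max_files:
            break
        # Stop once the byte limit has already been met or exceeded.
        if max_bytes is not None and total >= max_bytes:
            break
        batch.append(size)
        total += size
    return batch, total


files = [3 * GB] * 10

# Byte limit only: the fourth 3 GB file pushes the batch past 10 GB.
batch, total = plan_microbatch(files, max_bytes=10 * GB)
print(len(batch), total // GB)  # 4 12

# Both limits: the file-count limit is reached first.
batch, total = plan_microbatch(files, max_bytes=10 * GB, max_files=2)
print(len(batch), total // GB)  # 2 6
```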

cloudFiles.maxFileAge

Type: Interval String

How long a file event is tracked for deduplication purposes. Databricks does not recommend tuning this parameter unless you are ingesting data on the order of millions of files an hour. See the section on Event retention for more details.

Default value: None

cloudFiles.maxFilesPerTrigger

Type: Integer

The maximum number of new files to be processed in every trigger. When used together with cloudFiles.maxBytesPerTrigger, Databricks consumes up to the lower limit of cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger, whichever is reached first. This option has no effect when used with Trigger.Once().

Default value: 1000

cloudFiles.partitionColumns

Type: String

A comma-separated list of Hive-style partition columns that you would like inferred from the directory structure of the files. Hive-style partition columns are key-value pairs combined by an equality sign, such as <base_path>/a=x/b=1/c=y/file.format. In this example, the partition columns are a, b, and c. By default, these columns are automatically added to your schema if you are using schema inference and provide the <base_path> to load data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want these columns as part of your schema, you can specify "" to ignore them. In addition, you can use this option when you want columns to be inferred from the file path in complex directory structures, like the example below:

<base_path>/year=2022/week=1/file1.csv
<base_path>/year=2022/month=2/day=3/file2.csv
<base_path>/year=2022/month=2/day=4/file3.csv

Specifying cloudFiles.partitionColumns as year,month,day will return year=2022 for file1.csv, but the month and day columns will be null. month and day will be parsed correctly for file2.csv and file3.csv.

Default value: None
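The year,month,day example above can be reproduced with a short sketch. The function extract_partition_columns and the /data base path are hypothetical; the code only mimics the described outcome (missing keys become null) and returns every value as a string, whereas Auto Loader may infer other types.

```python
def extract_partition_columns(path, base_path, columns):
    """Read Hive-style key=value directory segments and keep only the requested columns."""
    relative = path[len(base_path):].strip("/")
    found = {}
    # The last segment is the file name, so only directory segments are inspected.
    for segment in relative.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            found[key] = value
    # Requested columns that are absent from the path come back as None (null).
    return {column: found.get(column) for column in columns}


base = "/data"  # hypothetical stand-in for <base_path>
wanted = ["year", "month", "day"]

print(extract_partition_columns("/data/year=2022/week=1/file1.csv", base, wanted))
# {'year': '2022', 'month': None, 'day': None}
print(extract_partition_columns("/data/year=2022/month=2/day=3/file2.csv", base, wanted))
# {'year': '2022', 'month': '2', 'day': '3'}
```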

cloudFiles.schemaEvolutionMode

Type: String

The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See schema evolution for more details.

Default value: "addNewColumns" when a schema is not provided. "none" otherwise.

cloudFiles.schemaHints

Type: String

Schema information that you provide to Auto Loader during schema inference. See schema hints for more details.

Default value: None

cloudFiles.schemaLocation

Type: String

The location to store the inferred schema and subsequent changes. See schema inference for more details.

Default value: None (required when inferring the schema)

cloudFiles.validateOptions

Type: Boolean

Whether to validate Auto Loader options and return an error for unknown or inconsistent options.

Default value: true
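As a minimal sketch of how the common options fit together, the helper below builds an options dictionary with the cloudFiles prefix applied. The helper name auto_loader_options and the paths are hypothetical. The commented-out reader call assumes a cluster with an active SparkSession named spark and is not executed here.

```python
def auto_loader_options(file_format, schema_location=None, **extra):
    """Return a dict of Auto Loader options with the cloudFiles. prefix applied."""
    if not file_format:
        raise ValueError("cloudFiles.format is required")
    options = {"cloudFiles.format": file_format}
    if schema_location is not None:
        options["cloudFiles.schemaLocation"] = schema_location
    for name, value in extra.items():
        # Options are passed as strings here; booleans become "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        options[f"cloudFiles.{name}"] = str(value)
    return options


opts = auto_loader_options(
    "json",
    schema_location="/tmp/example/_schemas",  # hypothetical path
    inferColumnTypes=True,
    maxFilesPerTrigger=500,
)
print(opts)

# With an active SparkSession, the dict could be handed to the stream reader:
# df = spark.readStream.format("cloudFiles").options(**opts).load("/tmp/example/input")
```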

Directory listing options

The following options are relevant to directory listing mode.

Option

cloudFiles.useIncrementalListing

Type: String

Whether to use incremental listing rather than full listing in directory listing mode. By default, Auto Loader makes a best effort to automatically detect whether a given directory is suitable for incremental listing. You can explicitly use incremental listing or full directory listing by setting this option to true or false, respectively.

Works with Azure Data Lake Storage Gen2 (abfss://), S3 (s3://), and GCS (gs://).

Available in Databricks Runtime 9.1 LTS and above.

Default value: auto

Available values: auto, true, false

File notification options

The following options are relevant to file notification mode.

Option

cloudFiles.fetchParallelism

Type: Integer

Number of threads to use when fetching messages from the queueing service.

Default value: 1

cloudFiles.pathRewrites

Type: A JSON string

Required only if you specify a queueUrl that receives file notifications from multiple S3 buckets and you want to leverage mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the bucket/key path with the mount point. Only prefixes can be rewritten. For example, for the configuration {"<databricks-mounted-bucket>/path": "dbfs:/mnt/data-warehouse"}, the path s3://<databricks-mounted-bucket>/path/2017/08/fileA.json is rewritten to dbfs:/mnt/data-warehouse/2017/08/fileA.json.

Default value: None
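The prefix rewrite described above can be sketched as follows. The function rewrite_path and the bucket name my-bucket are hypothetical. Matching only on whole path segments is an assumption of this sketch, not documented behavior.

```python
import json


def rewrite_path(s3_path, path_rewrites_json):
    """Replace a matching bucket/key prefix with its configured mount point."""
    rewrites = json.loads(path_rewrites_json)
    # Drop the scheme so the remainder looks like "<bucket>/<key>".
    _, _, bucket_and_key = s3_path.partition("://")
    for prefix, mount in rewrites.items():
        prefix = prefix.rstrip("/")
        # Assumption: a prefix matches only on a path-segment boundary.
        if bucket_and_key == prefix or bucket_and_key.startswith(prefix + "/"):
            return mount + bucket_and_key[len(prefix):]
    # Paths without a matching prefix are returned unchanged.
    return s3_path


rewrites = '{"my-bucket/path": "dbfs:/mnt/data-warehouse"}'
print(rewrite_path("s3://my-bucket/path/2017/08/fileA.json", rewrites))
# dbfs:/mnt/data-warehouse/2017/08/fileA.json
```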

cloudFiles.resourceTags

Type: Map(String, String)

A series of key-value tag pairs to help associate and identify related resources, for example:

cloudFiles.option("cloudFiles.resourceTag.myFirstKey", "myFirstValue")
          .option("cloudFiles.resourceTag.mySecondKey", "mySecondValue")

For more information on AWS, see Amazon SQS cost allocation tags and Configuring tags for an Amazon SNS topic. (1)

For more information on Azure, see Naming Queues and Metadata and the coverage of properties.labels in Event Subscriptions. Auto Loader stores these key-value tag pairs in JSON as labels. (1)

For more information on GCP, see Reporting usage with labels. (1)

Default value: None

cloudFiles.useNotifications

Type: Boolean

Whether to use file notification mode to determine when there are new files. If false, use directory listing mode. See How Auto Loader works.

Default value: false

(1) Auto Loader adds the following key-value tag pairs by default on a best-effort basis:

  • vendor: Databricks

  • path: The location from where the data is loaded. Unavailable in GCP due to labeling limitations.

  • checkpointLocation: The location of the stream’s checkpoint. Unavailable in GCP due to labeling limitations.

  • streamId: A globally unique identifier for the stream.

These key names are reserved and you cannot overwrite their values.
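A small helper can turn a dictionary of tags into cloudFiles.resourceTag options and reject the reserved key names listed above before a stream is started. The helper resource_tag_options is hypothetical. Raising an error on reserved keys is a choice made for this sketch, not a description of how Auto Loader itself reacts.

```python
RESERVED_TAG_KEYS = {"vendor", "path", "checkpointLocation", "streamId"}


def resource_tag_options(tags):
    """Map user tags to cloudFiles.resourceTag.<key> options, refusing reserved keys."""
    options = {}
    for key, value in tags.items():
        if key in RESERVED_TAG_KEYS:
            raise ValueError(f"'{key}' is a reserved resource tag key")
        options[f"cloudFiles.resourceTag.{key}"] = value
    return options


print(resource_tag_options({"myFirstKey": "myFirstValue", "mySecondKey": "mySecondValue"}))
# {'cloudFiles.resourceTag.myFirstKey': 'myFirstValue',
#  'cloudFiles.resourceTag.mySecondKey': 'mySecondValue'}
```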

File format options

With Auto Loader you can ingest JSON, CSV, PARQUET, AVRO, TEXT, BINARYFILE, and ORC files.

Generic options

The following options apply to all file formats.

Option

ignoreCorruptFiles

Type: Boolean

Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. Observable as numSkippedCorruptFiles in the operationMetrics column of the Delta Lake history. Available in Databricks Runtime 11.0 and above.

Default value: false

ignoreMissingFiles

Type: Boolean

Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.0 and above.

Default value: false (true for COPY INTO)

modifiedAfter

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.

Default value: None

modifiedBefore

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.

Default value: None
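The effect of modifiedAfter and modifiedBefore can be pictured as a time window applied to each file's modification timestamp. The function in_modification_window is hypothetical. Treating both bounds as exclusive is an assumption of this sketch.

```python
from datetime import datetime, timezone


def in_modification_window(modified, modified_after=None, modified_before=None):
    """Return True if a file's modification time falls inside the configured window."""
    # Assumption: both bounds are exclusive.
    if modified_after is not None and not modified > modified_after:
        return False
    if modified_before is not None and not modified < modified_before:
        return False
    return True


after = datetime(2021, 1, 1, tzinfo=timezone.utc)
before = datetime(2022, 1, 1, tzinfo=timezone.utc)

print(in_modification_window(datetime(2021, 6, 1, tzinfo=timezone.utc), after, before))   # True
print(in_modification_window(datetime(2020, 12, 31, tzinfo=timezone.utc), after, before))  # False
```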

pathGlobFilter

Type: String

A potential glob pattern to provide for choosing files. Equivalent to PATTERN in COPY INTO.

Default value: None

recursiveFileLookup

Type: Boolean

Whether to load data recursively within the base directory and skip partition inference.

Default value: false

JSON options

Option

allowBackslashEscapingAnyCharacter

Type: Boolean

Whether to allow backslashes to escape any character that follows them. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped.

Default value: false

allowComments

Type: Boolean

Whether to allow the use of Java, C, and C++ style comments (the '/* */' and '//' varieties) within parsed content.

Default value: false

allowNonNumericNumbers

Type: Boolean

Whether to allow the set of not-a-number (NaN) tokens as legal floating number values.

Default value: true

allowNumericLeadingZeros

Type: Boolean

Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).

Default value: false

allowSingleQuotes

Type: Boolean

Whether to allow use of single quotes (apostrophe, character ') for quoting strings (names and String values).

Default value: true

allowUnquotedControlChars

Type: Boolean

Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.

Default value: false

allowUnquotedFieldNames

Type: Boolean

Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).

Default value: false

badRecordsPath

Type: String

The path to store files for recording the information about bad JSON records.

Default value: None

columnNameOfCorruptRecord

Type: String

The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty.

Default value: _corrupt_record

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

dropFieldIfAllNull

Type: Boolean

Whether to ignore columns of all null values or empty arrays and structs during schema inference.

Default value: false

encoding or charset

Type: String

The name of the encoding of the JSON files. See java.nio.charset.Charset for the list of options. You cannot use UTF-16 and UTF-32 when multiLine is true.

Default value: UTF-8

inferTimestamp

Type: Boolean

Whether to try and infer timestamp strings as a TimestampType. When set to true, schema inference may take noticeably longer.

Default value: false

lineSep

Type: String

A string between two consecutive JSON records.

Default value: None, which covers \r, \r\n, and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.

Default value: US

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE', 'DROPMALFORMED', or 'FAILFAST'.

Default value: PERMISSIVE
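The three parser modes can be sketched with the standard json module. This is a simplified model, not the Spark parser: parse_json_lines is a hypothetical function, each input line is one record, and in PERMISSIVE mode the raw text of a malformed record is kept under the column named by columnNameOfCorruptRecord.

```python
import json

MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def parse_json_lines(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Parse one JSON object per line, handling malformed records according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
            rows.append(record)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                raise  # stop at the first malformed record
            if mode == "PERMISSIVE":
                rows.append({corrupt_column: line})  # keep the raw text
            # DROPMALFORMED: skip the record silently
    return rows


lines = ['{"a": 1}', '{"a": ']
print(parse_json_lines(lines, "PERMISSIVE"))     # [{'a': 1}, {'_corrupt_record': '{"a": '}]
print(parse_json_lines(lines, "DROPMALFORMED"))  # [{'a': 1}]
```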

multiLine

Type: Boolean

Whether the JSON records span multiple lines.

Default value: false

prefersDecimal

Type: Boolean

Whether to infer floats and doubles as DecimalType during schema inference.

Default value: false

primitivesAsString

Type: Boolean

Whether to infer primitive types like numbers and booleans as StringType.

Default value: false

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

CSV options

Option

badRecordsPath

Type: String

The path to store files for recording the information about bad CSV records.

Default value: None

charToEscapeQuoteEscaping

Type: Char

The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ]:

  • If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters: [a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.

  • If the character to escape the '\' is defined as '\', the record will be read with 2 values: [a\] and [b].

Default value: '\0'

columnNameOfCorruptRecord

Type: String

A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty.

Default value: _corrupt_record

comment

Type: Char

Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable comment skipping.

Default value: '\u0000'

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

emptyValue

Type: String

String representation of an empty value.

Default value: ""

encoding or charset

Type: String

The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32 cannot be used when multiLine is true.

Default value: UTF-8

enforceSchema

Type: Boolean

Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.

Default value: true

escape

Type: Char

The escape character to use when parsing the data.

Default value: '\'

header

Type: Boolean

Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.

Default value: false

ignoreLeadingWhiteSpace

Type: Boolean

Whether to ignore leading whitespaces for each parsed value.

Default value: false

ignoreTrailingWhiteSpace

Type: Boolean

Whether to ignore trailing whitespaces for each parsed value.

Default value: false

inferSchema

Type: Boolean

Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType. Requires an additional pass over the data if set to true.

Default value: false

lineSep

Type: String

A string between two consecutive CSV records.

Default value: None, which covers \r, \r\n, and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.

Default value: US

maxCharsPerColumn

Type: Int

Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1, which means unlimited.

Default value: -1

maxColumns

Type: Int

The hard limit of how many columns a record can have.

Default value: 20480

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema.

Default value: false

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE', 'DROPMALFORMED', or 'FAILFAST'.

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the CSV records span multiple lines.

Default value: false

nanValue

Type: String

The string representation of a not-a-number value when parsing FloatType and DoubleType columns.

Default value: "NaN"

negativeInf

Type: String

The string representation of negative infinity when parsing FloatType or DoubleType columns.

Default value: "-Inf"

nullValue

Type: String

String representation of a null value.

Default value: ""

parserCaseSensitive (deprecated)

Type: Boolean

While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been deprecated in favor of readerCaseSensitive.

Default value: false

positiveInf

Type: String

The string representation of positive infinity when parsing FloatType or DoubleType columns.

Default value: "Inf"

quote

Type: Char

The character used for escaping values where the field delimiter is part of the value.

Default value: '"'

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

sep or delimiter

Type: String

The separator string between columns.

Default value: ","

skipRows

Type: Int

The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If header is true, the header will be the first unskipped and uncommented row.

Default value: 0
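The interaction between skipRows, comment lines, and header can be sketched with the standard csv module. The function read_csv_with_skip is hypothetical. Using # as the comment character and dropping empty lines after the skipped block are assumptions of this sketch.

```python
import csv


def read_csv_with_skip(text, skip_rows=0, header=True, comment="#"):
    """Skip leading rows, drop comment and empty lines, then split off the header."""
    # skip_rows counts every physical row, including commented and empty ones.
    lines = text.splitlines()[skip_rows:]
    # Assumption: remaining empty lines and comment lines are ignored.
    lines = [line for line in lines if line.strip() and not line.startswith(comment)]
    rows = list(csv.reader(lines))
    if header and rows:
        # The header is the first unskipped, uncommented row.
        return rows[0], rows[1:]
    return None, rows


text = "# export v1\n\nid,name\n1,a\n2,b\n"
columns, records = read_csv_with_skip(text, skip_rows=2)
print(columns)  # ['id', 'name']
print(records)  # [['1', 'a'], ['2', 'b']]
```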

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

unescapedQuoteHandling

Type: String

The strategy for handling unescaped quotes. Allowed options:

  • STOP_AT_CLOSING_QUOTE: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.

  • BACK_TO_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.

  • STOP_AT_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter defined by sep, or a line ending is found in the input.

  • SKIP_VALUE: If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the next delimiter is found) and the value set in nullValue will be produced instead.

  • RAISE_ERROR: If unescaped quotes are found in the input, a TextParsingException will be thrown.

Default value: STOP_AT_DELIMITER

PARQUET options

Option

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

int96RebaseMode

Type: String

Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

AVRO options

Option

avroSchema

Type: String

Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible with but different from the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too.

Default value: None

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. mergeSchema for Avro does not relax data types.

Default value: false

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

BINARYFILE options

Binary files do not have any additional configuration options.

TEXT options

Option

encoding

Type: String

The name of the encoding of the TEXT files. See java.nio.charset.Charset for list of options.

Default value: UTF-8

lineSep

Type: String

A string between two consecutive TEXT records.

Default value: None, which covers \r, \r\n, and \n

wholeText

Type: Boolean

Whether to read a file as a single record.

Default value: false

ORC options

Option

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false

Cloud specific options

Auto Loader provides a number of options for configuring cloud infrastructure.

AWS specific options

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to set up the notification services for you:

Option

cloudFiles.region

Type: String

The region where the source S3 bucket resides and where the AWS SNS and SQS services will be created.

Default value: In Databricks Runtime 9.0 and above, the region of the EC2 instance. In Databricks Runtime 8.4 and below, you must specify the region.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to use a queue that you have already set up:

Option

cloudFiles.queueUrl

Type: String

The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own AWS SNS and SQS services.

Default value: None

You can use the following options to provide credentials to access AWS SNS and SQS when IAM roles are not available or when you’re ingesting data from different clouds.

Option

cloudFiles.awsAccessKey

Type: String

The AWS access key ID for the user. Must be provided with cloudFiles.awsSecretKey.

Default value: None

cloudFiles.awsSecretKey

Type: String

The AWS secret access key for the user. Must be provided with cloudFiles.awsAccessKey.

Default value: None

cloudFiles.roleArn

Type: String

The ARN of an IAM role to assume. The role can be assumed from your cluster’s instance profile or by providing credentials with cloudFiles.awsAccessKey and cloudFiles.awsSecretKey.

Default value: None

cloudFiles.roleExternalId

Type: String

An identifier to provide while assuming a role using cloudFiles.roleArn.

Default value: None

cloudFiles.roleSessionName

Type: String

An optional session name to use while assuming a role using cloudFiles.roleArn.

Default value: None

cloudFiles.stsEndpoint

Type: String

An optional endpoint to provide for accessing AWS STS when assuming a role using cloudFiles.roleArn.

Default value: None
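The AWS options above can be assembled with a small helper that also enforces the rule that cloudFiles.awsAccessKey and cloudFiles.awsSecretKey must be provided together. The helper aws_notification_options, the region, and the ARN are hypothetical values for illustration. In practice, secrets would come from a secret store rather than literals.

```python
def aws_notification_options(queue_url=None, region=None,
                             access_key=None, secret_key=None, role_arn=None):
    """Build file notification options for AWS from the option names listed above."""
    # The access key and secret key are only valid as a pair.
    if (access_key is None) != (secret_key is None):
        raise ValueError(
            "cloudFiles.awsAccessKey and cloudFiles.awsSecretKey must be provided together"
        )
    options = {"cloudFiles.useNotifications": "true"}
    candidates = {
        "cloudFiles.queueUrl": queue_url,
        "cloudFiles.region": region,
        "cloudFiles.awsAccessKey": access_key,
        "cloudFiles.awsSecretKey": secret_key,
        "cloudFiles.roleArn": role_arn,
    }
    # Only options that were actually supplied are included.
    options.update({name: value for name, value in candidates.items() if value is not None})
    return options


opts = aws_notification_options(
    region="us-west-2",
    role_arn="arn:aws:iam::123456789012:role/example-auto-loader",  # hypothetical ARN
)
print(opts)
```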

Azure specific options

You must provide values for all of the following options if you specify cloudFiles.useNotifications = true and you want Auto Loader to set up the notification services for you:

Option

cloudFiles.clientId

Type: String

The client ID or application ID of the service principal.

Default value: None

cloudFiles.clientSecret

Type: String

The client secret of the service principal.

Default value: None

cloudFiles.connectionString

Type: String

The connection string for the storage account, based on either account access key or shared access signature (SAS).

Default value: None

cloudFiles.resourceGroup

Type: String

The Azure Resource Group under which the storage account is created.

Default value: None

cloudFiles.subscriptionId

Type: String

The Azure Subscription ID under which the resource group is created.

Default value: None

cloudFiles.tenantId

Type: String

The Azure Tenant ID under which the service principal is created.

Default value: None
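Because all six options above are required when Auto Loader sets up the Azure notification services, a pre-flight check can report which ones are missing. The function missing_azure_setup_options and the placeholder values are hypothetical. This is a sketch of a client-side check, not something Auto Loader provides.

```python
AZURE_SETUP_OPTIONS = (
    "cloudFiles.clientId",
    "cloudFiles.clientSecret",
    "cloudFiles.connectionString",
    "cloudFiles.resourceGroup",
    "cloudFiles.subscriptionId",
    "cloudFiles.tenantId",
)


def missing_azure_setup_options(options):
    """Return the required Azure setup options that are absent or empty."""
    return [name for name in AZURE_SETUP_OPTIONS if not options.get(name)]


partial = {
    "cloudFiles.useNotifications": "true",
    "cloudFiles.clientId": "<client-id>",   # placeholder value
    "cloudFiles.tenantId": "<tenant-id>",   # placeholder value
}
print(missing_azure_setup_options(partial))
# ['cloudFiles.clientSecret', 'cloudFiles.connectionString',
#  'cloudFiles.resourceGroup', 'cloudFiles.subscriptionId']
```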

Important

Automated notification setup is available in Azure China and Government regions with Databricks Runtime 9.1 and later. For older Databricks Runtime versions, you must provide a queueName to use Auto Loader with file notifications in these regions.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to use a queue that you have already set up:

Option

cloudFiles.queueName

Type: String

The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting up its own Azure Event Grid and Queue Storage services. In that case, your cloudFiles.connectionString requires only read permissions on the queue.

Default value: None

Google specific options

Auto Loader can automatically set up notification services for you by leveraging Google Service Accounts. You can configure your cluster to assume a service account by following Google service setup. The permissions that your service account needs are specified in Required permissions for setting up file notification resources. Otherwise, you can provide the following options for authentication if you want Auto Loader to set up the notification services for you.

Option

cloudFiles.client

Type: String

The client ID of the Google Service Account.

Default value: None

cloudFiles.clientEmail

Type: String

The email of the Google Service Account.

Default value: None

cloudFiles.privateKey

Type: String

The private key that’s generated for the Google Service Account.

Default value: None

cloudFiles.privateKeyId

Type: String

The ID of the private key that’s generated for the Google Service Account.

Default value: None

cloudFiles.projectId

Type: String

The ID of the project that the GCS bucket is in. The Google Cloud Pub/Sub subscription will also be created within this project.

Default value: None

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to use a queue that you have already set up:

Option

cloudFiles.subscription

Type: String

The name of the Google Cloud Pub/Sub subscription. If provided, the cloud files source consumes events from this queue instead of setting up its own GCS Notification and Google Cloud Pub/Sub services.

Default value: None