Spark API options reference

This page lists available input and output options for Spark APIs that read and write data.

DataFrameReader options

Use these options with DataFrameReader.option(), DataFrameReader.options(), read_files, COPY INTO, and Auto Loader to control how Databricks reads data files.

Example

The following example sets multiLine to True for reading JSON files:

Python
Scala
SQL

Python
df = spark.read.format("json").option("multiLine", True).load("/path/to/data")

Scala
val df = spark.read.format("json").option("multiLine", "true").load("/path/to/data")

SQL
SELECT * FROM read_files("/path/to/data", format => "json", multiLine => true)

Common

The following options apply to all file formats.

Key	Default	Description
`ignoreCorruptFiles`	`false`	Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. For `COPY INTO`, you can observe skipped corrupt files as `numSkippedCorruptFiles` in the `operationMetrics` column of the Delta Lake history. Available in Databricks Runtime 11.3 LTS and above.
`ignoreMissingFiles`	`false` for Auto Loader, `true` for `COPY INTO` (legacy)	Whether to ignore missing files. If true, the Spark jobs continue to run when encountering missing files and the contents are still returned. Available in Databricks Runtime 11.3 LTS and above.
`modifiedAfter`	None	An optional timestamp as a filter to only ingest files that have a modification timestamp after the provided timestamp.
`modifiedBefore`	None	An optional timestamp as a filter to only ingest files that have a modification timestamp before the provided timestamp.
`pathGlobFilter` or `fileNamePattern`	None	A potential glob pattern to provide for choosing files. Equivalent to `PATTERN` in `COPY INTO` (legacy). `fileNamePattern` can be used in `read_files`.
`recursiveFileLookup`	`false`	When `true`, this option searches through nested directories even if their names do not follow a partition naming scheme like `date=2019-07-01`.

Avro

Key	Default	Description
`avroSchema`	None	Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema that is compatible but different from the actual Avro schema. The deserialization schema is consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result contains the new column too.
`avroSchemaEvolutionMode`	`none`	How to handle schema evolution when using a schema registry. Valid values: `none` (ignore schema changes and continue the job), `restart` (when schema changes are detected, raises an `UnknownFieldException` and requires a job restart).
`datetimeRebaseMode`	`LEGACY`	Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Valid values: `EXCEPTION`, `LEGACY`, and `CORRECTED`.
`enableStableIdentifiersForUnionType`	`false`	Whether to use stable field names for Avro Union types. When enabled, union type field names are derived from their type names in lowercase (for example, `member_int`, `member_string`). Throws an exception if two type names are identical after lowercasing.
`mergeSchema`	`false`	Whether to infer the schema across multiple files and to merge the schema of each file. `mergeSchema` for Avro does not relax data types.
`mode`	`FAILFAST`	Parser mode for handling corrupt records. Valid values: `FAILFAST` (throws an exception), `PERMISSIVE` (sets malformed fields to null), `DROPMALFORMED` (silently drops bad records).
`readerCaseSensitive`	`true`	Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema. When false, read the data in a case-insensitive manner.
`recursiveFieldMaxDepth`	None	The maximum recursion depth for recursive Avro fields. Set to `1` to truncate all recursive fields, `2` to allow one level of recursion, and so on up to `15`. When unset or `0`, recursive fields are not permitted. Valid values: `0` to `15`.
`rescuedDataColumn`	None	Whether to collect all data that can't be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. For more details refer to What is the rescued data column?.
`stableIdentifierPrefixForUnionType`	`member_`	The prefix to use for stable union type field names when `enableStableIdentifiersForUnionType=true`.

CSV

Key	Default	Description
`badRecordsPath`	None	The path to store files for recording the information about bad CSV records.
`charToEscapeQuoteEscaping`	`\0`	The character used to escape the character used for escaping quotes. For example, for the following record: `[ " a\\", b ]`: If the character to escape the `'\'` is undefined, the record won't be parsed. The parser will read characters: `[a],[\],["],[,],[ ],[b]` and throw an error because it cannot find a closing quote. If the character to escape the `'\'` is defined as `'\'`, the record will be read with 2 values: `[a\]` and `[b]`.
`columnNameOfCorruptRecord`	`_corrupt_record`	Supported for Auto Loader. Not supported for `COPY INTO` (legacy). The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty.
`comment`	`\0`	Defines the character that represents a line comment when found in the beginning of a line of text. Use `'\0'` to disable comment skipping.
`dateFormat`	`yyyy-MM-dd`	The format for parsing date strings.
`emptyValue`	Empty string	String representation of an empty value.
`enableDateTimeParsingFallback`	`false`	Whether to fall back to the legacy date and timestamp parsing behavior when a value cannot be parsed with the specified format. When `false`, parsing failures raise an error or produce null depending on `mode`.
`encoding` or `charset`	`UTF-8`	The name of the encoding of the CSV files. See `java.nio.charset.Charset` for the list of options. `UTF-16` and `UTF-32` cannot be used when `multiline` is `true`.
`enforceSchema`	`true`	Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.
`escape`	`\`	The escape character to use when parsing the data.
`extension`	`csv`	The expected filename extension. Files without this extension are filtered out during reads.
`failOnUnknownFields`	`false`	Whether to fail when the CSV record contains columns not present in the schema. When `false`, unrecognized columns are silently dropped or rescued depending on `rescuedDataColumn`.
`failOnWidenedFields`	`false`	Whether to fail when a field value cannot be parsed as the declared schema type without widening. When `false`, type-widened values are silently rescued depending on `rescuedDataColumn`. Setting `failOnUnknownFields=true` can mask the effects of this option.
`header`	`false`	Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.
`ignoreLeadingWhiteSpace`	`false`	Whether to ignore leading whitespaces for each parsed value.
`ignoreTrailingWhiteSpace`	`false`	Whether to ignore trailing whitespaces for each parsed value.
`inferSchema`	`false`	Whether to infer the data types of the parsed CSV records or to assume all columns are of `StringType`. Requires an additional pass over the data if set to `true`. For Auto Loader, use `cloudFiles.inferColumnTypes` instead.
`inputBufferSize`	`1048576` (1 MB)	The buffer size in bytes for the CSV parser. Useful for tuning memory usage when parsing large CSV files. Valid values: positive integers.
`lineSep`	None, which covers `\r`, `\r\n`, and `\n`	A string between two consecutive CSV records.
`locale`	`US`	A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the CSV.
`maxCharsPerColumn`	`-1`	Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to `-1`, which means unlimited. Valid values: positive integers, or `-1` for unlimited.
`maxColumns`	`20480`	The hard limit of how many columns a record can have. Valid values: positive integers.
`mergeSchema`	`false`	Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema.
`mode`	`PERMISSIVE`	Parser mode around handling malformed records. Valid values: `PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`.
`multiLine`	`false`	Whether the CSV records span multiple lines.
`nanValue`	`NaN`	The string representation of a non-a-number value when parsing `FloatType` and `DoubleType` columns.
`negativeInf`	`-Inf`	The string representation of negative infinity when parsing `FloatType` or `DoubleType` columns.
`nullValue`	Empty string	String representation of a null value.
`parserCaseSensitive` (deprecated)	`false`	While reading files, whether to align columns declared in the header with the schema case sensitively. This is `true` by default for Auto Loader. Columns that differ by case will be rescued in the `rescuedDataColumn` if enabled. This option has been deprecated in favor of `readerCaseSensitive`.
`positiveInf`	`Inf`	The string representation of positive infinity when parsing `FloatType` or `DoubleType` columns.
`preferDate`	`true`	Attempts to infer strings as dates instead of timestamp when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader.
`quote`	`"`	The character used for escaping values where the field delimiter is part of the value.
`readerCaseSensitive`	`true`	Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema. When false, read the data in a case-insensitive manner.
`rescuedDataColumn`	None	Whether to collect all data that can't be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios.
`sep` or `delimiter`	`,`	The separator string between columns.
`singleVariantColumn`	None	When set to a column name, reads the entire CSV record into a single `VariantType` column with that name instead of parsing each field into its own column. Requires `header=true`.
`skipRows`	`0`	The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If `header` is true, the header will be the first unskipped and uncommented row. Valid values: positive integers or 0.
`timeFormat`	`HH:mm:ss`	The format for parsing `TimeType` column values.
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	The format for parsing timestamp strings.
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	The format for parsing timestamp without timezone (`TimestampNTZType`) strings.
`timeZone`	None	The `java.time.ZoneId` to use when parsing timestamps and dates.
`unescapedQuoteHandling`	`STOP_AT_DELIMITER`	The strategy for handling unescaped quotes. Allowed options: `STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found. `BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter defined by `sep` is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found. `STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter defined by `sep`, or a line ending is found in the input. `SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the next delimiter is found) and the value set in `nullValue` will be produced instead. `RAISE_ERROR`: If unescaped quotes are found in the input, a `TextParsingException` will be thrown.

Excel

Key	Default	Description
`dataAddress`	None	The cell range to read in Excel syntax. If omitted, reads all valid cells from the first sheet. Use `"SheetName!C5:H10"` to read a range from a named sheet, `"C5:H10"` to read a range from the first sheet, or `"SheetName"` to read all data from a specific sheet.
`headerRows`	`0`	Number of initial rows to use as column name headers. When `dataAddress` is specified, this applies within the cell range. When `0`, column names are auto-generated as `_c1`, `_c2`, `_c3`, etc. Valid values: `0`, `1`.
`ignoreMissingSheet`	`false`	Whether to silently skip files that do not contain the sheet specified by `dataAddress`. When `false`, an error is thrown if a file is missing the requested sheet. Only applies when a sheet name is specified in `dataAddress`. Valid values: `true`, `false`.
`includePhoneticRuns`	`false`	Whether to include phonetic annotations (such as pinyin or furigana) concatenated to cell string values when reading XLSX files. Valid values: `true`, `false`.
`operation`	`readSheet`	The operation to perform on the Excel workbook. Valid values: `readSheet` (reads data from a sheet), `listSheets` (returns a struct with fields `sheetIndex: long` and `sheetName: String` for each sheet).
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Custom format string for timestamp-without-timezone values stored as strings in Excel. Custom date formats follow the formats at Datetime patterns.
`dateFormat`	`yyyy-MM-dd`	Custom format string for string values read as `Date`. Custom date formats follow the formats at Datetime patterns.

JSON

Key	Default	Description
`allowBackslashEscapingAnyCharacter`	`false`	Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped.
`allowComments`	`false`	Whether to allow the use of Java, C, and C++ style comments (`'/'`, `'*'`, and `'//'` varieties) within parsed content or not.
`allowNonNumericNumbers`	`true`	Whether to allow the set of not-a-number (`NaN`) tokens as legal floating number values.
`allowNumericLeadingZeros`	`false`	Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, `000001`).
`allowSingleQuotes`	`true`	Whether to allow use of single quotes (apostrophe, character `'\'`) for quoting strings (names and String values).
`allowUnquotedControlChars`	`false`	Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.
`allowUnquotedFieldNames`	`false`	Whether to allow use of unquoted field names, which are allowed by JavaScript, but not by the JSON specification.
`alternateVariantEncoding`	None	The encoding used for Variant values in the source JSON. Set to `Z85` to decode Variant values that have been Base85-encoded instead of stored as inline JSON.
`badRecordsPath`	None	The path to store files for recording the information about bad JSON records. Using the `badRecordsPath` option in a file-based data source has the following limitations: It is non-transactional and can lead to inconsistent results. Transient errors are treated as failures.
`columnNameOfCorruptRecord`	`_corrupt_record`	The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty.
`dateFormat`	`yyyy-MM-dd`	The format for parsing date strings.
`dropFieldIfAllNull`	`false`	Whether to ignore columns of all null values or empty arrays and structs during schema inference.
`encoding` or `charset`	`UTF-8`	The name of the encoding of the JSON files. See `java.nio.charset.Charset` for list of options. You cannot use `UTF-16` and `UTF-32` when `multiline` is `true`.
`inferTimestamp`	`false`	Whether to try and infer timestamp strings as a `TimestampType`. When set to `true`, schema inference might take noticeably longer. You must enable `cloudFiles.inferColumnTypes` to use with Auto Loader.
`lineSep`	None, which covers `\r`, `\r\n`, and `\n`	A string between two consecutive JSON records.
`locale`	`US`	A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the JSON.
`maxNestingDepth`	`500`	The maximum allowed nesting depth for JSON objects and arrays. Increase this value for deeply nested documents. Valid values: positive integers.
`maxNumLen`	`1000`	The maximum length of number tokens in the JSON input. Increase this value for JSON with large numeric literals. Valid values: positive integers.
`maxStringLen`	unlimited	The maximum length of string values in the JSON input. Set to limit memory usage when parsing JSON with large strings. Valid values: positive integers.
`mode`	`PERMISSIVE`	Parser mode around handling malformed records. Valid values: `PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`.
`multiLine`	`false`	Whether the JSON records span multiple lines.
`prefersDecimal`	`false`	Attempts to infer strings as `DecimalType` instead of float or double type when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader.
`primitivesAsString`	`false`	Whether to infer primitive types like numbers and booleans as `StringType`.
`readerCaseSensitive`	`true`	Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema. When false, read the data in a case-insensitive manner. Available in Databricks Runtime 13.3 and above.
`rescuedDataColumn`	None	Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios.
`singleVariantColumn`	None	Whether to ingest the entire JSON document, parsed into a single Variant column with the specified string as the column's name. If not set, the JSON fields are ingested into their own columns. Valid values: any string.
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	The format for parsing timestamp strings.
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	The format for parsing timestamp without timezone (`TimestampNTZType`) strings.
`timeZone`	None	The `java.time.ZoneId` to use when parsing timestamps and dates.
`upgradeExceptionAsBadRecord`	`false`	Whether to treat type upgrade exceptions (for example, when a value can't be widened to the declared column type) as bad records rather than throwing an exception.

Kafka

For the full list of Kafka reader options, see DataStreamReader Kafka options. The following options apply only to batch reads using spark.read.format("kafka").

Key	Default	Description
`endingOffsets`	`latest`	Where to stop reading. Valid values: `latest`, or a JSON string of offsets for each partition such as `{"topicA":{"0":50,"1":-1}}`. In the JSON string, `-1` is the latest offset. `-2`, which is the earliest offset, is not allowed as an ending offset.
`endingOffsetsByTimestamp`	None	Per-partition ending offsets specified as timestamps in milliseconds. Valid values: a JSON string of timestamps for each partition, such as `{"topicA":{"0":2000,"1":3000}}`.
`endingTimestamp`	None	Global ending timestamp in milliseconds applied to all partitions. Valid values: non-negative integers.

Key

Default

Description

endingOffsets

latest

Where to stop reading. Valid values: latest, or a JSON string of offsets for each partition such as {"topicA":{"0":50,"1":-1}}.

In the JSON string, -1 is the latest offset. -2, which is the earliest offset, is not allowed as an ending offset.

endingOffsetsByTimestamp

None

Per-partition ending offsets specified as timestamps in milliseconds. Valid values: a JSON string of timestamps for each partition, such as {"topicA":{"0":2000,"1":3000}}.

endingTimestamp

None

Global ending timestamp in milliseconds applied to all partitions. Valid values: non-negative integers.

ORC

Key	Default	Description
`mergeSchema`	`false`	Whether to infer the schema across multiple files and to merge the schema of each file.

Parquet

Key	Default	Description
`datetimeRebaseMode`	`LEGACY`	Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Valid values: `EXCEPTION`, `LEGACY`, and `CORRECTED`.
`int96RebaseMode`	`LEGACY`	Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Valid values: `EXCEPTION`, `LEGACY`, and `CORRECTED`.
`mergeSchema`	`false`	Whether to infer the schema across multiple files and to merge the schema of each file.
`readerCaseSensitive`	`true`	Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema. When false, read the data in a case-insensitive manner.
`rescuedDataColumn`	None	Whether to collect all data that can't be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios.

State store

Use these options with spark.read.format("statestore") or the read_statestore table-valued function to read Structured Streaming state data. See Read Structured Streaming state information.

Key	Default	Description
`batchId`	Latest batch ID	The target batch to read from. Use to query an earlier state of the query. The batch must be committed but not yet cleaned up. Valid values: non-negative integers.
`operatorId`	`0`	The target operator to read from. Use when the query has multiple stateful operators. Valid values: non-negative integers.
`storeName`	`DEFAULT`	The target state store name to read from. Use when the stateful operator has multiple state store instances. You must specify either `storeName` or `joinSide` for a stream-stream join, but not both. Valid values: any string.
`joinSide`	None	The target side to read from for a stream-stream join. You must specify either `storeName` or `joinSide` for a stream-stream join, but not both. Valid values: `left`, `right`.
`snapshotStartBatchId`	None	The batch ID of the snapshot to use as the starting point when reading state. The reader rebuilds state by replaying changes from this snapshot until `batchId`. Useful when a snapshot is corrupted. Must specify together with `snapshotPartitionId`. Cannot use with `readChangeFeed`. Supports the HDFS-backed state store and the RocksDB state store with changelog checkpointing enabled. Available in Databricks Runtime 15.4 LTS and above. Valid values: non-negative integers.
`snapshotPartitionId`	None	If specified, the query only reads this partition. Must specify together with `snapshotStartBatchId`. Cannot use with `readChangeFeed`. Available in Databricks Runtime 15.4 LTS and above. Valid values: non-negative integers.
`readChangeFeed`	`false`	When `true`, returns state changes across a specified range of batches between `changeStartBatchId` and `changeEndBatchId`. Requires `changeStartBatchId`. Cannot use with `joinSide`, `batchId`, `snapshotStartBatchId`, or `snapshotPartitionId`. Available in Databricks Runtime 16.4 LTS and above. Valid values: `true`, `false`. For details, see Read Structured Streaming state changes.
`changeStartBatchId`	None	The starting batch ID for the change feed range. Required when `readChangeFeed` is `true`. Only applies when `readChangeFeed` is set to `true`. Available in Databricks Runtime 16.4 LTS and above. Valid values: non-negative integers.
`changeEndBatchId`	Latest batch ID	The ending batch ID for the change feed range. Must be greater than or equal to `changeStartBatchId`. Only applies when `readChangeFeed` is set to `true`. Available in Databricks Runtime 16.4 LTS and above. Valid values: non-negative integers.
`stateVarName`	None	The state variable name to read. The state variable name is the unique name of each variable within the `init` function of a `StatefulProcessor` used by the `transformWithState` operator. Required when you use the `transformWithState` operator. Available in Databricks Runtime 16.4 LTS and above. Valid values: any string.
`readRegisteredTimers`	`false`	When `true`, reads registered timers used by the `transformWithState` operator. Only applies to the `transformWithState` operator. Available in Databricks Runtime 16.4 LTS and above. Valid values: `true`, `false`.
`flattenCollectionTypes`	`true`	When `true`, flattens the records returned for map and list state variables. When `false`, returns records as a Spark SQL `Array` or `Map`. Only applies to the `transformWithState` operator. Available in Databricks Runtime 16.4 LTS and above. Valid values: `true`, `false`.

Text

Key	Default	Description
`encoding`	`UTF-8`	The name of the encoding of the TEXT file line separator. For a list of options, see `java.nio.charset.Charset`. The content of the file is not affected by this option and is read as-is.
`lineSep`	None, which covers `\r`, `\r\n` and `\n`	A string between two consecutive TEXT records.
`wholeText`	`false`	Whether to read a file as a single record.

XML

Key	Default	Description
`rowTag`	None	The row tag of the XML files to treat as a row. In the example XML `<books> <book><book>...<books>`, the appropriate value is `book`. This is a required option.
`samplingRatio`	`1.0`	Defines a fraction of rows used for schema inference. XML built-in functions ignore this option. Valid values: `0.0` to `1.0`.
`excludeAttribute`	`false`	Whether to exclude attributes in elements.
`mode`	None	Mode for dealing with corrupt records during parsing. `PERMISSIVE`: For corrupted records, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, you can set a `string` type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, the parser implicitly adds a `columnNameOfCorruptRecord` field in an output schema. `DROPMALFORMED`: Ignores corrupted records. This mode is unsupported for XML built-in functions. `FAILFAST`: Throws an exception when the parser meets corrupted records.
`inferSchema`	`true`	If `true`, attempts to infer an appropriate type for each resulting DataFrame column. If `false`, all resulting columns are of `string` type. XML built-in functions ignore this option.
`columnNameOfCorruptRecord`	`spark.sql.columnNameOfCorruptRecord`	Allows renaming the new field that contains a malformed string created by `PERMISSIVE` mode.
`attributePrefix`	None	The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is `_`. Can be empty for reading XML, but not for writing. Also applies to DataFrameWriter XML options.
`valueTag`	`_VALUE`	The tag used for the character data within elements that also have attribute(s) or child element(s) elements. User can specify the `valueTag` field in the schema or it will be added automatically during schema inference when character data is present in elements with other elements or attributes. Also applies to DataFrameWriter XML options.
`encoding`	`UTF-8`	For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. Also applies to DataFrameWriter XML options.
`ignoreSurroundingSpaces`	`true`	Whether white spaces surrounding values must be skipped. Whitespace-only character data are ignored.
`rowValidationXSDPath`	None	Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors. The XSD does not otherwise affect the schema, whether provided or inferred.
`ignoreNamespace`	`false`	If `true`, namespaces' prefixes on XML elements and attributes are ignored. Tags `<abc:author>` and `<def:author>`, for example, are treated as if both are just `<author>`. Namespaces cannot be ignored on the `rowTag` element, only its read children. XML parsing is not namespace-aware even if `false`.
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	Custom timestamp format string that follows the datetime pattern format. This applies to `timestamp` type. Also applies to DataFrameWriter XML options.
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Custom format string for timestamp without timezone that follows the datetime pattern format. This applies to TimestampNTZType type. Also applies to DataFrameWriter XML options.
`dateFormat`	`yyyy-MM-dd`	Custom date format string that follows the datetime pattern format. This applies to date type. Also applies to DataFrameWriter XML options.
`locale`	`en-US`	Sets a locale as a language tag in IETF BCP 47 format. For instance, `locale` is used while parsing dates and timestamps.
`nullValue`	string `null`	Sets the string representation of a null value. When this is `null`, the parser does not write attributes and elements for fields. Also applies to DataFrameWriter XML options.
`readerCaseSensitive`	`true`	Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names differ by case from the schema. When false, read the data in a case-insensitive manner.
`rescuedDataColumn`	None	Whether to collect all data that can't be parsed due to a data type mismatch and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, see What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios.
`singleVariantColumn`	`none`	Specifies the name of the single variant column. If this option is specified for reading, parse the entire XML record into a single Variant column with the given option string value as the column's name. If this option is provided for writing, write the value of the single Variant column to XML files. Also applies to DataFrameWriter XML options.
`useLegacyXMLParser`	`true`	Whether to use the legacy XML parser. The legacy parser has less stringent validation for malformed content but is less memory-efficient. Set to `false` to opt into the stricter default parser.
`wildcardColName`	`xs_any`	The column name used to capture XML elements that match the wildcard (`xs:any`) schema element. Cannot be used together with `rescuedDataColumn`.

DataStreamReader options

Use these options with DataStreamReader.option() to configure streaming reads from Delta Lake tables and other file-based sources.

For file format options (JSON, CSV, Parquet, and others), see DataFrameReader options.

For Auto Loader (cloudFiles.*) options, see Auto Loader.

Example

The following example sets maxFilesPerTrigger to 10 for a Delta Lake table stream:

Python
Scala

Python
df = spark.readStream.format("delta").option("maxFilesPerTrigger", 10).load("/path/to/delta-table")

Scala
val df = spark.readStream.format("delta").option("maxFilesPerTrigger", "10").load("/path/to/delta-table")

Common

The following options apply to Delta Lake tables and other file-based streaming sources.

Key	Default	Description
`cleanSource`	`off`	How to handle source files after they are processed by the stream. Valid values: `off` (no action), `delete` (permanently delete the source file), `archive` (move to `sourceArchiveDir`). When set to `archive`, `sourceArchiveDir` must also be set. Does not apply to Delta Lake table streaming.
`fileNameOnly`	`false`	Whether to identify already-processed files by filename only rather than by full path. When `true`, files at different paths with the same filename are treated as the same file and are not reprocessed. Does not apply to Delta Lake table streaming.
`latestFirst`	`false`	Whether to process the most recently modified files first within each micro-batch. Useful when you want to process the latest data as quickly as possible. When `true` and `maxFilesPerTrigger` or `maxBytesPerTrigger` is set, `maxFileAge` is ignored. Does not apply to Delta Lake table streaming.
`maxBytesPerTrigger`	None	Soft maximum for the amount of data processed for each micro-batch. A batch may process more than the limit if the smallest input unit exceeds it. When used together with `maxFilesPerTrigger`, the micro-batch processes data until either limit is reached first. Valid values: positive integers. For Auto Loader, use `cloudFiles.maxBytesPerTrigger` instead. See Common.
`maxCachedFiles`	`10000`	Maximum number of unprocessed files to cache for subsequent micro-batches. Set to `0` to turn off caching. Increase this value when the source directory contains many new files for each trigger. Does not apply to Delta Lake table streaming. Valid values: positive integers or `0`.
`maxFileAge`	`7d`	Maximum age of files considered for processing, relative to the timestamp of the most recently modified file rather than the current system time. Files older than this threshold are ignored. Accepts duration strings such as `7d` or `4h`. Ignored when `latestFirst` is `true` and `maxFilesPerTrigger` or `maxBytesPerTrigger` is set. Does not apply to Delta Lake table streaming.
`maxFilesPerTrigger`	`1000` for Delta Lake and Auto Loader. No maximum for other file-based sources.	Upper bound for the number of new files processed in each micro-batch. When used together with `maxBytesPerTrigger`, the micro-batch processes data until either limit is reached first. Valid values: positive integers. For Auto Loader, use `cloudFiles.maxFilesPerTrigger` instead. See Common.
`sourceArchiveDir`	None	Path to the archive directory when `cleanSource` is set to `archive`. Source files are moved to this path after processing, preserving their relative directory structure. Does not apply to Delta Lake table streaming.

Auto Loader

Use these options with the cloudFiles source to configure Auto Loader for streaming ingestion from cloud storage. Options specific to the cloudFiles source are prefixed with cloudFiles to keep them in a separate namespace from other Structured Streaming source options.

Common

Key	Default	Description
`cloudFiles.allowOverwrites`	`false`	Whether to allow input directory file changes to overwrite existing data. For configuration caveats, see Does Auto Loader process the file again when the file gets appended or overwritten?.
`cloudFiles.backfillInterval`	None	Auto Loader can trigger asynchronous backfills at a given interval. For example `1 day` to backfill daily or `1 week` to backfill weekly. For more information, see Trigger regular backfills using cloudFiles.backfillInterval. Do not use when `cloudFiles.useManagedFileEvents` is set to `true`.
`cloudFiles.cleanSource`	`OFF`	Whether to automatically delete processed files from the input directory. When set to `OFF` (default), no files are deleted. When set to `DELETE`, Auto Loader automatically deletes files 30 days after they are processed. To do this, Auto Loader must have write permissions to the source directory. When set to `MOVE`, Auto Loader automatically moves files to the specified location in `cloudFiles.cleanSource.moveDestination` 30 days after they are processed. To do this, Auto Loader must have write permissions to the source directory as well as to the move location. A file is considered processed when it has a non-null value for `commit_time` in the result of the `cloud_files_state` table-valued function. See `cloud_files_state` table-valued function. The 30-day additional wait after processing can be configured using `cloudFiles.cleanSource.retentionDuration`. Review the following considerations before enabling `cloudFiles.cleanSource`: Databricks does not recommend using this option if there are multiple streams consuming data from the source location because the fastest consumer will delete the files and they will not be ingested in the slower sources. Enabling this feature requires Auto Loader to maintain additional state in its checkpoint, which incurs performance overhead but enables improved observability through the `cloud_files_state` table-valued function. See `cloud_files_state` table-valued function. `cleanSource` uses the current setting to decide whether to `MOVE` or `DELETE` a given file. For example, suppose that the setting was `MOVE` when the file was originally processed but was changed to `DELETE` when the file became a candidate for cleanup 30 days later. In this case, cleanSource will delete the file. Files are not guaranteed to be cleaned as soon as the `retentionDuration` expires. To keep costs low, Auto Loader deletes files concurrently with stream processing and terminates as soon as the stream processing is complete or is terminated. Files that were candidates for cleanup, but couldn't be cleaned during the stream processing will be picked up the next time Auto Loader runs. Available in Databricks Runtime 16.4 and above.
`cloudFiles.cleanSource.retentionDuration`	`30 days`	Amount of time to wait before processed files become candidates for archival with `cleanSource`. Must be greater than 7 days for `DELETE`. No minimum restriction for `MOVE`. The value is a CalendarInterval string. For example, `"14 days"`, `"30 days"`, `"2 weeks"`, or `"1 month"`. Available in Databricks Runtime 16.4 and above.
`cloudFiles.cleanSource.moveDestination`	None	Path to archive processed files to when `cloudFiles.cleanSource` is set to `MOVE`. This can be a cloud storage path or a Unity Catalog volume path (for example, `/Volumes/my_catalog/my_schema/my_volume/archive/`). The move location must: Not be a child of the source directory. If you place the move destination inside the source directory, the archived files are ingested again. Be in the same external location, volume, or DBFS mount as the source. Cross-bucket and cross-container moves are not supported and result in an error. Auto Loader must have write permissions to this directory. Available in Databricks Runtime 16.4 and above.
`cloudFiles.format`	None (required option)	The data file format in the source path. Valid values include: `avro`: Avro files `binaryFile`: Binary files `csv`: CSV files `json`: JSON files `orc`: ORC files `parquet`: Parquet files `text`: TXT files `xml`: XML files
`cloudFiles.includeExistingFiles`	`true`	Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect.
`cloudFiles.inferColumnTypes`	`false`	Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See schema inference for more details.
`cloudFiles.maxBytesPerTrigger`	None	The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as `10g` to limit each microbatch to 10 GB of data. This is a soft maximum. If you have files that are 3 GB each, Databricks processes 12 GB in a microbatch. When used together with `cloudFiles.maxFilesPerTrigger`, Databricks consumes up to the lower limit of `cloudFiles.maxFilesPerTrigger` or `cloudFiles.maxBytesPerTrigger`, whichever is reached first. This option has no effect when used with `Trigger.Once()` (`Trigger.Once()` is deprecated). In Databricks Runtime 18.0 and above, this option is dynamically configured and does not need to be set manually.
`cloudFiles.maxFileAge`	None	How long a file event is tracked for deduplication purposes. Databricks does not recommend tuning this parameter unless you are ingesting data at the order of millions of files an hour. See the section on File event tracking for more details. Tuning `cloudFiles.maxFileAge` too aggressively can cause data quality issues such as duplicate ingestion or missing files. Therefore, Databricks recommends a conservative setting for `cloudFiles.maxFileAge`, such as 90 days, which is similar to what comparable data ingestion solutions recommend.
`cloudFiles.maxFilesPerTrigger`	`1000`	The maximum number of new files to be processed in every trigger. When used together with `cloudFiles.maxBytesPerTrigger`, Databricks consumes up to the lower limit of `cloudFiles.maxFilesPerTrigger` or `cloudFiles.maxBytesPerTrigger`, whichever is reached first. This option has no effect when used with `Trigger.Once()` (deprecated). In Databricks Runtime 18.0 and above, this option is dynamically configured and does not need to be set manually.
`cloudFiles.partitionColumns`	None	A comma-separated list of Hive-style partition columns that you would like inferred from the directory structure of the files. Hive-style partition columns are key-value pairs combined by an equality sign such as `<base-path>/a=x/b=1/c=y/file.format`. In this example, the partition columns are `a`, `b`, and `c`. By default these columns are automatically added to your schema if you are using schema inference and provide the `<base-path>` to load data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want these columns as part of your schema, you can specify `""` to ignore these columns. In addition, you can use this option when you want columns to be inferred the file path in complex directory structures, like the example below: `<base-path>/year=2022/week=1/file1.csv` `<base-path>/year=2022/month=2/day=3/file2.csv` `<base-path>/year=2022/month=2/day=4/file3.csv` Specifying `cloudFiles.partitionColumns` as `year,month,day` returns `year=2022` for `file1.csv`, but the `month` and `day` columns are `null`. `month` and `day` are parsed correctly for `file2.csv` and `file3.csv`.
`cloudFiles.schemaEvolutionMode`	`addNewColumns` when a schema is not provided, `none` otherwise	The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See schema evolution for more details.
`cloudFiles.schemaHints`	None	Schema information that you provide to Auto Loader during schema inference. See schema hints for more details.
`cloudFiles.schemaLocation`	None (required to infer the schema)	The location to store the inferred schema and subsequent changes. See schema inference for more details.
`cloudFiles.useStrictGlobber`	`false`	Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See Common data loading patterns for more details. Available in Databricks Runtime 12.2 LTS and above.
`cloudFiles.validateOptions`	`true`	Whether to validate Auto Loader options and return an error for unknown or inconsistent options.

Directory listing

Key	Default	Description
`cloudFiles.useIncrementalListing` (deprecated)	`auto` on Databricks Runtime 17.2 and below, `false` on Databricks Runtime 17.3 and above	This feature has been deprecated. Databricks recommends using file notification mode with file events instead of `cloudFiles.useIncrementalListing`. Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader makes the best effort to automatically detect if a given directory is applicable for the incremental listing. You can explicitly use the incremental listing or use the full directory listing by setting it as `true` or `false` respectively. Incorrectly enabling incremental listing on a non-lexically ordered directory prevents Auto Loader from discovering new files. Works with Azure Data Lake Storage (`abfss://`), S3 (`s3://`), and GCS (`gs://`). Available in Databricks Runtime 9.1 LTS and above. Available values: `auto`, `true`, `false`

Key

Default

Description

cloudFiles.useIncrementalListing (deprecated)

auto on Databricks Runtime 17.2 and below, false on Databricks Runtime 17.3 and above

This feature has been deprecated. Databricks recommends using file notification mode with file events instead of cloudFiles.useIncrementalListing.

Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader makes the best effort to automatically detect if a given directory is applicable for the incremental listing. You can explicitly use the incremental listing or use the full directory listing by setting it as true or false respectively.

Incorrectly enabling incremental listing on a non-lexically ordered directory prevents Auto Loader from discovering new files.

Works with Azure Data Lake Storage (abfss://), S3 (s3://), and GCS (gs://).

Available in Databricks Runtime 9.1 LTS and above. Available values: auto, true, false

File notification

For information about configuring file notification mode, including required cloud permissions, setup instructions, and authentication methods, see Configure Auto Loader streams in file notification mode.

Key	Default	Description
`cloudFiles.fetchParallelism`	`1`	Number of threads to use when fetching messages from the queueing service. Do not use when `cloudFiles.useManagedFileEvents` is set to `true`.
`cloudFiles.pathRewrites`	None	Required only if you specify a `queueUrl` that receives file notifications from multiple S3 buckets and you want to use mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the `bucket/key` path with the mount point. Only prefixes can be rewritten. For example, for the configuration `{"<databricks-mounted-bucket>/path": "dbfs:/mnt/data-warehouse"}`, the path `s3://<databricks-mounted-bucket>/path/2017/08/fileA.json` is rewritten to `dbfs:/mnt/data-warehouse/2017/08/fileA.json`. Do not use when `cloudFiles.useManagedFileEvents` is set to `true`.
`cloudFiles.resourceTag`	None	A series of key-value tag pairs to help associate and identify related resources, for example: `cloudFiles.option("cloudFiles.resourceTag.myFirstKey", "myFirstValue")` `.option("cloudFiles.resourceTag.mySecondKey", "mySecondValue")` For more information on AWS, see Amazon SQS cost allocation tags and Configuring tags for an Amazon SNS topic. (1) For more information on Azure, see Naming Queues and Metadata and the coverage of `properties.labels` in Event Subscriptions. Auto Loader stores these key-value tag pairs in JSON as labels. (1) For more information on GCP, see Reporting usage with labels. (1) Do not use when `cloudFiles.useManagedFileEvents` is set to `true`. Instead set resource tags using the cloud provider console.
`cloudFiles.useManagedFileEvents`	`false`	When set to `true`, Auto Loader uses the file events service to discover files in your external location. You can use this option only if the load path is in an external location with file events enabled. See Use file notification mode with file events. File events provide notifications-level performance in file discovery, because Auto Loader can discover new files after the last run. Unlike directory listing, this process does not need to list all files in the directory. There are some situations when Auto Loader uses directory listing even though the file events option is enabled: During initial load, when `includeExistingFiles` is set to `true`, a full directory listing takes place to discover all of the files that were present in the directory before Auto Loader started. The file events service optimizes file discovery by caching the most recently created files. If Auto Loader runs infrequently, this cache can expire, and Auto Loader falls back to directory listing to discover files and update the cache. To avoid this scenario, invoke Auto Loader at least once every seven days. See When does Auto Loader with file events use directory listing? for a comprehensive list of situations when Auto Loader uses directory listing with this option. Available in Databricks Runtime 14.3 LTS and above.
`cloudFiles.listOnStart`	`false`	When set to `true`, Auto Loader performs a full directory listing when the stream starts, instead of starting with the continuation token in the checkpoint. Use this option to recover from errors, such as `CF_MANAGED_FILE_EVENTS_INVALID_CONTINUATION_TOKEN`. See How do I recover from a `CF_MANAGED_FILE_EVENTS_INVALID_CONTINUATION_TOKEN` error?.
`cloudFiles.useNotifications`	`false`	Whether to use file notification mode to determine when there are new files. If `false`, use directory listing mode. See Compare Auto Loader file detection modes. Do not use when `cloudFiles.useManagedFileEvents` is set to `true`.

(1) Auto Loader adds the following key-value tag pairs by default on a best-effort basis:

vendor: Databricks
path: The location from where the data is loaded. Unavailable in GCP due to labeling limitations.
checkpointLocation: The location of the stream's checkpoint. Unavailable in GCP due to labeling limitations.
streamId: A globally unique identifier for the stream.

Databricks reserves these key names, and you cannot overwrite their values.

Cloud-specific

Auto Loader provides options for configuring cloud infrastructure for file notification mode. For required cloud permissions and setup instructions, see Configure Auto Loader streams in file notification mode.

AWS

Provide the following options only if you choose cloudFiles.useNotifications = true and you want Auto Loader to set up the notification services for you:

Key	Default	Description
`cloudFiles.region`	The region of the EC2 instance	The region where the source S3 bucket resides and where you want to create the AWS SNS and SQS services.

Key	Default	Description
`cloudFiles.restrictNotificationSetupToSameAWSAccountId`	`false`	Only allow event notifications from AWS S3 buckets in the same account as the SNS topic. When true, Auto Loader only accepts event notifications from AWS S3 buckets in the same account as the SNS topic. When `false`, the access policy does not restrict cross-account bucket and SNS topic setups. This is useful when the SNS topic and bucket path are associated with different accounts. Available in Databricks Runtime 17.2 and above.

Key

Default

Description

cloudFiles.restrictNotificationSetupToSameAWSAccountId

false

Only allow event notifications from AWS S3 buckets in the same account as the SNS topic. When true, Auto Loader only accepts event notifications from AWS S3 buckets in the same account as the SNS topic.

When false, the access policy does not restrict cross-account bucket and SNS topic setups. This is useful when the SNS topic and bucket path are associated with different accounts.

Available in Databricks Runtime 17.2 and above.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to use a queue that you have already set up:

Key	Default	Description
`cloudFiles.queueUrl`	None	The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own AWS SNS and SQS services.

AWS authentication options

Provide the following authentication option to use a Databricks service credential:

Key	Default	Description
`databricks.serviceCredential`	None	The name of your Databricks service credential. Available in Databricks Runtime 16.1 and above.

When Databricks service credentials or IAM roles are not available, you can provide the following authentication options instead:

Key	Default	Description
`cloudFiles.awsAccessKey`	None	The AWS access key ID for the user. Must be provided with `cloudFiles.awsSecretKey`.
`cloudFiles.awsSecretKey`	None	The AWS secret access key for the user. Must be provided with `cloudFiles.awsAccessKey`.
`cloudFiles.roleArn`	None	The ARN of an IAM role to assume, if needed. The role can be assumed from your cluster's instance profile or by providing credentials with `cloudFiles.awsAccessKey` and `cloudFiles.awsSecretKey`.
`cloudFiles.roleExternalId`	None	An identifier to provide while assuming a role using `cloudFiles.roleArn`.
`cloudFiles.roleSessionName`	None	An optional session name to use while assuming a role using `cloudFiles.roleArn`.
`cloudFiles.stsEndpoint`	None	An optional endpoint to provide for accessing AWS STS when assuming a role using `cloudFiles.roleArn`.

Azure

You must provide values for all of the following options if you specify cloudFiles.useNotifications = true and you want Auto Loader to set up the notification services for you:

Key	Default	Description
`cloudFiles.resourceGroup`	None	The Azure Resource Group in which the storage account is created.
`cloudFiles.subscriptionId`	None	The Azure Subscription ID in which the resource group is created.
`databricks.serviceCredential`	None	The name of your Databricks service credential. Available in Databricks Runtime 16.1 and above.

If a Databricks service credential is not available, you can provide the following authentication options instead:

Key	Default	Description
`cloudFiles.clientId`	None	The client ID or application ID of the Databricks service principal.
`cloudFiles.clientSecret`	None	The client secret of the Databricks service principal.
`cloudFiles.connectionString`	None	The connection string for the storage account, based on either account access key or shared access signature (SAS).
`cloudFiles.tenantId`	None	The Azure Tenant ID in which the Databricks service principal is created.

Provide the following option only if you set cloudFiles.useNotifications = true and you want Auto Loader to use an existing queue:

Key	Default	Description
`cloudFiles.queueName`	None	The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting up its own Azure Event Grid and Queue Storage services. In that case, your `databricks.serviceCredential` or `cloudFiles.connectionString` requires only read permissions on the queue.

GCP

Auto Loader can automatically set up notification services for you by leveraging Databricks service credentials. The service account created with the Databricks service credential will require the permissions specified in Configure Auto Loader streams in file notification mode.

Key	Default	Description
`cloudFiles.projectId`	None	The ID of the project that the GCS bucket is in. The Google Cloud Pub/Sub subscription is also created within this project.
`databricks.serviceCredential`	None	The name of your Databricks service credential. Available in Databricks Runtime 16.1 and above.

If a Databricks service credential is not available, you can use Google Service Accounts directly. You can either configure your cluster to assume a service account by following Google service setup or provide the following authentication options directly:

Key	Default	Description
`cloudFiles.client`	None	The client ID of the Google Service Account.
`cloudFiles.clientEmail`	None	The email of the Google Service Account.
`cloudFiles.privateKey`	None	The private key that's generated for the Google Service Account.
`cloudFiles.privateKeyId`	None	The ID of the private key that's generated for the Google Service Account.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto Loader to use a queue that you have already set up:

Key	Default	Description
`cloudFiles.subscription`	None	The name of the Google Cloud Pub/Sub subscription. If provided, the cloud files source consumes events from this queue instead of setting up its own GCS Notification and Google Cloud Pub/Sub services.

Delta Lake

The following options apply when reading from a Delta Lake table using spark.readStream.

Key	Default	Description
`allowSourceColumnDrop`	None	Set to a Delta table version number or `"always"` to allow the stream to continue after columns are dropped from the source table schema. When set to a version number, acknowledges all schema changes up to that version. Requires `schemaTrackingLocation`. See Rename and drop columns with Delta Lake column mapping.
`allowSourceColumnRename`	None	Set to a Delta table version number or `"always"` to allow the stream to continue after columns are renamed in the source table. When set to a version number, acknowledges all schema changes up to that version. Requires `schemaTrackingLocation`. See Rename and drop columns with Delta Lake column mapping.
`allowSourceColumnTypeChange`	None	Set to a Delta table version number or `"always"` to allow the stream to continue after column types are changed in the source table. When set to a version number, acknowledges all schema changes up to that version. Requires `schemaTrackingLocation`. See Type widening.
`excludeRegex`	None	A regular expression pattern. Files whose paths match the pattern are excluded from the streaming read. Useful for filtering out files that do not conform to the expected naming convention.
`failOnDataLoss`	`true`	Whether to fail the streaming query if source data has been deleted due to log retention (`logRetentionDuration`). Set to `false` to skip missing data and continue processing. See Configure data retention for time travel queries.
`ignoreChanges` (deprecated)	`false`	Available in Databricks Runtime 11.3 LTS and lower. Re-emits rewritten data files after modification operations such as `UPDATE`, `MERGE INTO`, `DELETE`, or `OVERWRITE`. Unchanged rows may be emitted alongside new rows, so downstream consumers must handle duplicates. Deletes are not propagated downstream. Replaced by `skipChangeCommits` in Databricks Runtime 12.2 LTS and above.
`ignoreDeletes` (deprecated)	`false`	Ignores transactions that delete data at partition boundaries (full partition drops only). Does not handle non-partition deletes, updates, or other modifications. Use `skipChangeCommits` instead.
`readChangeFeed` or `readChangeData`	`false`	Whether to enable reading the change data feed for the streaming query. When enabled, the stream emits row-level changes (inserts, updates, and deletes) with additional metadata columns. See Use Delta Lake change data feed on Databricks.
`schemaTrackingLocation`	None	Path to a directory where Delta Lake tracks schema changes for the streaming read. Required when streaming from tables with column mapping enabled and using `allowSourceColumn*` options to handle schema evolution. Must be within the `checkpointLocation` of the streaming query. See Rename and drop columns with Delta Lake column mapping.
`skipChangeCommits`	`false`	Ignores transactions that delete or modify existing records and processes only appends. Databricks recommends this option for most workloads that do not use change data feeds. Available in Databricks Runtime 12.2 LTS and above. See Skip upstream change commits with `skipChangeCommits`.
`startingTimestamp`	Latest available	Timestamp to start reading from. The stream reads all table changes committed at or after the specified timestamp. If the timestamp precedes all available table commits, the stream starts from the earliest available commit. Cannot be used together with `startingVersion`. Ignored if the streaming checkpoint already exists. Valid values: a timestamp string such as `"2019-01-01T00:00:00.000Z"` or a date string such as `"2019-01-01"`.
`startingVersion`	Latest available	Delta table version to start reading from. The stream reads all changes committed at or after the specified version. Specify `"latest"` to start from only the most recent changes. Cannot be used together with `startingTimestamp`. Ignored if the streaming checkpoint already exists. See Work with table history.
`withEventTimeOrder`	`false`	Divides the initial table snapshot into event time buckets to prevent records from being incorrectly marked as late events and dropped in stateful queries with watermarks. Cannot be changed after initial snapshot processing has begun without deleting the checkpoint. Available in Databricks Runtime 11.3 LTS and above. See Process initial snapshot without dropping data.

Kafka

Use these options with either spark.readStream.format("kafka") or spark.read.format("kafka"):

Key	Default	Description
`assign`	None	The specific partitions to consume. You must specify exactly one of the `subscribe`, `subscribePattern`, or `assign` options. Valid values: a JSON string, such as `{"topicA":[0,1],"topicB":[2,4]}`.
`failOnDataLoss`	`true`	Whether to fail the query if data might have been lost, for example, due to deleted topics or offset truncation. Set to `false` to skip missing data and continue. Valid values: `true`, `false`. Databricks estimates conservatively whether data might have been lost. However, this might cause false alarms.
`fetchoffset.numretries`	`3`	The number of retries when fetching Kafka offsets fails. Valid values: non-negative integers.
`fetchoffset.retryintervalms`	`1000`	The interval in milliseconds between offset fetch retries. Valid values: non-negative integers.
`groupIdPrefix`	`spark-kafka-source` (streaming), `spark-kafka-relation` (batch)	The customized prefix to use for the auto-generated Kafka consumer group ID. If `kafka.group.id` is explicitly set, the connector ignores this option. Valid values: any string.
`includeHeaders`	`false`	Whether to include Kafka message headers as a column in the output. Valid values: `true`, `false`.
`kafkaconsumer.polltimeoutms`	None	The timeout in milliseconds for the Kafka consumer `poll()` call. Valid values: positive integers.
`kafka.bootstrap.servers`	None	A comma-separated list of host:port addresses for Kafka brokers. Sets the Kafka client's `bootstrap.servers` property. If you find there is no data from Kafka, check this broker address list for incorrect addresses. If the broker address list is incorrect, there might not be any errors. Kafka clients assume the brokers will be available eventually and retry forever when they receive network errors.
`maxRecordsPerPartition`	None	The maximum number of records for each Spark partition. When set, the connector splits Kafka partitions so that each Spark partition reads at most this many records. Valid values: positive integers. You can also use this option with `minPartitions`. When both options are set, Spark uses whichever option results in more partitions.
`minPartitions`	None	The minimum number of Spark partitions to read from Kafka. When set, the connector splits large Kafka partitions to increase parallelism. When not set, Spark creates one partition for each Kafka topic-partition. Useful for handling data skew or peak loads. Valid values: positive integers. This option reinitializes Kafka consumers for each trigger, which might affect performance with SSL.
`startingOffsets`	`latest` (streaming), `earliest` (batch)	The offset that the query begins the read from. Valid values: `earliest`, `latest`, or a JSON string of offsets for each partition such as `{"topicA":{"0":23,"1":-2}}`. In the JSON string, `-1` is the latest offset. `-2` is the earliest offset. For streaming queries, this option only applies when a new query starts. Resumed queries always use the checkpoint. During a query, new partitions start reading at the earliest offset. For batch queries, `latest` is not allowed.
`startingOffsetsByTimestamp`	None	A list of starting offsets for each partition, specified as timestamps in milliseconds. When no offset exists for a timestamp, query behavior is determined by `startingOffsetsByTimestampStrategy`. Valid values: a JSON string of timestamps for each partition, such as `{"topicA":{"0":1000,"1":2000}}`. For streaming queries, this option only applies when a new query starts. Resumed queries always use the checkpoint. During a query, new partitions start reading at the earliest offset.
`startingOffsetsByTimestampStrategy`	`error`	The strategy to use when no offset is found for a timestamp specified in `startingOffsetsByTimestamp` or `startingTimestamp`. Valid values: `error` (raises an exception), `latest` (uses the latest available offset).
`startingTimestamp`	None	The global starting timestamp in milliseconds that applies to all partitions. When no offset exists for the timestamp, behavior is controlled by `startingOffsetsByTimestampStrategy`. Valid values: non-negative integers.
`subscribe`	None	The topics to subscribe to. You must specify exactly one of the `subscribe`, `subscribePattern`, or `assign` options. Valid values: a comma-separated list of topic names.
`subscribePattern`	None	The pattern used to subscribe to topics. You must specify exactly one of the `subscribe`, `subscribePattern`, or `assign` options. For example, `topic.*`. Valid values: any Java regex string.

The following options apply only to streaming reads with spark.readStream.format("kafka"):

Key	Default	Description
`bytesEstimateWindowLength`	`300s`	The time window used to estimate remaining bytes for the `estimatedTotalBytesBehindLatest` metric. Valid values: duration strings such as `10m` or `600s`. See Retrieve Kafka metrics.
`maxOffsetsPerTrigger`	None	The maximum number of offsets to process per trigger interval. Offsets are distributed proportionally across topic partitions. Valid values: positive integers.
`maxTriggerDelay`	`15m`	The maximum time to wait for `minOffsetsPerTrigger` to accumulate before triggering. Valid values: duration strings such as `10m` or `600s`.
`minOffsetsPerTrigger`	None	The minimum number of offsets to accumulate before triggering a micro-batch. When `maxTriggerDelay` is reached, the micro-batch runs regardless. Valid values: positive integers.

For offset options that apply only to batch reads with spark.read.format("kafka"), see DataFrameReader Kafka options.

For Kafka client (kafka.*) and authentication options, see Options.

DataFrameWriter options

Use these options with DataFrameWriter.option() and DataFrameWriterV2.option() to control how Databricks writes data.

Example

The following example sets mergeSchema to True for writing a Delta Lake table:

Python
Scala

Python
df.write.format("delta").option("mergeSchema", True).saveAsTable("my_table")

Scala
df.write.format("delta").option("mergeSchema", "true").saveAsTable("my_table")

Avro

Key	Default	Description
`avroSchema`	None	The full Avro schema as a JSON string. Use this option to convert Spark SQL types to specific Avro types. Applies to Avro file.
`avroSchemaUrl`	None	A URL pointing to an Avro schema file. Use instead of `avroSchema` when the schema is stored externally. Mutually exclusive with `avroSchema`. Applies to Avro file.
`compression`	`snappy`	Compression codec to use when writing. Valid values: `uncompressed`, `deflate`, `snappy`, `bzip2`, `xz`, `zstandard`. Applies to Avro file.
`recordName`	`topLevelRecord`	The top-level record name in the output Avro schema. Applies to Avro file.
`positionalFieldMatching`	`false`	Whether to match columns between the Spark schema and the Avro schema by field position instead of by name. Applies to Avro file.
`recordNamespace`	Empty string	The namespace for the top-level record in the output Avro schema. Applies to Avro file.

Delta Lake and Apache Iceberg

Key	Default	Description
`clusterByAuto`	`false`	Whether to enable automatic liquid clustering, where Databricks selects clustering columns based on query patterns. Only valid with `mode("overwrite")`. Cannot be used with `append` mode. Available in Databricks Runtime 16.4 and above. Applies to Use liquid clustering for tables.
`mergeSchema`	None	Whether to enable schema evolution for the write operation. New columns in the source DataFrame are added to the target table schema. Applies to batch and streaming appends. Applies to Update table schema.
`overwriteSchema`	None	Whether to replace the table schema and partitioning when overwriting. Requires `mode("overwrite")` without `replaceWhere`. Cannot be used with `partitionOverwriteMode`. Applies to Update table schema.
`partitionOverwriteMode`	None	The partition overwrite mode. Set this to `dynamic` to overwrite only partitions containing new data, leaving all other partitions unchanged. Legacy mode, not supported on serverless compute or Databricks SQL. Valid values: `static`, `dynamic`. Applies to Selectively overwrite data with Delta Lake.
`replaceOn`	None	A boolean expression that matches rows in the target table to replace with rows from the source query. Can reference columns from both the target table and the source query. Rows in the target that match a source row are deleted and replaced. If the source is empty, no deletions occur. Use `targetAlias` to disambiguate column references. Available in Databricks Runtime 17.1 and above. Applies to Selectively overwrite data with Delta Lake.
`replaceUsing`	None	A comma-separated list of column names used to match rows between the target table and the source query. Both the target and the source must contain all listed columns. Rows in the target that match a source row under equality comparison are deleted and replaced. `NULL` values are treated as not equal and won't match. Available in Databricks Runtime 16.3 and above. Applies to Selectively overwrite data with Delta Lake.
`replaceWhere`	None	A predicate expression. Atomically overwrites only the records that match the predicate. Applies to Selectively overwrite data with Delta Lake.
`targetAlias`	None	A string alias for the target table. Use with `replaceOn` or `replaceWhere` to disambiguate column references when the condition references columns from both the target table and the source query. Applies to Selectively overwrite data with Delta Lake.
`txnAppId`	None	A unique string identifying the application for idempotent writes in `foreachBatch` operations. Use together with `txnVersion` to ensure exactly-once writes to multiple Delta Lake tables. Applies to Use `foreachBatch` for idempotent table writes.
`txnVersion`	None	A monotonically increasing number used as the transaction version for idempotent writes in `foreachBatch` operations. Use together with `txnAppId` to ensure exactly-once writes to multiple Delta Lake tables. Applies to Use `foreachBatch` for idempotent table writes.
`optimizeWrite`	None	Whether to enable Auto Optimize Write for this write operation. Overrides the `spark.databricks.delta.optimizeWrite.enabled` configuration. Applies to What is Delta Lake in Databricks?.
`userMetadata`	None	A user-defined string appended to the commit metadata for the write operation. Visible in the output of `DESCRIBE HISTORY`. Applies to Enrich tables with custom metadata.

CSV

Key	Default	Description
`charToEscapeQuoteEscaping`	`\0` (not enabled)	The character used to escape the escape character when it differs from the quote character. Applies to csv (DataFrameWriter).
`compression`	`none`	Compression codec to use when writing. Valid values: `none`, `bzip2`, `gzip`, `lz4`, `snappy`, `deflate`, `zstd`. Applies to csv (DataFrameWriter).
`dateFormat`	`yyyy-MM-dd`	Format string for date column values. Applies to csv (DataFrameWriter).
`emptyValue`	Empty string	The string written for empty (non-null) values. Applies to csv (DataFrameWriter).
`encoding`	`UTF-8`	The character encoding for the output files. Applies to csv (DataFrameWriter).
`escape`	`\`	The character used to escape quoted values. Applies to csv (DataFrameWriter).
`escapeQuotes`	`true`	Whether to escape quote characters inside quoted field values. Applies to csv (DataFrameWriter).
`header`	`false`	Whether to write column names as the first line of the output. Applies to csv (DataFrameWriter).
`ignoreLeadingWhiteSpace`	`false`	Whether to trim leading whitespace from values when writing. Applies to csv (DataFrameWriter).
`ignoreTrailingWhiteSpace`	`false`	Whether to trim trailing whitespace from values when writing. Applies to csv (DataFrameWriter).
`lineSep`	`\n`	The line separator string used between records. Applies to csv (DataFrameWriter).
`locale`	`en-US`	A `java.util.Locale` identifier. Influences formatting of date and timestamp values when writing.
`nullValue`	Empty string	String written for null values. Applies to csv (DataFrameWriter).
`quote`	`"`	The character used to quote field values that contain the separator. Applies to csv (DataFrameWriter).
`quoteAll`	`false`	Whether to enclose all field values in quotes regardless of content. Applies to csv (DataFrameWriter).
`sep`	`,`	The field delimiter character. Applies to csv (DataFrameWriter).
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	The format string for timestamp column values. Applies to csv (DataFrameWriter).
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Format string for timestamp without timezone (`TimestampNTZType`) column values.

Excel

Key	Default	Description
`dataAddress`	None	The sheet name or starting cell for the write. If omitted, writes to a sheet named `Sheet1` starting at cell `A1`. Accepts a sheet name (`"SheetName"`) or a single cell reference (`"SheetName!A1"`). Cell ranges are not supported for writes.
`dateFormatInWrite`	`yyyy-mm-dd`	Excel cell format string applied to `Date` columns. Uses Excel format syntax.
`headerRows`	`0`	Whether to write column names as the first row. Valid values: `0`, `1`.
`timestampNTZFormat`	`yyyy-mm-dd hh:mm:ss`	Excel cell format string applied to `TimestampNTZ` and `Timestamp` columns. Uses Excel format syntax.
`version`	`xlsx`	The Excel file format version to write. Valid values: `xlsx`, `xls`.

JSON

Key	Default	Description
`compression`	`none`	Compression codec to use when writing. Valid values: `none`, `bzip2`, `gzip`, `lz4`, `snappy`, `deflate`, `zstd`. Applies to json (DataFrameWriter).
`dateFormat`	`yyyy-MM-dd`	Format string for date column values. Applies to json (DataFrameWriter).
`encoding`	`UTF-8`	The character encoding for the output files. Applies to json (DataFrameWriter).
`ignoreNullFields`	value of `spark.sql.jsonGenerator.ignoreNullFields`	Whether to omit fields with null values from the JSON output. Applies to json (DataFrameWriter).
`lineSep`	`\n`	The line separator string used between records. Applies to json (DataFrameWriter).
`locale`	`en-US`	A `java.util.Locale` identifier. Influences formatting of date and timestamp values when writing.
`pretty`	`false`	Whether to enable pretty (indented, multiline) JSON output.
`sortKeys`	`false`	Whether to sort the keys of JSON objects alphabetically in the output. Useful for producing deterministic output.
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	The format string for timestamp column values. Applies to json (DataFrameWriter).
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Format string for timestamp without timezone (`TimestampNTZType`) column values.
`writeNonAsciiCharacterAsCodePoint`	`false`	Whether to encode non-ASCII characters as `\uXXXX` Unicode escape sequences instead of literal UTF-8 characters in the output.

ORC

Key	Default	Description
`compression`	`zstd`	Compression codec to use when writing. Valid values: `none`, `uncompressed`, `snappy`, `zlib`, `lzo`, `zstd`, `lz4`, `brotli`. Applies to orc (DataFrameWriter).

Parquet

Key	Default	Description
`compression`	`snappy`	Compression codec to use when writing. Valid values: `none`, `uncompressed`, `snappy`, `gzip`, `lzo`, `brotli`, `lz4`, `lz4_raw`, `zstd`. Applies to parquet (DataFrameWriter).
`spark.sql.parquet.outputTimestampType`	`INT96`	The physical type used to encode timestamp columns. Valid values: `INT96`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS`. Use `INT96` for compatibility with legacy Parquet readers that do not support the standard timestamp types.

Text

Key	Default	Description
`compression`	`none`	Compression codec to use when writing. Valid values: `none`, `bzip2`, `gzip`, `lz4`, `snappy`, `deflate`, `zstd`. Applies to text (DataFrameWriter).
`encoding`	`UTF-8`	The character encoding for the output files.
`lineSep`	`\n`	The line separator string used between records. Applies to text (DataFrameWriter).

XML

Key	Default	Description
`arrayElementName`	`item`	The element name for array elements that have no explicit name. Applies to xml (DataFrameWriter).
`attributePrefix`	`_`	The prefix prepended to field names that correspond to XML attributes. Applies to xml (DataFrameWriter).
`compression`	`none`	Compression codec to use when writing. Valid values: `none`, `bzip2`, `gzip`, `lz4`, `snappy`, `deflate`, `zstd`. Applies to xml (DataFrameWriter).
`dateFormat`	`yyyy-MM-dd`	Format string for date column values. Applies to xml (DataFrameWriter).
`declaration`	`version="1.0" encoding="UTF-8" standalone="yes"`	The XML declaration string written at the top of each output file. Set to an empty string to suppress the declaration. Applies to xml (DataFrameWriter).
`encoding`	`UTF-8`	The character encoding for the output files. Applies to xml (DataFrameWriter).
`indent`	4 spaces	The string used to indent child elements in the output. Set to an empty string to turn off indentation and write each row on a single line.
`locale`	`en-US`	A `java.util.Locale` identifier. Influences formatting of date and timestamp values when writing.
`nullValue`	`null`	The string written for null values. When set to `null`, attributes and child elements for null fields are omitted. Applies to xml (DataFrameWriter).
`rootTag`	`ROWS`	The root element tag that wraps all row elements in the output. Applies to xml (DataFrameWriter).
`rowTag`	`ROW`	The element tag that represents a row in the output. Applies to xml (DataFrameWriter).
`singleVariantColumn`	None	The name of the single Variant column to write to XML files. Applies to xml (DataFrameWriter).
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	The format string for timestamp column values. Applies to xml (DataFrameWriter).
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Format string for timestamp without timezone column values. Applies to xml (DataFrameWriter).
`validateName`	`true`	Whether to throw an exception if a column name is not a valid XML element identifier. Applies to xml (DataFrameWriter).
`valueTag`	`_VALUE`	The field name used for character data in XML elements that also have attributes or child elements. Applies to xml (DataFrameWriter).

DataStreamWriter options

Use these options with DataStreamWriter.option() to configure streaming writes.

Example

The following example sets the checkpoint location for a stream:

Python
Scala

Python
(df.writeStream
  .format("delta")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start("/path/to/table"))

Scala
df.writeStream
  .format("delta")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start("/path/to/table")

Common

Key	Default	Description
`checkpointLocation`	None (required)	Path to the checkpoint directory for the streaming query. Required for fault tolerance and exactly-once processing guarantees. Each streaming query must use a unique checkpoint location. Databricks recommends storing checkpoints in a Unity Catalog volume or cloud storage path. See Structured Streaming checkpoints.
`path`	None	Output path for file-based streaming sinks such as Parquet. Applies to file-based formats only.

Console sink

Key	Default	Description
`numRows`	`20`	The number of rows to display for each micro-batch when writing to the console sink.
`truncate`	`true`	Whether to truncate long strings when displaying rows. Set to `false` to show full string values.

Delta Lake

The following options apply when writing a stream to a Delta Lake table using format("delta"). Overwrite-only options such as overwriteSchema, replaceWhere, and partitionOverwriteMode are not supported for streaming writes.

Key	Default	Description
`mergeSchema`	`false`	Whether to evolve the Delta Lake table schema when the streaming DataFrame contains new columns. Applies to append output mode only. Applies to Update table schema.
`userMetadata`	None	A user-defined string appended to the commit metadata for the write operation. Visible in the output of `DESCRIBE HISTORY`. Applies to Enrich tables with custom metadata.

File sink

The following option applies when writing a stream to file-based formats (Parquet, JSON, CSV, ORC, text). For format-specific options, see DataFrameWriter options.

Key	Default	Description
`retention`	None	How long to retain sink metadata files used for fault tolerance and compaction. Accepts a time string such as `7 days` or `24 hours`. When not set, metadata files are retained indefinitely.

Kafka sink

For a complete list of options for writing streams to Kafka, see Options.

Key	Default	Description
`kafka.bootstrap.servers`	None	Required. A comma-separated list of Kafka broker `host:port` addresses.
`topic`	None	The target Kafka topic for all rows. Required if the DataFrame does not include a `topic` column.
`kafka.*`	None	Any Kafka producer configuration prefixed with `kafka.`. For example, `kafka.compression.type`.

Memory sink

Key	Default	Description
`queryName`	None (required)	The name of the in-memory table that the query writes to. Required for the memory sink. Also configurable via `.queryName()`.
`mode`	`exactlyonce`	Delivery guarantee for the memory sink. `exactlyonce` uses micro-batch mode with exactly-once semantics. `atleastonce` uses continuous mode with at-least-once semantics. Valid values: `exactlyonce`, `atleastonce`.

Spark function options

Some Spark SQL built-in functions accept an options map that controls parsing or serialization behavior. Pass options as a Python dict or a Scala Map[String, String].

Example

The following example parses a JSON column while dropping malformed records:

Python
Scala

Python
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("name", StringType())])
df = df.withColumn("parsed", from_json("json_col", schema, {"mode": "DROPMALFORMED"}))

Scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("name", StringType)))
val df = df.withColumn("parsed", from_json(col("json_col"), schema, Map("mode" -> "DROPMALFORMED")))

Avro

Avro functions accept the same options as the corresponding DataFrame options:

from_avro and schema_of_avro use DataFrameReader Avro options.
to_avro uses DataFrameWriter Avro options.

Example

The following example decodes an Avro column with schema evolution enabled:

Python
Scala

Python
from pyspark.sql.functions import from_avro

df = df.withColumn("decoded", from_avro("avro_col", json_schema, {"avroSchemaEvolutionMode": "restart"}))

Scala
import org.apache.spark.sql.avro.functions.from_avro

val df = df.withColumn("decoded", from_avro(col("avro_col"), jsonSchema, Map("avroSchemaEvolutionMode" -> "restart")))

In addition, the Schema Registry variants of from_avro and to_avro accept the following options:

Key	Default	Description
`schemaId`	None	Schema ID from the Confluent Schema Registry to use when decoding Avro data that was encoded with a schema incompatible with `jsonFormatSchema`. Applies to `from_avro` only.
`confluent.schema.registry.*`	None	Confluent Schema Registry client configuration properties. Pass any Confluent SR client property using this prefix, for example `confluent.schema.registry.basic.auth.user.info` for basic authentication credentials. Required for the Schema Registry variants of `from_avro` and `to_avro`.

CSV

CSV functions accept the same options as the corresponding DataFrame options:

from_csv and schema_of_csv use DataFrameReader CSV options.
to_csv uses DataFrameWriter CSV options.

Example

The following example reads CSV with a custom separator and NULL value:

Python
Scala

Python
from pyspark.sql.functions import from_csv
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df = df.withColumn("parsed", from_csv("csv_col", schema, {"sep": "|", "nullValue": "N/A"}))

Scala
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val df = df.withColumn("parsed", from_csv(col("csv_col"), schema, Map("sep" -> "|", "nullValue" -> "N/A")))

JSON

JSON functions accept the same options as the corresponding DataFrame options:

from_json and schema_of_json use DataFrameReader JSON options.
to_json uses DataFrameWriter JSON options.

Example

The following example writes JSON with NULL fields ignored and pretty formatting enabled:

Python
Scala

Python
from pyspark.sql.functions import to_json

df = df.withColumn("json_str", to_json("struct_col", {"pretty": "true", "ignoreNullFields": "true"}))

Scala
import org.apache.spark.sql.functions.to_json

val df = df.withColumn("json_str", to_json(col("struct_col"), Map("pretty" -> "true", "ignoreNullFields" -> "true")))

Protobuf

from_protobuf and to_protobuf do not use a file-based DataSource. Protobuf data is always read and written as binary columns using these functions. Options are passed as a Map[String, String] and are case-sensitive.

Example

The following example decodes a Protobuf column using PERMISSIVE mode:

Python
Scala

Python
from pyspark.sql.functions import from_protobuf

df = df.withColumn("decoded", from_protobuf("proto_col", "MyMessage", "/path/to/descriptor.desc",
    {"mode": "PERMISSIVE", "enums.as.ints": "true"}))

Scala
import org.apache.spark.sql.protobuf.functions.from_protobuf

val df = df.withColumn("decoded", from_protobuf(col("proto_col"), "MyMessage", "/path/to/descriptor.desc",
    Map("mode" -> "PERMISSIVE", "enums.as.ints" -> "true")))

Protobuf functions use the following options:

Key	Default	Description
`mode`	`FAILFAST`	How to handle corrupt records. `FAILFAST` throws an exception. `PERMISSIVE` sets malformed fields to null. Valid values: `FAILFAST`, `PERMISSIVE`. Applies to `from_protobuf`.
`recursive.fields.max.depth`	`-1` (disabled)	Maximum recursion depth for recursive Protobuf fields. Set to `0` to turn off recursive field support. Valid values: `0` to `10`. Applies to `from_protobuf`.
`convert.any.fields.to.json`	`false`	Whether to convert Protobuf `Any` fields to a JSON string instead of a `STRUCT`. Applies to `from_protobuf`.
`emit.default.values`	`false`	Whether to emit fields with zero or default values (proto3 semantics). When `false`, fields with default values are omitted from the output. Applies to `from_protobuf`.
`enums.as.ints`	`false`	Whether to render enum fields as integer values instead of strings. Applies to `from_protobuf`.
`upcast.unsigned.ints`	`false`	Whether to upcast `uint32` to `Long` and `uint64` to `Decimal(20,0)` to prevent integer overflow. Applies to `from_protobuf`.
`unwrap.primitive.wrapper.types`	`false`	Whether to unwrap `google.protobuf` wrapper types (for example, `Int32Value` and `StringValue`) to their corresponding primitive Spark types. Applies to `from_protobuf`.
`retain.empty.message.types`	`false`	Whether to retain empty Protobuf message types in the output schema by inserting a dummy column. Applies to `from_protobuf`.
`schema.registry.subject`	None	Schema Registry subject name. Required when using the Schema Registry variants of `from_protobuf` and `to_protobuf`.
`schema.registry.address`	None	Schema Registry address (host and port). Required when using the Schema Registry variants of `from_protobuf` and `to_protobuf`.
`schema.registry.protobuf.name`	None	Specifies which Protobuf message to use when the schema registry subject contains multiple messages. Optional.

XML

XML functions accept the same options as the corresponding DataFrame options:

from_xml and schema_of_xml use DataFrameReader XML options.
to_xml uses DataFrameWriter XML options.

Example

The following example writes XML with custom root and row tags:

Python
Scala

Python
from pyspark.sql.functions import to_xml

df = df.withColumn("xml_str", to_xml("struct_col", {"rootTag": "records", "rowTag": "record"}))

Scala
import org.apache.spark.sql.functions.to_xml

val df = df.withColumn("xml_str", to_xml(col("struct_col"), Map("rootTag" -> "records", "rowTag" -> "record")))

DataFrameReader options​

Example​

Common​

Avro​

CSV​

Excel​

JSON​

Kafka​

ORC​

Parquet​

State store​

Text​

XML​

DataStreamReader options​

Example​

Common​

Auto Loader​

Common​

Directory listing​

File notification​

Cloud-specific​

AWS​

Azure​

GCP​

Delta Lake​

Kafka​

DataFrameWriter options​

Example​

Avro​

Delta Lake and Apache Iceberg​

CSV​

Excel​

JSON​

ORC​

Parquet​

Text​

XML​

DataStreamWriter options​

Example​

Common​

Console sink​

Delta Lake​

File sink​

Kafka sink​

Memory sink​

Spark function options​

Example​

Avro​

Example​

CSV​

Example​

JSON​

Example​

Protobuf​

Example​

XML​

Example​

DataFrameReader options

Example

Common

Avro

CSV

Excel

JSON

Kafka

ORC

Parquet

State store

Text

XML

DataStreamReader options

Example

Common

Auto Loader

Common

Directory listing

File notification

Cloud-specific

AWS

Azure

GCP

Delta Lake

Kafka

DataFrameWriter options

Example

Avro

Delta Lake and Apache Iceberg

CSV

Excel

JSON

ORC

Parquet

Text

XML

DataStreamWriter options

Example

Common

Console sink

Delta Lake

File sink

Kafka sink

Memory sink

Spark function options

Example

Avro

Example

CSV

Example

JSON

Example

Protobuf

Example

XML

Example