read_files
table-valued function
Applies to: Databricks SQL
Databricks Runtime 13.1 and later
Reads files under a provided location and returns the data in tabular form.
Supports reading JSON
, CSV
, TEXT
, BINARYFILE
, PARQUET
, AVRO
, and ORC
file formats.
Can detect the file format automatically and infer a unified schema across all files.
Arguments
This function requires named parameter invocation for the option keys.
path
: ASTRING
with the URI of the location of the data. Supports reading from Azure Data Lake Storage Gen2 ('abfss://'
), S3 (s3://
) and Google Cloud Storage ('gs://'
). Can contain globs. See File discovery for more details.option_key
: The name of the option to configure. You need to use backticks (`) for options that contain dots (.
).option_value
: A constant expression to set the option to. Accepts literals and scalar functions.
File discovery
read_files
can read an individual file or read files under a provided directory. read_files
discovers all files under the provided directory recursively unless a glob is provided, which instructs read_files
to recurse into a specific directory pattern.
Filtering directories or files using glob patterns
Glob patterns can be used for filtering directories and files when provided in the path.
Pattern |
Description |
---|---|
|
Matches any single character |
|
Matches zero or more characters |
|
Matches a single character from character set {a,b,c}. |
|
Matches a single character from the character range {a…z}. |
|
Matches a single character that is not from character set or range {a}.
Note that the |
|
Matches a string from the string set {ab, cd}. |
|
Matches a string from the string set {ab, cde, cfh}. |
read_files
uses Auto Loader’s strict globber when discovering files with globs. This is configured by the useStrictGlobber
option. When the strict globber is disabled, trailing slashes (/
) are dropped and a star pattern such as /*/
can expand into discovering multiple directories. See the examples below to see the difference in behavior.
Pattern |
File path |
Strict globber disabled |
Strict globber enabled |
---|---|---|---|
/a/b |
/a/b/c/file.txt |
Yes |
Yes |
/a/b |
/a/b_dir/c/file.txt |
No |
No |
/a/b |
/a/b.txt |
No |
No |
/a/b/ |
/a/b.txt |
No |
No |
/a/*/c/ |
/a/b/c/file.txt |
Yes |
Yes |
/a/*/c/ |
/a/b/c/d/file.txt |
Yes |
Yes |
/a/*/d/ |
/a/b/c/d/file.txt |
Yes |
No |
/a/*/c/ |
/a/b/x/y/c/file.txt |
Yes |
No |
/a/*/c |
/a/b/c_file.txt |
Yes |
No |
/a/*/c/ |
/a/b/c_file.txt |
Yes |
No |
/a/*/c |
/a/b/cookie/file.txt |
Yes |
No |
/a/b* |
/a/b.txt |
Yes |
Yes |
/a/b* |
/a/b/file.txt |
Yes |
Yes |
/a/{0.txt,1.txt} |
/a/0.txt |
Yes |
Yes |
/a/*/{0.txt,1.txt} |
/a/0.txt |
No |
No |
/a/b/[cde-h]/i/ |
/a/b/c/i/file.txt |
Yes |
Yes |
Schema inference
The schema of the files can be explicitly provided to read_files
with the schema
option. When the schema is not provided, read_files
attempts to infer a unified schema across the discovered files, which requires reading all the files unless a LIMIT
statement is used. Even when using a LIMIT
query, a larger set of files than required might be read to return a more representative schema of the data. Databricks automatically adds a LIMIT
statement for SELECT
queries in notebooks and the SQL editor if a user hasn’t provided one.
The schemaHints
option can be used to fix subsets of the inferred schema. See Override schema inference with schema hints for more details.
A rescuedDataColumn
is provided by default to rescue any data that doesn’t match the schema. See What is the rescued data column? for more details. You can drop the rescuedDataColumn
by setting the option schemaEvolutionMode => 'none'
.
Partition schema inference
read_files
can also infer partitioning columns if files are stored under Hive-style partitioned directories, that is /column_name=column_value/
. If a schema
is provided, the discovered partition columns leverage the types provided in the schema
. If the partition columns are not part of the provided schema
, then the inferred partition columns are ignored.
If a column exists in both the partition schema and in the data columns, the value that is read from the partition value is used instead of the data value. If you would like to ignore the values coming from the directory and use the data column, you can provide the list of partition columns in a comma-separated list with the partitionColumns
option.
The partitionColumns
option can also be used to instruct read_files
on which discovered columns to include in the final inferred schema. Providing an empty string ignores all partition columns.
The schemaHints
option can also be provided to override the inferred schema for a partition column.
The TEXT
and BINARYFILE
formats have a fixed schema, but read_files
also attempts to infer partitioning for these formats when possible.
Usage in streaming tables
read_files
can be used in streaming tables to ingest files into Delta Lake. read_files
leverages Auto Loader when used in a streaming table query. You must use the STREAM
keyword with read_files
. See What is Auto Loader? for more details.
When used in a streaming query, read_files
will use a sample of the data to infer the schema, and can evolve the schema on-the-fly as it processes more data. See Configure schema inference and evolution in Auto Loader for more details.
Options
Basic Options
Option |
---|
format Type: The data file format in the source path. Auto-inferred if not provided. Allowed values include:
Default value: None |
inferColumnTypes Type: Whether to infer exact column types when leveraging schema inference. By default, columns are inferred when inferring JSON and CSV datasets. See schema inference for more details. Note that this is the opposite of the default of Auto Loader. Default value: |
partitionColumns Type: A comma separated list of Hive style partition columns that you would like
inferred from the directory structure of the files. Hive style partition
columns are key value pairs combined by an equality sign such as
Specifying Default value: None |
schemaHints Type: Schema information that you provide to Auto Loader during schema inference. See schema hints for more details. Default value: None |
useStrictGlobber Type: Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See Common data loading patterns for more details. Available in Databricks Runtime 12.0 and above. Note that this is the opposite of the default for Auto Loader. Default value: |
Generic options
The following options apply to all file formats.
Option |
---|
ignoreCorruptFiles Type: Whether to ignore corrupt files. If true, the Spark jobs will continue
to run when encountering corrupted files and the contents that have been
read will still be returned. Observable as Default value: |
ignoreMissingFiles Type: Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.0 and above. Default value: |
modifiedAfter Type: An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None |
modifiedBefore Type: An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None |
pathGlobFilter or fileNamePattern Type: A potential glob pattern to provide for choosing files. Equivalent to
Default value: None |
recursiveFileLookup Type: Whether to load data recursively within the base directory and skip partition inference. Default value: |
JSON
options
Option |
---|
allowBackslashEscapingAnyCharacter Type: Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: |
allowComments Type: Whether to allow the use of Java, C, and C++ style comments
( Default value: |
allowNonNumericNumbers Type: Whether to allow the set of not-a-number ( Default value: |
allowNumericLeadingZeros Type: Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001). Default value: |
allowSingleQuotes Type: Whether to allow use of single quotes (apostrophe,
character Default value: |
allowUnquotedControlChars Type: Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: |
allowUnquotedFieldNames Type: Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: |
badRecordsPath Type: The path to store files for recording the information about bad JSON records. Default value: None |
columnNameOfCorruptRecord Type: The column for storing records that are malformed and cannot be parsed.
If the Default value: |
dateFormat Type: The format for parsing date strings. Default value: |
dropFieldIfAllNull Type: Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: |
encoding or charset Type: The name of the encoding of the JSON files. See Default value: |
inferTimestamp Type: Whether to try and infer timestamp strings as a Default value: |
lineSep Type: A string between two consecutive JSON records. Default value: None, which covers |
locale Type: A Default value: |
mode Type: Parser mode around handling malformed records. One of Default value: |
multiLine Type: Whether the JSON records span multiple lines. Default value: |
prefersDecimal Type: Attempts to infer strings as Default value: |
primitivesAsString Type: Whether to infer primitive types like numbers and booleans as Default value: |
rescuedDataColumn Type: Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
timestampFormat Type: The format for parsing timestamp strings. Default value: |
timeZone Type: The Default value: None |
CSV
options
Option |
---|
badRecordsPath Type: The path to store files for recording the information about bad CSV records. Default value: None |
charToEscapeQuoteEscaping Type: The character used to escape the character used for escaping quotes.
For example, for the following record:
Default value: |
columnNameOfCorruptRecord Type: A column for storing records that are malformed and cannot be parsed.
If the Default value: |
comment Type: Defines the character that represents a line comment when found in the
beginning of a line of text. Use Default value: |
dateFormat Type: The format for parsing date strings. Default value: |
emptyValue Type: String representation of an empty value. Default value: |
encoding or charset Type: The name of the encoding of the CSV files. See Default value: |
enforceSchema Type: Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: |
escape Type: The escape character to use when parsing the data. Default value: |
header Type: Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. Default value: |
ignoreLeadingWhiteSpace Type: Whether to ignore leading whitespaces for each parsed value. Default value: |
ignoreTrailingWhiteSpace Type: Whether to ignore trailing whitespaces for each parsed value. Default value: |
inferSchema Type: Whether to infer the data types of the parsed CSV records or to assume all
columns are of Default value: |
lineSep Type: A string between two consecutive CSV records. Default value: None, which covers |
locale Type: A Default value: |
maxCharsPerColumn Type: Maximum number of characters expected from a value to parse. Can be used to
avoid memory errors. Defaults to Default value: |
maxColumns Type: The hard limit of how many columns a record can have. Default value: |
mergeSchema Type: Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: |
mode Type: Parser mode around handling malformed records. One of Default value: |
multiLine Type: Whether the CSV records span multiple lines. Default value: |
nanValue Type: The string representation of a non-a-number value when parsing Default value: |
negativeInf Type: The string representation of negative infinity when parsing Default value: |
nullValue Type: String representation of a null value. Default value: |
parserCaseSensitive (deprecated) Type: While reading files, whether to align columns declared in the header with the
schema case sensitively. This is Default value: |
positiveInf Type: The string representation of positive infinity when parsing Default value: |
preferDate Type: Attempts to infer strings as dates instead of timestamp when possible. You
must also use schema inference, either by enabling Default value: |
quote Type: The character used for escaping values where the field delimiter is part of the value. Default value: |
readerCaseSensitive Type: Specifies the case sensitivity behavior when Default value: |
rescuedDataColumn Type: Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to What is the rescued data column?. Default value: None |
sep or delimiter Type: The separator string between columns. Default value: |
skipRows Type: The number of rows from the beginning of the CSV file that should be ignored
(including commented and empty rows). If Default value: |
timestampFormat Type: The format for parsing timestamp strings. Default value: |
timeZone Type: The Default value: None |
unescapedQuoteHandling Type: The strategy for handling unescaped quotes. Allowed options:
Default value: |
PARQUET
options
Option |
---|
datetimeRebaseMode Type: Controls the rebasing of the DATE and TIMESTAMP values between Julian and
Proleptic Gregorian calendars. Allowed values: Default value: |
int96RebaseMode Type: Controls the rebasing of the INT96 timestamp values between Julian and
Proleptic Gregorian calendars. Allowed values: Default value: |
mergeSchema Type: Whether to infer the schema across multiple files and to merge the schema of each file. Default value: |
readerCaseSensitive Type: Specifies the case sensitivity behavior when Default value: |
rescuedDataColumn Type: Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to What is the rescued data column?. Default value: None |
AVRO
options
Option |
---|
avroSchema Type: Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible but different with the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None |
datetimeRebaseMode Type: Controls the rebasing of the DATE and TIMESTAMP values between Julian and
Proleptic Gregorian calendars. Allowed values: Default value: |
mergeSchema Type: Whether to infer the schema across multiple files and to merge the schema
of each file.
Default value: |
readerCaseSensitive Type: Specifies the case sensitivity behavior when Default value: |
rescuedDataColumn Type: Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to What is the rescued data column?. Default value: None |
BINARYFILE
options
Binary files do not have any additional configuration options.
TEXT
options
Option |
---|
encoding Type: The name of the encoding of the TEXT files. See Default value: |
lineSep Type: A string between two consecutive TEXT records. Default value: None, which covers |
wholeText Type: Whether to read a file as a single record. Default value: |
ORC
options
Option |
---|
mergeSchema Type: Whether to infer the schema across multiple files and to merge the schema of each file. Default value: |
Streaming options
These options apply when using read_files
inside a streaming table or streaming query.
Option |
---|
allowOverwrites Type: Whether to re-process files that have been modified after discovery. The latest available version of the file will be processed during a refresh if it has been modified since the last successful refresh query start time. Default value: |
includeExistingFiles Type: Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect. Default value: |
maxBytesPerTrigger Type: The maximum number of new bytes to be processed in every trigger.
You can specify a byte string such as Default value: None |
maxFilesPerTrigger Type: The maximum number of new files to be processed in every trigger.
When used together with Default value: 1000 |
schemaEvolutionMode Type: The mode for evolving the schema as new columns are discovered in the data.
By default, columns are inferred as strings when inferring JSON datasets. See
schema evolution
for more details. This option doesn’t apply to Default value: |
schemaLocation Type: The location to store the inferred schema and subsequent changes. See schema inference for more details. The schema location is not required when used in a streaming table query. Default value: None |
Examples
-- Reads the files available in the given path. Auto-detects the format and schema of the data.
> SELECT * FROM read_files('abfss://container@storageAccount.dfs.core.windows.net/base/path');
-- Reads the headerless CSV files in the given path with the provided schema.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'csv',
schema => 'id int, ts timestamp, event string');
-- Infers the schema of CSV files with headers. Because the schema is not provided,
-- the CSV files are assumed to have headers.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'csv')
-- Reads files that have a csv suffix.
> SELECT * FROM read_files('s3://bucket/path/*.csv')
-- Reads a single JSON file
> SELECT * FROM read_files(
'abfss://container@storageAccount.dfs.core.windows.net/path/single.json')
-- Reads JSON files and overrides the data type of the column `id` to integer.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'json',
schemaHints => 'id int')
-- Reads files that have been uploaded or modified yesterday.
> SELECT * FROM read_files(
'gs://my-bucket/avroData',
modifiedAfter => date_sub(current_date(), 1),
modifiedBefore => current_date())
-- Creates a Delta table and stores the source file path as part of the data
> CREATE TABLE my_avro_data
AS SELECT *, _metadata.file_path
FROM read_files('gs://my-bucket/avroData')
-- Creates a streaming table that processes files that appear only after the table's creation.
-- The table will most likely be empty (if there's no clock skew) after being first created,
-- and future refreshes will bring new data in.
> CREATE OR REFRESH STREAMING TABLE avro_data
AS SELECT * FROM STREAM read_files('gs://my-bucket/avroData', includeExistingFiles => false);