Read Excel files
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
You can ingest, parse, and query Excel files for batch and streaming workloads using built-in Excel file format support. It automatically infers schema and data types, eliminating the need for external libraries or manual file conversions. This feature provides seamless ingestion from both local uploads and cloud storage.
Key features
- Directly read `.xls` and `.xlsx` files using Databricks SQL and Spark APIs.
- Directly upload `.xls` and `.xlsx` files using the Add Data UI. See Upload files to Databricks.
- Read any sheet from a multi-sheet file.
- Specify exact cell bounds or ranges.
- Automatically infer schema, headers, and data types.
- Ingest evaluated formulas.
- Use Auto Loader for structured streaming of Excel files.
Prerequisites
Databricks Runtime 17.1 or above.
Create or modify a table in the UI
You can use the Create or modify table UI to create tables from Excel files. Start by uploading an Excel file or selecting an Excel file from a volume or an external location. Choose the sheet, adjust the number of header rows, and optionally specify a cell range. The UI supports creating a single table from the selected file and sheet.
Query Excel files
You can query your Excel files using Spark batch (`spark.read`) and streaming (`spark.readStream`) APIs. You can either infer the schema automatically or supply your own schema to parse the Excel files. By default, the parser reads all cells from the top-left cell to the bottom-right non-empty cell in the first sheet. To read a different sheet or cell range, use the `dataAddress` option.
You can query the list of sheets in an Excel file by setting the `operation` option to `listSheets`.
Excel parsing options
The following options are available to parse Excel files:
| Data source option | Description |
|---|---|
| `dataAddress` | The address of the cell range to read, in Excel syntax (for example, `'Sheet1'!A1:E10`). If not specified, the parser reads all valid cells from the first sheet. |
| `headerRows` | The number of initial rows in the Excel file to treat as header rows and read as column names. |
| `operation` | The operation to perform on the Excel workbook. The default reads the cell data; set to `listSheets` to return the list of sheets instead. |
| `timestampNTZFormat` | Custom format string for a timestamp value (stored as a string in Excel) without a timezone that follows the datetime pattern format. This applies to string values read as `TIMESTAMP_NTZ`. |
| `dateFormat` | Custom date format string that follows the datetime pattern format. This applies to string values read as `DATE`. |
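The format options can be combined with the other options in a single read. A minimal sketch, assuming the illustrative pattern strings below match how the values are stored in the sheet:

```python
# Sketch: parse string-typed date and timestamp cells with custom patterns.
# The pattern strings and the path are placeholders.
df = (spark.read
    .option("headerRows", 1)
    .option("dateFormat", "dd/MM/yyyy")
    .option("timestampNTZFormat", "dd/MM/yyyy HH:mm")
    .excel("<path to excel directory or file>"))
```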
Examples
Find code examples for reading Excel files using the built-in Excel connector in Lakeflow Connect.
Read Excel files using a Spark batch read
You can read an Excel file from cloud storage (for example, S3 or ADLS) using `spark.read.excel`. For example:
```python
# Read the first sheet from a single Excel file or from multiple Excel files in a directory
df = spark.read.excel("<path to excel directory or file>")

# Infer schema field names from the header row
df = (spark.read
    .option("headerRows", 1)
    .excel("<path to excel directory or file>"))

# Read a specific sheet and range
df = (spark.read
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:E10")
    .excel("<path to excel directory or file>"))
```
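The batch reader also accepts an explicit schema through the standard Spark `.schema()` API, which skips inference. A minimal sketch, assuming hypothetical column names and types:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for illustration; replace with your own columns.
schema = StructType([
    StructField("product", StringType(), True),
    StructField("revenue", DoubleType(), True),
])

df = (spark.read
    .schema(schema)           # use the explicit schema instead of inferring one
    .option("headerRows", 1)  # treat the first row as a header so it is not read as data
    .excel("<path to excel directory or file>"))
```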
Read Excel files using SQL
You can use the `read_files` table-valued function to ingest Excel files directly using SQL. For example:
```sql
-- Read an entire Excel file
CREATE TABLE my_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  schemaEvolutionMode => "none"
);

-- Read a specific sheet and range
CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  format => "excel",
  headerRows => 1,
  dataAddress => "'Sheet1'!A2:D10",
  schemaEvolutionMode => "none"
);
```
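If inference gets a particular column wrong, `read_files` supports the general `schemaHints` option for overriding individual column types. Assuming it behaves the same for the Excel format as for other formats, a sketch with an illustrative hint string:

```sql
-- Sketch: override the inferred type of selected columns with schemaHints.
-- The column names in the hint string are illustrative.
CREATE TABLE my_hinted_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  format => "excel",
  headerRows => 1,
  schemaHints => "id BIGINT, amount DOUBLE",
  schemaEvolutionMode => "none"
);
```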
Stream Excel files using Auto Loader
You can stream Excel files using Auto Loader by setting `cloudFiles.format` to `excel`. For example:
```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "excel")
    .option("cloudFiles.inferColumnTypes", True)
    .option("headerRows", 1)
    .option("cloudFiles.schemaLocation", "<path to schema location dir>")
    .option("cloudFiles.schemaEvolutionMode", "none")
    .load("<path to excel directory or file>")
)

(df.writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "<path to checkpoint location dir>")
    .table("<table name>"))
```
Ingest Excel files using COPY INTO
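You can incrementally ingest Excel files into a Delta table using COPY INTO. For example: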
```sql
CREATE TABLE IF NOT EXISTS excel_demo_table;

COPY INTO excel_demo_table
FROM "<path to excel directory or file>"
FILEFORMAT = EXCEL
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```
Parse complex non-structured Excel sheets
For complex, non-structured Excel sheets (for example, sheets with multiple tables or data islands), Databricks recommends extracting the cell ranges you need to create your Spark DataFrames using the `dataAddress` option. For example:
```python
df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:E10")
    .load("<path to excel directory or file>"))
```
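For instance, if a single sheet holds two separate tables (a hypothetical layout), you can read each range into its own DataFrame:

```python
# Sketch: a hypothetical sheet with two data islands, read separately.
summary_df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:C20")   # first table
    .load("<path to excel file>"))

detail_df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!E1:H50")   # second table on the same sheet
    .load("<path to excel file>"))
```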
List sheets
You can list the sheets in an Excel file using the `listSheets` operation. The returned schema is a struct with the following fields:
- `sheetIndex`: long
- `sheetName`: string
For example:
Python:
```python
# List the names of the sheets in an Excel file
df = (spark.read.format("excel")
    .option("operation", "listSheets")
    .load("<path to excel directory or file>"))
```
SQL:

```sql
SELECT * FROM read_files(
  "<path to excel directory or file>",
  schemaEvolutionMode => "none",
  operation => "listSheets"
);
```
Limitations
- Password-protected files are not supported.
- Only one header row is supported.
- Merged cell values only populate the top-left cell. The remaining child cells are set to `NULL`.
- Streaming Excel files using Auto Loader is supported, but schema evolution is not. You must explicitly set `schemaEvolutionMode = "none"`.
- "Strict Open XML Spreadsheet (Strict OOXML)" files are not supported.
- Macro execution in `.xlsm` files is not supported.
- The `ignoreCorruptFiles` option is not supported.
FAQ
Find answers to frequently asked questions about the Excel connector in Lakeflow Connect.
Can I read all sheets at once?
The parser reads only one sheet from an Excel file at a time. By default, it reads the first sheet. You can specify a different sheet using the `dataAddress` option. To process multiple sheets, first retrieve the list of sheets by setting the `operation` option to `listSheets`, then iterate over the sheet names and read each one by providing its name in the `dataAddress` option.
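A minimal sketch of that loop, assuming every sheet shares the same layout and that a whole sheet can be addressed by name alone (the `'<name>'!A1` address form is an assumption):

```python
from functools import reduce

path = "<path to excel file>"

# Retrieve the sheet names.
sheets = (spark.read.format("excel")
    .option("operation", "listSheets")
    .load(path))

# Read each sheet by name and union the results.
# Addressing a whole sheet as "'<name>'!A1" is an assumption here.
dfs = [
    (spark.read.format("excel")
        .option("headerRows", 1)
        .option("dataAddress", f"'{row.sheetName}'!A1")
        .load(path))
    for row in sheets.collect()
]
df = reduce(lambda a, b: a.unionByName(b), dfs)
```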
Can I ingest Excel files with complex layouts or multiple tables per sheet?
By default, the parser reads all Excel cells from the top-left cell to the bottom-right non-empty cell. You can specify a different cell range using the `dataAddress` option.
How are formulas and merged cells handled?
Formulas are ingested as their computed values. For merged cells, only the top-left cell retains the value; the remaining child cells are `NULL`.
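If you need the merged value repeated on every row, one common workaround is a window-based forward fill. A sketch, assuming an ordering column (a hypothetical `row_id`) and an illustrative merged column `category`:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward-fill the NULLs left by merged cells; row_id and category are
# hypothetical column names.
w = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, 0)
filled = df.withColumn("category", F.last("category", ignorenulls=True).over(w))
```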
Can I use Excel ingestion in Auto Loader and streaming jobs?
Yes, you can stream Excel files using `cloudFiles.format = "excel"`. However, schema evolution is not supported, so you must set `schemaEvolutionMode` to `"none"`.
Is password-protected Excel supported?
No. If this functionality is critical to your workflows, contact your Databricks account representative.