Read Excel files

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

You can ingest, parse, and query Excel files in batch and streaming workloads using built-in Excel file format support. The parser automatically infers the schema and data types, eliminating the need for external libraries or manual file conversions, and supports ingestion from both local uploads and cloud storage.

Key features

  • Directly read .xls and .xlsx files using Databricks SQL and Spark APIs.
  • Directly upload .xls and .xlsx files using the Add Data UI. See Upload files to Databricks.
  • Read any sheet from a multi-sheet file.
  • Specify exact cell bounds or ranges.
  • Automatically infer schema, headers, and data types.
  • Ingest evaluated formulas.
  • Use Auto Loader for structured streaming of Excel files.

Prerequisites

Databricks Runtime 17.1 or above.

Create or modify a table in the UI

You can use the Create or modify table UI to create tables from Excel files. Start by uploading an Excel file or selecting one from a volume or an external location. Choose the sheet, adjust the number of header rows, and optionally specify a cell range. The UI creates a single table from the selected file and sheet.

Query Excel files

You can query your Excel files using the Spark batch (spark.read) and streaming (spark.readStream) APIs. You can either let the parser infer the schema automatically or specify your own schema. By default, the parser reads all cells from the top-left cell to the bottom-right non-empty cell in the first sheet. To read a different sheet or cell range, use the dataAddress option.
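
For example, here is a minimal sketch of a batch read with an explicit schema; the column names are illustrative assumptions, not part of the connector:

Python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical column names; adjust to match your sheet
schema = StructType([
    StructField("product", StringType()),
    StructField("price", DoubleType()),
])

df = (spark.read.format("excel")
    .schema(schema)           # skip schema inference
    .option("headerRows", 1)  # the first row holds column names
    .load("<path to excel directory or file>"))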

You can query the list of sheets in an Excel file by setting the operation option to listSheets.

Excel parsing options

The following data source options are available to parse Excel files:

dataAddress

The address of the cell range to read in Excel syntax. If not specified, the parser reads all valid cells from the first sheet.

  • "" or omitted: Reads all data from the first sheet.
  • "MySheet!C5:H10": Reads range C5 to H10 from the sheet named MySheet.
  • "C5:H10": Reads range C5 to H10 from the first sheet.
  • "Sheet1!A1:A1": Reads only cell A1 from Sheet1.
  • "Sheet1": Reads all data from Sheet1.
  • "'My Sheet!?>'!D5:G10": Reads D5 to G10 from My Sheet!?>. Enclose the sheet name with '.

headerRows

The number of initial rows in the Excel file to treat as header rows and read as column names. When dataAddress is specified, headerRows applies to the header rows within that cell range. Supported values are 0 and 1. The default is 0, in which case column names are generated automatically by appending the column number to _c (for example: _c1, _c2, _c3, ...).

Examples (see the sketch after this list):

  • dataAddress: "A2:D5", headerRows: "0": Infers column names as _c1 to _c4. Reads the first row of data from row 2: A2 to D2.
  • dataAddress: "A2:D5", headerRows: "1": Sets column names as cell values in row 2: A2 to D2. Reads the first row of data from row 3: A3 to D3.
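
The following sketch shows the two settings above side by side; the path is a placeholder:

Python
# headerRows = 0: column names are auto-generated; data starts at row 2 (A2:D2)
df0 = (spark.read.format("excel")
    .option("dataAddress", "A2:D5")
    .option("headerRows", 0)
    .load("<path to excel directory or file>"))

# headerRows = 1: row 2 (A2:D2) supplies column names; data starts at row 3 (A3:D3)
df1 = (spark.read.format("excel")
    .option("dataAddress", "A2:D5")
    .option("headerRows", 1)
    .load("<path to excel directory or file>"))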

operation

Indicates the operation to perform on the Excel workbook. The default is readSheet, which reads data from a sheet. The other supported operation is listSheets, which returns the list of sheets in the workbook. For the listSheets operation, the returned schema is a struct with the following fields:

  • sheetIndex: long
  • sheetName: string

timestampNTZFormat

Custom format string for timestamp values without a timezone (stored as strings in Excel) that follows the datetime pattern format. Applies to string values read as TimestampNTZType. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS].

dateFormat

Custom date format string that follows the datetime pattern format. This applies to string values read as Date. Default: yyyy-MM-dd.
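
The following sketch combines these format options with an explicit schema; the column names and format strings are illustrative assumptions:

Python
from pyspark.sql.types import StructType, StructField, StringType, DateType, TimestampNTZType

# Hypothetical columns; event_date and event_ts are stored as strings in Excel
schema = StructType([
    StructField("event_name", StringType()),
    StructField("event_date", DateType()),
    StructField("event_ts", TimestampNTZType()),
])

df = (spark.read.format("excel")
    .schema(schema)
    .option("headerRows", 1)
    .option("dateFormat", "MM/dd/yyyy")                   # for example, 03/14/2025
    .option("timestampNTZFormat", "yyyy-MM-dd HH:mm:ss")  # for example, 2025-03-14 09:30:00
    .load("<path to excel directory or file>"))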

Examples

Find code examples for reading Excel files using the built-in Excel connector in Lakeflow Connect.

Read Excel files using a Spark batch read

You can read an Excel file from cloud storage (for example, S3 or ADLS) using spark.read.excel. For example:

Python
# Read the first sheet from a single Excel file or from multiple Excel files in a directory
df = spark.read.excel("<path to excel directory or file>")

# Infer schema field names from the header row
df = (spark.read
    .option("headerRows", 1)
    .excel("<path to excel directory or file>"))

# Read a specific sheet and range
df = (spark.read
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:E10")
    .excel("<path to excel directory or file>"))

Read Excel files using SQL

You can use the read_files table-valued function to ingest Excel files directly using SQL. For example:

SQL
-- Read an entire Excel file
CREATE TABLE my_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  schemaEvolutionMode => "none"
);

-- Read a specific sheet and range
CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
  "<path to excel directory or file>",
  format => "excel",
  headerRows => 1,
  dataAddress => "'Sheet1'!A2:D10",
  schemaEvolutionMode => "none"
);

Stream Excel files using Auto Loader

You can stream Excel files using Auto Loader by setting cloudFiles.format to excel. For example:

Python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "excel")
    .option("cloudFiles.inferColumnTypes", True)
    .option("headerRows", 1)
    .option("cloudFiles.schemaLocation", "<path to schema location dir>")
    .option("cloudFiles.schemaEvolutionMode", "none")
    .load("<path to excel directory or file>"))

(df.writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "<path to checkpoint location dir>")
    .table("<table name>"))

Ingest Excel files using COPY INTO

You can load Excel files into a table using the COPY INTO SQL command, which skips files that have already been loaded. For example:

SQL
CREATE TABLE IF NOT EXISTS excel_demo_table;

COPY INTO excel_demo_table
FROM "<path to excel directory or file>"
FILEFORMAT = EXCEL
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Parse complex non-structured Excel sheets

For complex, non-structured Excel sheets (for example, multiple tables per sheet or data islands), Databricks recommends extracting the cell ranges you need to create your Spark DataFrames using the dataAddress option. For example:

Python
df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:E10")
    .load("<path to excel directory or file>"))
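
If a sheet contains more than one data island, you can read each range into its own DataFrame; the ranges below are illustrative assumptions about the sheet layout:

Python
# Hypothetical layout: two tables on the same sheet at known ranges
sales_df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A1:C20")
    .load("<path to excel directory or file>"))

targets_df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!E1:G12")
    .load("<path to excel directory or file>"))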

List sheets

You can list the sheets in an Excel file using the listSheets operation. The returned schema is a struct with the following fields:

  • sheetIndex: long
  • sheetName: string

For example:

Python
# List the sheets in an Excel file
df = (spark.read.format("excel")
    .option("operation", "listSheets")
    .load("<path to excel directory or file>"))

Limitations

  • Password-protected files are not supported.
  • Only one header row is supported.
  • Merged cell values only populate the top-left cell. Remaining child cells are set to NULL.
  • Streaming Excel files using Auto Loader is supported, but schema evolution is not. You must explicitly set schemaEvolutionMode to "none".
  • "Strict Open XML Spreadsheet (Strict OOXML)" is not supported.
  • Macro execution in .xlsm files is not supported.
  • The ignoreCorruptFiles option is not supported.

FAQ

Find answers to frequently asked questions about the Excel connector in Lakeflow Connect.

Can I read all sheets at once?

The parser reads only one sheet from an Excel file at a time. By default, it reads the first sheet. You can specify a different sheet using the dataAddress option. To process multiple sheets, first retrieve the list of sheets by setting the operation option to listSheets, then iterate over the sheet names and read each one by providing its name in the dataAddress option.
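
A minimal sketch of that pattern, assuming all sheets share a compatible layout (the union step is an illustrative choice):

Python
from functools import reduce

path = "<path to excel directory or file>"

# List the sheets in the workbook
sheets = (spark.read.format("excel")
    .option("operation", "listSheets")
    .load(path)
    .collect())

# Read each sheet by name and union the results
dfs = [
    spark.read.format("excel")
        .option("dataAddress", row.sheetName)
        .option("headerRows", 1)
        .load(path)
    for row in sheets
]
df_all = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)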

Can I ingest Excel files with complex layouts or multiple tables per sheet?

By default, the parser reads all Excel cells from the top-left cell to the bottom-right non-empty cell. You can specify a different cell range using the dataAddress option.

How are formulas and merged cells handled?

Formulas are ingested as their computed values. For merged cells, only the top-left value is retained (child cells are NULL).
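
If you need the merged value repeated on every row it spans, one option is to fill down after reading; this sketch assumes two hypothetical columns, order_col (preserving row order) and merged_col:

Python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward-fill the NULLs left by merged cells, ordered by a hypothetical order_col
w = Window.orderBy("order_col").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_filled = df.withColumn("merged_col", F.last("merged_col", ignorenulls=True).over(w))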

Can I use Excel ingestion in Auto Loader and streaming jobs?

Yes, you can stream Excel files using cloudFiles.format = "excel". However, schema evolution is not supported, so you must set schemaEvolutionMode to "none".

Is password-protected Excel supported?

No. If this functionality is critical to your workflows, contact your Databricks account representative.