Ingest files from SharePoint

Beta

The standard SharePoint connector is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

Learn how to ingest structured, semi-structured, and unstructured files from Microsoft SharePoint into Delta tables. The SharePoint connector supports incremental ingestion of SharePoint files using batch and streaming APIs, including Auto Loader, spark.read, and COPY INTO, all with Unity Catalog governance.

Choose your SharePoint connector

Lakeflow Connect offers two complementary SharePoint connectors. Both access data in SharePoint, but they serve distinct goals. The following considerations can help you choose between them.

Management and customization

  • Managed SharePoint connector: A fully managed connector. Simple, low-maintenance connectors for enterprise applications that ingest data into Delta tables and keep them in sync with the source. See Managed connectors in Lakeflow Connect.
  • Standard SharePoint connector: Build custom ingestion pipelines with SQL, PySpark, or Lakeflow Spark Declarative Pipelines using batch and streaming APIs such as read_files, spark.read, COPY INTO, and Auto Loader. Offers the flexibility to perform complex transformations during ingestion, while giving you greater responsibility for managing and maintaining your pipelines.

Output format

  • Managed SharePoint connector: Uniform binary content table. Ingests each file in binary format (one file per row), along with file metadata in additional columns.
  • Standard SharePoint connector: Structured Delta tables. Ingests structured files (like CSV and Excel) as Delta tables. Can also be used to ingest unstructured files in binary format.

Granularity, filtering, and selection

  • Managed SharePoint connector: No subfolder or file-level selection today, and no pattern-based filtering. Ingests all files in the specified SharePoint document library.
  • Standard SharePoint connector: Granular and custom. URL-based selection to ingest from document libraries, subfolders, or individual files. Also supports pattern-based filtering using the pathGlobFilter option.

Key features

The standard SharePoint connector offers:

  • Ingestion of structured, semi-structured, and unstructured files
  • Granular ingestion: Ingest a specific site, a sub-site, a document library, a folder, or a single file
  • Batch and streaming ingestion using spark.read, Auto Loader, and COPY INTO
  • Automatic schema inference and evolution for structured and semi-structured formats such as CSV and Excel
  • Secure credentials storage with a Unity Catalog connection
  • File selection with pattern matching using pathGlobFilter

Requirements

To ingest files from SharePoint, you must have the following:

  • A workspace with Unity Catalog enabled
  • CREATE CONNECTION privileges to create a SharePoint connection, or USE CONNECTION to use an existing one (see the grant example after this list)
  • Compute that uses Databricks Runtime version 17.3 LTS or above
  • OAuth authentication set up with the Sites.Read.All permission scope
  • The SharePoint Beta feature enabled from the Previews page. See Manage Databricks previews
  • Optional: The Excel Beta feature enabled for parsing Excel files. See Read Excel files
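
If someone else created the connection, its owner or a metastore admin can grant you access to it with standard Unity Catalog SQL. The connection name and group below are placeholders.

SQL
-- Grant use of an existing SharePoint connection (placeholder names)
GRANT USE CONNECTION ON CONNECTION my_sharepoint_conn TO `data-engineers`;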

Create the connection

Create a Unity Catalog connection to store your SharePoint credentials. The connection setup process is shared between both the standard and managed SharePoint connectors.

For complete connection setup instructions including OAuth authentication options, see Overview of SharePoint ingestion setup.
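
Unity Catalog connections are created with the CREATE CONNECTION SQL command. The sketch below is illustrative only: the connection type and OPTIONS keys shown are assumptions, not the documented SharePoint syntax, so follow the setup guide linked above for the exact values.

SQL
-- Illustrative sketch only. The TYPE value and OPTIONS keys are assumptions;
-- see the SharePoint ingestion setup guide for the authoritative syntax.
CREATE CONNECTION my_sharepoint_conn
TYPE SHAREPOINT
OPTIONS (
  client_id '<azure-app-client-id>',         -- assumed option name
  client_secret '<azure-app-client-secret>', -- assumed option name
  tenant_id '<azure-tenant-id>'              -- assumed option name
);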

Read files from SharePoint

To read files from SharePoint using Spark, specify the connection you created in the previous step using the databricks.connection data source option and provide the URL of the SharePoint resource you want to access. This URL can refer to a specific file, a folder, a document library (drive), or an entire site. Examples include:

  • https://mytenant.sharepoint.com/sites/test-site/
  • https://mytenant.sharepoint.com/sites/test-site/test-subsite
  • https://mytenant.sharepoint.com/sites/test-site/test-drive
  • https://mytenant.sharepoint.com/sites/test-site/Shared%20Documents/Forms/AllItems.aspx
  • https://mytenant.sharepoint.com/sites/test-site/test-drive/test-folder
  • https://mytenant.sharepoint.com/sites/test-site/test-drive/test-folder/test.csv
  • https://mytenant.sharepoint.com/sites/test-site/another-subsite/another-drive/test.csv

Examples

Find code examples for reading files using the standard SharePoint connector.

Stream SharePoint files using Auto Loader

Auto Loader provides the most efficient way to incrementally ingest files from SharePoint. It automatically detects new files and processes them as they arrive, and it supports structured and semi-structured formats such as CSV and JSON with automatic schema inference and evolution. For details about Auto Loader usage, see Common data loading patterns.

Python
# Incrementally ingest new PDF files
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("databricks.connection", "my_sharepoint_conn")
    .option("cloudFiles.schemaLocation", <path to a schema location>)
    .option("pathGlobFilter", "*.pdf")
    .load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
)

# Incrementally ingest CSV files with automatic schema inference and evolution
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("databricks.connection", "my_sharepoint_conn")
    .option("cloudFiles.schemaLocation", <path to a schema location>)  # required for schema inference
    .option("pathGlobFilter", "*.csv")
    .option("inferColumnTypes", True)
    .option("header", True)
    .load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs")
)
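
The snippets above only define streaming DataFrames. To land the data in a Delta table, attach a writeStream with a checkpoint location; the table name and checkpoint path below are placeholders.

Python
# Write the stream to a Delta table (placeholder table and checkpoint names)
(df.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/sharepoint_csv")
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("main.default.sharepoint_csv_table"))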

Read SharePoint files using Spark batch read

Python
# Read unstructured data as binary files
df = (spark.read
    .format("binaryFile")
    .option("databricks.connection", "my_sharepoint_conn")
    .option("recursiveFileLookup", True)
    .option("pathGlobFilter", "*.pdf")  # optional. Example: only ingest PDFs
    .load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents"))

# Read a batch of CSV files, infer the schema, and load the data into a DataFrame
df = (spark.read
    .format("csv")
    .option("databricks.connection", "my_sharepoint_conn")
    .option("pathGlobFilter", "*.csv")
    .option("recursiveFileLookup", True)
    .option("inferSchema", True)
    .option("header", True)
    .load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs"))

# Read a specific Excel file from SharePoint, infer the schema, and load the data into a DataFrame
df = (spark.read
    .format("excel")
    .option("databricks.connection", "my_sharepoint_conn")
    .option("headerRows", 1)  # optional
    .option("dataAddress", "'Sheet1'!A1:M20")  # optional
    .load("https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx"))

Read SharePoint files using Spark SQL

The following examples show how to ingest SharePoint files in SQL using the read_files table-valued function. For details about read_files usage, see read_files table-valued function.

SQL
-- Read PDF files
CREATE TABLE my_table AS
SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
`databricks.connection` => "my_sharepoint_conn",
format => "binaryFile",
pathGlobFilter => "*.pdf", -- optional. Example: only ingest PDFs
schemaEvolutionMode => "none"
);

-- Read a specific Excel sheet and range
CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx",
`databricks.connection` => "my_sharepoint_conn",
format => "excel",
headerRows => 1, -- optional
dataAddress => "'Sheet1'!A2:D10", -- optional
schemaEvolutionMode => "none"
);
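
If you prefer to stay in SQL for incremental ingestion, the same read_files call can also back a streaming table. This is a sketch that assumes a Databricks SQL or pipeline context that supports CREATE OR REFRESH STREAMING TABLE; the table name is a placeholder.

SQL
-- Incrementally ingest new CSV files into a streaming table (sketch)
CREATE OR REFRESH STREAMING TABLE sharepoint_csv_streaming AS
SELECT * FROM STREAM read_files(
  "https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs",
  `databricks.connection` => "my_sharepoint_conn",
  format => "csv",
  header => true,
  pathGlobFilter => "*.csv"
);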

Incremental ingestion with COPY INTO

COPY INTO provides idempotent incremental loading of files into a Delta table. For details about COPY INTO usage, see Common data loading patterns using COPY INTO.

SQL
CREATE TABLE IF NOT EXISTS sharepoint_pdf_table;
CREATE TABLE IF NOT EXISTS sharepoint_csv_table;
CREATE TABLE IF NOT EXISTS sharepoint_excel_table;

-- Incrementally ingest new PDF files
COPY INTO sharepoint_pdf_table
FROM "https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents"
FILEFORMAT = BINARYFILE
PATTERN = '*.pdf'
COPY_OPTIONS ('mergeSchema' = 'true');

-- Incrementally ingest CSV files with automatic schema inference and evolution
COPY INTO sharepoint_csv_table
FROM "https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs"
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

-- Ingest a single Excel file
COPY INTO sharepoint_excel_table
FROM "https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx"
FILEFORMAT = EXCEL
FORMAT_OPTIONS('headerRows' = '1')
COPY_OPTIONS ('mergeSchema' = 'true');

Ingest SharePoint files in Lakeflow Spark Declarative Pipelines

note

The SharePoint connector requires Databricks Runtime 17.3 LTS or above, which is not yet available in a Lakeflow Spark Declarative Pipelines release. To see the Databricks Runtime versions used with a Lakeflow Spark Declarative Pipelines release, see the release notes for that release.

The following examples show how to read SharePoint files using Auto Loader in Lakeflow Spark Declarative Pipelines:

Python
from pyspark import pipelines as dp

# Incrementally ingest new PDF files
@dp.table
def sharepoint_pdf_table():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .option("databricks.connection", "my_sharepoint_conn")
        .option("pathGlobFilter", "*.pdf")
        .load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
    )

# Incrementally ingest CSV files with automatic schema inference and evolution
@dp.table
def sharepoint_csv_table():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("databricks.connection", "my_sharepoint_conn")
        .option("pathGlobFilter", "*.csv")
        .option("inferColumnTypes", True)
        .option("header", True)
        .load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs")
    )

# Read a specific Excel file from SharePoint in a materialized view
@dp.table
def sharepoint_excel_table():
    return (spark.read.format("excel")
        .option("databricks.connection", "my_sharepoint_conn")
        .option("headerRows", 1)  # optional
        .option("inferColumnTypes", True)  # optional
        .option("dataAddress", "'Sheet1'!A1:M20")  # optional
        .load("https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx")
    )

Limitations

The standard SharePoint connector has the following limitations:

  • No multi-site ingestion: You cannot ingest multiple sites with the same query. To ingest from two sites, you must write two separate queries.
  • Filtering: You can use the pathGlobFilter option to filter files by name. Folder path-based filtering is not supported.
  • Unsupported formats: SharePoint Lists and .aspx site pages are not supported. Only files in Document Libraries are supported.
  • Writing back to a SharePoint server is not supported.
  • Auto Loader cleanSource (deleting or archiving files at the source after ingestion) is not supported.

Next steps