Create a Microsoft SharePoint ingestion pipeline
The Microsoft SharePoint connector is in Beta.
This page describes how to create a Microsoft SharePoint ingestion pipeline using Databricks Lakeflow Connect. The following interfaces are supported:
- Databricks Asset Bundles
- Databricks APIs
- Databricks SDKs
- Databricks CLI
Before you begin
To create the ingestion pipeline, you must meet the following requirements:
- Your workspace must be enabled for Unity Catalog.
- Serverless compute must be enabled for your workspace. See Enable serverless compute.
- If you plan to create a new connection: You must have `CREATE CONNECTION` privileges on the metastore. If your connector supports UI-based pipeline authoring, you can create the connection and the pipeline at the same time by completing the steps on this page. However, if you use API-based pipeline authoring, you must create the connection in Catalog Explorer before you complete the steps on this page. See Connect to managed ingestion sources.
- If you plan to use an existing connection: You must have `USE CONNECTION` privileges or `ALL PRIVILEGES` on the connection object.
- You must have `USE CATALOG` privileges on the target catalog.
- You must have `USE SCHEMA` and `CREATE TABLE` privileges on an existing schema, or `CREATE SCHEMA` privileges on the target catalog. (Example grant statements follow this list.)
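For example, a metastore admin or object owner could issue the required grants from a notebook. This is a minimal sketch; the connection, catalog, schema, and principal names are placeholders, and you only need the grants that apply to your situation:

# Minimal sketch of the grants above; sharepoint_conn, main, sharepoint_dest,
# and the principal are placeholder names.
spark.sql("GRANT CREATE CONNECTION ON METASTORE TO `someone@example.com`")
spark.sql("GRANT USE CONNECTION ON CONNECTION sharepoint_conn TO `someone@example.com`")
spark.sql("GRANT USE CATALOG ON CATALOG main TO `someone@example.com`")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.sharepoint_dest TO `someone@example.com`")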
To ingest from SharePoint, you must also configure one of the supported authentication methods.
Option 1: Databricks notebook
- Import the following notebook into your workspace:

  Create a SharePoint ingestion pipeline notebook

- Leave the default values in cell 1. Don't modify this cell.

- If you want to ingest all drives in your SharePoint site, modify the schema spec in cell 2. If you only want to ingest some drives in your SharePoint site, delete cell 2 and modify the table spec in cell 3 instead.

  Don't modify `channel`. This must be `PREVIEW`.

  Cell 2 values to modify:

  - `name`: A unique name for the pipeline.
  - `connection_name`: The Unity Catalog connection that stores the authentication details for SharePoint.
  - `source_schema`: Your SharePoint site ID.
  - `destination_catalog`: A name for the destination catalog that will contain the ingested data.
  - `destination_schema`: A name for the destination schema that will contain the ingested data.
  - `scd_type`: The SCD method to use: `SCD_TYPE_1` or `SCD_TYPE_2`. The default is SCD type 1. For more information, see History tracking.

  Cell 3 values to modify:

  - `name`: A unique name for the pipeline.
  - `connection_name`: The Unity Catalog connection that stores the authentication details for SharePoint.
  - `source_schema`: Your SharePoint site ID.
  - `source_table`: SharePoint drive names.
  - `destination_catalog`: A name for the destination catalog that will contain the ingested data.
  - `destination_schema`: A name for the destination schema that will contain the ingested data.
  - `destination_table`: If your drive name has spaces or special characters in it, you must specify a destination table with a valid name. For example, if the drive name is `my drive`, you must specify a destination table name like `my_drive`. (A sample helper for this follows these steps.)
  - `scd_type`: The SCD method to use: `SCD_TYPE_1` or `SCD_TYPE_2`. The default is SCD type 1. For more information, see History tracking.

- Click Run all.
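If your drive names contain spaces or special characters, a small helper can derive valid destination table names before you edit cell 3. This is a hypothetical utility (`to_table_name` is an assumed name), not part of the connector or the notebook:

# Hypothetical helper to turn a SharePoint drive name such as "my drive"
# into a valid destination table name such as "my_drive".
import re

def to_table_name(drive_name: str) -> str:
    # Replace each run of characters that isn't a letter, digit, or underscore
    # with a single underscore, trim stray underscores, and lowercase the result.
    return re.sub(r"[^0-9A-Za-z_]+", "_", drive_name).strip("_").lower()

print(to_table_name("my drive"))          # my_drive
print(to_table_name("Team Docs (2024)"))  # team_docs_2024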
Option 2: Databricks CLI
Run the following command:
databricks pipelines create --json "<pipeline definition or json file path>"
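If you script pipeline creation from Python, you can pass the definition to the same command programmatically. This is a minimal sketch, assuming the Databricks CLI is installed and authenticated, and that `pipeline_spec` holds one of the templates below:

# Minimal sketch: pass the pipeline definition (a JSON string) to the Databricks CLI.
# Assumes the CLI is installed and already authenticated to your workspace.
import subprocess

subprocess.run(
    ["databricks", "pipelines", "create", "--json", pipeline_spec],
    check=True,
)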
Pipeline definition templates
If you want to ingest all drives in your SharePoint site, use the schema spec format for your pipeline definition. If you only want to ingest some drives in your SharePoint site, use the table spec format instead. Don't modify `channel`. This must be `PREVIEW`.
Schema spec values to modify:

- `name`: A unique name for the pipeline.
- `connection_name`: The Unity Catalog connection that stores the authentication details for SharePoint.
- `source_schema`: Your SharePoint site ID.
- `destination_catalog`: A name for the destination catalog that will contain the ingested data.
- `destination_schema`: A name for the destination schema that will contain the ingested data.
- `scd_type`: The SCD method to use: `SCD_TYPE_1` or `SCD_TYPE_2`. The default is SCD type 1. For more information, see History tracking.
Schema spec template:
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "schema": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
Table spec values to modify:

- `name`: A unique name for the pipeline.
- `connection_name`: The Unity Catalog connection that stores the authentication details for SharePoint.
- `source_schema`: Your SharePoint site ID.
- `source_table`: SharePoint drive names.
- `destination_catalog`: A name for the destination catalog that will contain the ingested data.
- `destination_schema`: A name for the destination schema that will contain the ingested data.
- `destination_table`: If your drive name has spaces or special characters in it, you must specify a destination table with a valid name. For example, if the drive name is `my drive`, you must specify a destination table name like `my_drive`.
- `scd_type`: The SCD method to use: `SCD_TYPE_1` or `SCD_TYPE_2`. The default is SCD type 1. For more information, see History tracking.
Table spec template:
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "table": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "source_table": "<YOUR_SHAREPOINT_DRIVE_NAME>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "destination_table": "<YOUR_DESTINATION_TABLE_NAME>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
Next steps
- Start, schedule, and set alerts on your pipeline.
- You can parse the raw documents to text, chunk the parsed data, create embeddings from the chunks, and more. You can then use `readStream` on the output table directly in your downstream pipeline, as in the sketch below. See Downstream RAG use case.