Create a Microsoft SharePoint ingestion pipeline

Preview

The Microsoft SharePoint connector is in Beta.

This page describes how to create a Microsoft SharePoint ingestion pipeline using Databricks Lakeflow Connect. The following interfaces are supported:

  • Databricks Asset Bundles
  • Databricks APIs
  • Databricks SDKs
  • Databricks CLI

Before you begin

To create the ingestion pipeline, you must meet the following requirements:

  • Your workspace must be enabled for Unity Catalog.

  • Serverless compute must be enabled for your workspace. See Enable serverless compute.

  • If you plan to create a new connection: You must have CREATE CONNECTION privileges on the metastore.

    If your connector supports UI-based pipeline authoring, you can create the connection and the pipeline at the same time by completing the steps on this page. However, if you use API-based pipeline authoring, you must create the connection in Catalog Explorer before you complete the steps on this page. See Connect to managed ingestion sources.

  • If you plan to use an existing connection: You must have USE CONNECTION privileges or ALL PRIVILEGES on the connection object.

  • You must have USE CATALOG privileges on the target catalog.

  • You must have USE SCHEMA and CREATE TABLE privileges on an existing schema or CREATE SCHEMA privileges on the target catalog.
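If you need to grant these privileges, a metastore admin or object owner can do so from a Databricks notebook. The following is a minimal sketch, assuming placeholder catalog (main), schema (ingest), connection (sharepoint_connection), and principal names; substitute your own objects and user or group.

Python
# Minimal sketch (placeholder names): grant the privileges required to create the
# connection and to write the ingested data. Run as a metastore admin or the owner
# of the objects; replace the catalog, schema, connection, and principal names.
principal = "`user@example.com`"  # the user or group that will create the pipeline

spark.sql(f"GRANT CREATE CONNECTION ON METASTORE TO {principal}")
spark.sql(f"GRANT USE CONNECTION ON CONNECTION sharepoint_connection TO {principal}")
spark.sql(f"GRANT USE CATALOG ON CATALOG main TO {principal}")
spark.sql(f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.ingest TO {principal}")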

To ingest from SharePoint, you must also configure one of the supported authentication methods for your SharePoint connection. See Connect to managed ingestion sources.

Option 1: Databricks notebook

  1. Import the following notebook into your workspace:

    Create a SharePoint ingestion pipeline notebook

  2. Leave the default values in cell 1. Don't modify this cell.

  3. If you want to ingest all drives in your SharePoint site, modify the schema spec in cell 2. If you only want to ingest some drives in your SharePoint site, delete cell 2 and modify the table spec in cell 3 instead.

    Don't modify channel. This must be PREVIEW.

    Cell 2 values to modify:

    • name: A unique name for the pipeline.
    • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
    • source_schema: Your SharePoint site ID.
    • destination_catalog: A name for the destination catalog that will contain the ingested data.
    • destination_schema: A name for the destination schema that will contain the ingested data.
    • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see History tracking.

    Cell 3 values to modify:

    • name: A unique name for the pipeline.
    • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
    • source_schema: Your SharePoint site ID.
    • source_table: The names of the SharePoint drives to ingest.
    • destination_catalog: A name for the destination catalog that will contain the ingested data.
    • destination_schema: A name for the destination schema that will contain the ingested data.
    • destination_table: If your drive name has spaces or special characters in it, you must specify a destination table with a valid name. For example, if the drive name is my drive, specify a destination table name like my_drive (see the sketch after these steps).
    • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see History tracking.
  4. Click Run all.
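You can also derive the destination table name programmatically. The following is a minimal sketch with a hypothetical helper (not part of the connector) that converts a drive name such as my drive into a valid table name such as my_drive:

Python
import re

def to_destination_table(drive_name: str) -> str:
    """Hypothetical helper (not part of the connector): derive a valid
    destination table name from a SharePoint drive name."""
    name = drive_name.strip().lower()
    name = re.sub(r"[^0-9a-z_]+", "_", name)  # replace spaces and special characters
    return name.strip("_")

print(to_destination_table("my drive"))  # my_drive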

Option 2: Databricks CLI

Run the following command:

databricks pipelines create --json "<pipeline definition or json file path>"
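If you prefer to call the Databricks Pipelines REST API directly instead of the CLI, you can POST the same definition to /api/2.0/pipelines. The following is a minimal sketch, assuming the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set and that you saved the JSON body of one of the templates below (the text between the triple quotes) to a file named pipeline.json:

Python
import os

import requests

host = os.environ["DATABRICKS_HOST"]    # e.g., https://<your-workspace-url>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# pipeline.json (assumed file name) holds the JSON body of a schema or table spec.
with open("pipeline.json") as f:
    pipeline_spec = f.read()

resp = requests.post(
    f"{host}/api/2.0/pipelines",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    data=pipeline_spec,
)
resp.raise_for_status()
print(resp.json()["pipeline_id"])  # ID of the newly created pipeline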

Pipeline definition templates

If you want to ingest all drives in your SharePoint site, use the schema spec format for your pipeline definition. If you only want to ingest some drives in your SharePoint site, use the table spec definition format instead. Don't modify channel. This must be PREVIEW.

Schema spec values to modify:

  • name: A unique name for the pipeline.
  • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
  • source_schema: Your SharePoint site ID.
  • destination_catalog: A name for the destination catalog that will contain the ingested data.
  • destination_schema: A name for the destination schema that will contain the ingested data.
  • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see History tracking.

Schema spec template:

JSON
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "schema": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

Table spec values to modify:

  • name: A unique name for the pipeline.
  • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
  • source_schema: Your SharePoint site ID.
  • source_table: The names of the SharePoint drives to ingest.
  • destination_catalog: A name for the destination catalog that will contain the ingested data.
  • destination_schema: A name for the destination schema that will contain the ingested data.
  • destination_table: If your drive name has spaces or special characters in it, you must specify a destination table with a valid name. For example, if the drive name is my drive, specify a destination table name like my_drive.
  • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see History tracking.

Table spec template:

JSON
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "table": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "source_table": "<YOUR_SHAREPOINT_DRIVE_NAME>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "destination_table": "<YOUR_DESTINATION_TABLE_NAME>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

Next steps

  • Start, schedule, and set alerts on your pipeline.
  • You can parse the raw documents to text, chunk the parsed data, create embeddings from the chunks, and more. You can then use readStream on the output table directly in your downstream pipeline. See Downstream RAG use case.
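For example, a downstream pipeline can read the ingested documents as a stream; the three-level table name below is a placeholder:

Python
# Minimal sketch (placeholder table name): stream the ingested SharePoint files
# into a downstream parsing or chunking step.
df = spark.readStream.table("<YOUR_DATABRICKS_CATALOG>.<YOUR_DATABRICKS_SCHEMA>.my_drive")
display(df)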
