Ingest data from SharePoint

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

This page shows how to create a managed Microsoft SharePoint ingestion pipeline using Lakeflow Connect.

Before you begin

  • To create the ingestion pipeline, you must first meet the following requirements:

    • Your workspace must be enabled for Unity Catalog.

    • Serverless compute must be enabled for your workspace. See Serverless compute requirements.

    • If you plan to create a new connection: You must have CREATE CONNECTION privileges on the metastore. See Manage privileges in Unity Catalog.

      If the connector supports UI-based pipeline authoring, an admin can create the connection and the pipeline at the same time by completing the steps on this page. However, if the users who create pipelines use API-based pipeline authoring or are non-admin users, an admin must first create the connection in Catalog Explorer. See Connect to managed ingestion sources.

    • If you plan to use an existing connection: You must have USE CONNECTION privileges or ALL PRIVILEGES on the connection object.

    • You must have USE CATALOG privileges on the target catalog.

    • You must have USE SCHEMA and CREATE TABLE privileges on an existing schema or CREATE SCHEMA privileges on the target catalog.

  • To ingest from SharePoint, you must first configure a supported authentication method. See Overview of SharePoint ingestion setup.

Create an ingestion pipeline

  1. Import the following notebook into your workspace:

  2. Leave the default values in cell 1 as-is; don't modify this cell.

  3. If you want to ingest all drives in your SharePoint site, modify the schema spec in cell 2. If you only want to ingest some drives in your SharePoint site, delete cell 2 and modify the table spec in cell 3 instead.

    Don't modify channel. This must be PREVIEW.

  4. Click Run all.

Pipeline definition templates

If you want to ingest all drives in your SharePoint site, use the schema spec format for your pipeline definition. If you only want to ingest some drives, use the table spec format instead. Don't modify channel. This must be PREVIEW.

Schema spec values to modify:

  • name: A unique name for the pipeline.
  • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
  • source_schema: Your SharePoint site ID.
  • destination_catalog: A name for the destination catalog that will contain the ingested data.
  • destination_schema: A name for the destination schema that will contain the ingested data.
  • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see Enable history tracking (SCD type 2).
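To make the SCD choice concrete, here is a minimal, framework-free sketch (plain Python, not the connector's actual implementation) of how the two SCD types treat an update to the same record:

```python
from datetime import date

def apply_change_scd1(table, key, new_row):
    """SCD type 1: overwrite the existing row in place; no history is kept."""
    table[key] = new_row
    return table

def apply_change_scd2(history, key, new_row, changed_on):
    """SCD type 2: close out the current version and append the new one."""
    for row in history:
        if row["key"] == key and row["end"] is None:
            row["end"] = changed_on  # close the previously current version
    history.append({"key": key, "start": changed_on, "end": None, **new_row})
    return history

# SCD type 1: one row per key; the old value is lost
t1 = apply_change_scd1({"doc1": {"title": "Draft"}}, "doc1", {"title": "Final"})
print(t1)  # {'doc1': {'title': 'Final'}}

# SCD type 2: both versions survive, each with a validity range
t2 = apply_change_scd2(
    [{"key": "doc1", "start": date(2024, 1, 1), "end": None, "title": "Draft"}],
    "doc1", {"title": "Final"}, date(2024, 6, 1),
)
print(len(t2))  # 2
```

With SCD type 2, the closed-out row records when each version stopped being current, which is what "history tracking" refers to above.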

Schema spec template:

JSON
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "catalog": "<YOUR_DATABRICKS_CATALOG>",
  "schema": "<YOUR_DATABRICKS_SCHEMA>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "schema": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

Table spec values to modify:

  • name: A unique name for the pipeline.
  • connection_name: The Unity Catalog connection that stores the authentication details for SharePoint.
  • source_schema: Your SharePoint site ID.
  • source_table: The name of the SharePoint drive to ingest.
  • destination_catalog: A name for the destination catalog that will contain the ingested data.
  • destination_schema: A name for the destination schema that will contain the ingested data.
  • destination_table: A name for the destination table (for example, my_drive).
  • scd_type: The SCD method to use: SCD_TYPE_1 or SCD_TYPE_2. The default is SCD type 1. For more information, see Enable history tracking (SCD type 2).

Table spec template:

JSON
pipeline_spec = """
{
  "name": "<YOUR_PIPELINE_NAME>",
  "catalog": "<YOUR_DATABRICKS_CATALOG>",
  "schema": "<YOUR_DATABRICKS_SCHEMA>",
  "ingestion_definition": {
    "connection_name": "<YOUR_CONNECTION_NAME>",
    "objects": [
      {
        "table": {
          "source_schema": "<YOUR_SHAREPOINT_SITE_ID>",
          "source_table": "<YOUR_SHAREPOINT_DRIVE_NAME>",
          "destination_catalog": "<YOUR_DATABRICKS_CATALOG>",
          "destination_schema": "<YOUR_DATABRICKS_SCHEMA>",
          "destination_table": "<YOUR_DESTINATION_TABLE_NAME>",
          "table_configuration": {
            "scd_type": "SCD_TYPE_1"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

Common patterns

For advanced pipeline configurations, see Common patterns for managed ingestion pipelines.

Next steps

  • Start, schedule, and set alerts on your pipeline. See Common pipeline maintenance tasks.
  • You can parse the raw documents to text, chunk the parsed data, create embeddings from the chunks, and more. You can then use readStream on the output table directly in your downstream pipeline. See Downstream RAG use case.

Additional resources