Microsoft SharePoint connector reference

Preview

The Microsoft SharePoint connector is in Beta.

This page contains reference material for the Microsoft SharePoint connector in Databricks Lakeflow Connect.

Ingested data format

The ingested data lands in the following format. A site in SharePoint maps to a schema in Databricks. A drive in the SharePoint site maps to a table in the destination schema.

Field

Type

Description

file_id

String

The unique SharePoint identifier of the file.

file_metadata

Struct

Contains generic file metadata:

  • name (string): The name of the file, as it appears in SharePoint.
  • size_in_bytes (bigint): The size of the file, in bytes.
  • created_timestamp (timestamp): The timestamp at which the file was created in SharePoint.
  • last_modified_timestamp (timestamp): The timestamp at which the file was last modified in SharePoint.

source_metadata

Struct

Contains SharePoint-specific metadata for the file:

  • site_id (string): The SharePoint site identifier.
  • drive_id (string): The SharePoint drive identifier.
  • file_folder_path (string): The path of the file in SharePoint (for example, /drives/d1/root:/folder1).
  • quick_xor_hash (string): A custom hash provided by Microsoft that can be used to validate that your downloaded content is accurate. This value can be NULL (for example, if the format does not support hashing). See Code Snippets: QuickXorHash Algorithm in the Microsoft documentation.
  • mime_type (string): The MIME type (format) of the file.
  • web_url (string): A link to the file in SharePoint.
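
As an illustration of how the quick_xor_hash value might be checked, the following is a minimal Python sketch of the QuickXorHash algorithm, based on a reading of Microsoft's published reference implementation (160-bit state, 11-bit shift per byte, file length XORed into the last 8 bytes, Base64 output). The function name quick_xor_hash is hypothetical and is not part of the connector. Verify the sketch against the Microsoft reference before relying on it.

```python
import base64

WIDTH_BITS = 160  # size of the hash state, per the Microsoft reference
SHIFT_BITS = 11   # each successive input byte lands 11 bits further along


def quick_xor_hash(data: bytes) -> str:
    """Return the Base64-encoded QuickXorHash of data (illustrative sketch)."""
    mask = (1 << WIDTH_BITS) - 1
    state = 0
    for index, byte in enumerate(data):
        position = (index * SHIFT_BITS) % WIDTH_BITS
        shifted = byte << position
        # Bits pushed past bit 159 wrap around to bit 0 (circular shift).
        state ^= (shifted & mask) | (shifted >> WIDTH_BITS)

    digest = bytearray(state.to_bytes(WIDTH_BITS // 8, "little"))

    # XOR the 64-bit little-endian input length into the last 8 bytes.
    length_bytes = (len(data) & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
    offset = len(digest) - len(length_bytes)
    for i, b in enumerate(length_bytes):
        digest[offset + i] ^= b

    return base64.b64encode(bytes(digest)).decode("ascii")
```

To validate a file, compute the hash over the downloaded bytes and compare it with source_metadata.quick_xor_hash, skipping the check when that value is NULL.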

content

Struct

Contains file content. Databricks does not recommend accessing this struct directly. Instead, access it using the UDFs in Downstream RAG use case.

sequence_id

Long

A sequencing key for ordering different versions of the same file.

is_deleted

Boolean

Ignore this column. The value will always be false. If you need to identify deleted files, Databricks recommends enabling SCD type 2 and using the __END_AT column.
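
The sequence_id and __END_AT columns can be combined to find the current version of each file and to spot deletions. The sketch below uses plain Python dictionaries in place of table rows so that it runs anywhere. It rests on two assumptions that the reference above does not state: that a larger sequence_id means a later version, and that under SCD type 2 the active row of a file has a NULL (None) __END_AT while a deleted file has no active row. The helper names are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def latest_versions(rows: list[Row]) -> dict[str, Row]:
    """For each file_id, keep the row with the highest sequence_id.

    Assumption: a larger sequence_id means a later version.
    """
    latest: dict[str, Row] = {}
    for row in rows:
        current = latest.get(row["file_id"])
        if current is None or row["sequence_id"] > current["sequence_id"]:
            latest[row["file_id"]] = row
    return latest


def deleted_file_ids(rows: list[Row]) -> set[str]:
    """Return file_ids that have no active row.

    Assumption: with SCD type 2, an active row has __END_AT set to None.
    """
    all_ids = {row["file_id"] for row in rows}
    active_ids = {row["file_id"] for row in rows if row.get("__END_AT") is None}
    return all_ids - active_ids


# Made-up sample rows: file "a" has two versions, file "b" was deleted.
rows = [
    {"file_id": "a", "sequence_id": 1, "__END_AT": "2025-01-02"},
    {"file_id": "a", "sequence_id": 2, "__END_AT": None},
    {"file_id": "b", "sequence_id": 1, "__END_AT": "2025-01-03"},
]
current = latest_versions(rows)
deleted = deleted_file_ids(rows)
```

In a pipeline, the same logic would more likely be written as a SQL or DataFrame query over the destination table. The dictionary version only illustrates the intended semantics.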