Ingest data from SharePoint

Note

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

:::note Compliance

The managed SharePoint connector can be used in workspaces that have enhanced security and compliance settings enabled.

:::

This page shows how to create a managed Microsoft SharePoint ingestion pipeline using Lakeflow Connect.

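Before you begin

To create an ingestion pipeline, you must first meet the following requirements:
- Your workspace must be enabled for Unity Catalog.
- Serverless compute must be enabled for your workspace. See Serverless compute requirements.
- To create a new connection, you must have the CREATE CONNECTION privilege on the metastore. See Manage privileges in Unity Catalog.
  
  If the connector supports UI-based pipeline authoring, an admin can create the connection and the pipeline at the same time by completing the steps on this page. However, if the user who creates the pipeline uses API-based pipeline authoring or is not an admin, an admin must first create the connection in Catalog Explorer. See Connect to managed ingestion sources.
- To use an existing connection, you must have the USE CONNECTION privilege or ALL PRIVILEGES on the connection object.
- You must have the USE CATALOG privilege on the target catalog.
- You must have the USE SCHEMA and CREATE TABLE privileges on an existing schema, or the CREATE SCHEMA privilege on the target catalog.

To ingest from SharePoint, you must first configure a supported authentication method. See Configure OAuth U2M (Databricks-managed).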

Create an ingestion pipeline

For a list of supported file formats and connector-specific limitations, see Microsoft SharePoint connector limitations.

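Databricks UI

1. In the sidebar of the Databricks workspace, click **Data Ingestion**.
2. On the **Add data** page, under **Databricks connectors**, click **Microsoft SharePoint**.
3. On the **Connection** step of the ingestion wizard, select the connection that stores your SharePoint access credentials. If you have the CREATE CONNECTION privilege on the metastore, you can click **Create connection** to create a new connection using the authentication details described in the SharePoint ingestion setup overview.
4. Click **Next**.
5. On the **Ingestion setup** step, enter a name for the pipeline.
6. Select a catalog and schema for the pipeline. If you have the USE CATALOG and CREATE SCHEMA privileges on the catalog, you can click **Create schema** in the drop-down menu to create a new schema.
7. Click **Create ingestion pipeline and launch compute**.
8. On the **Source** step, configure the SharePoint URL and the file ingestion options.
9. Click **Save and continue**.
10. On the **Destination** step, select the catalog and schema to load data into. If you have the USE CATALOG and CREATE SCHEMA privileges on the catalog, you can click **Create schema** in the drop-down menu to create a new schema.
11. Click **Save and continue**.
12. (Optional) On the **Schedules and notifications** step, click **Create schedule**, then set how often the destination tables are refreshed.
13. (Optional) Click **Add notification** to set up email notifications for pipeline success or failure, then click **Save and run pipeline**.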

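Declarative Automation Bundles

Use Declarative Automation Bundles to manage SharePoint pipelines as code. Bundles can contain YAML definitions of jobs and tasks, are managed using the Databricks CLI, and can be shared and run in different target workspaces (such as development, staging, and production). For more information, see What are Declarative Automation Bundles?.

Create a bundle using the Databricks CLI: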
Bash
```
databricks bundle init
```
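Add two new resource files to the bundle:
- A pipeline definition file (for example, resources/sharepoint_pipeline.yml). See pipeline.ingestion_definition and Examples.
- A job definition file that controls the frequency of data ingestion (for example, resources/sharepoint_job.yml).
Deploy the pipeline using the Databricks CLI: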
Bash
```
databricks bundle deploy
```

Examples

Use these examples to configure your pipeline.

Ingest files as binary (unstructured)

Ingest all files in a SharePoint site as binary content. Use this approach for PDFs, Office documents, and other files that you process downstream (for example, in RAG applications).

Declarative Automation Bundles

YAML
resources:
  pipelines:
    sharepoint_binary_pipeline:
      name: sharepoint_binary_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <sharepoint-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: site_files
              connector_options:
                sharepoint_options:
                  entity_type: FILE
                  url: https://<tenant>.sharepoint.com/sites/<site>
                  file_ingestion_options:
                    format: BINARYFILE
                    schema_evolution_mode: NONE

Databricks notebook

Python
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<sharepoint-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "site_files",
          "connector_options": {
            "sharepoint_options": {
              "entity_type": "FILE",
              "url": "https://<tenant>.sharepoint.com/sites/<site>",
              "file_ingestion_options": {
                "format": "BINARYFILE",
                "schema_evolution_mode": "NONE"
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
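# create_pipeline is not defined in this snippet; it is assumed to be a helper defined elsewhere in your notebook.
create_pipeline(pipeline_spec)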
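
The notebook example calls `create_pipeline` without showing its definition. As a rough sketch only, one possible shape for such a helper is a thin wrapper around the Pipelines REST API (`POST /api/2.0/pipelines`). The function names, the extra `host` and `token` parameters, and the example values below are assumptions for illustration, not the helper this page relies on.

```python
import json
import urllib.request


def build_create_pipeline_request(host: str, token: str, pipeline_spec: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a pipeline."""
    # Round-trip through json so a malformed spec fails here, before any network call.
    body = json.dumps(json.loads(pipeline_spec)).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/pipelines",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_pipeline(pipeline_spec: str, host: str, token: str) -> dict:
    """Hypothetical helper: send the request and return the parsed JSON response."""
    request = build_create_pipeline_request(host, token, pipeline_spec)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request is safe to do offline; host and token are placeholders.
example_request = build_create_pipeline_request(
    "https://example.cloud.databricks.com/", "example-token", '{"name": "demo"}'
)
```

Separating request construction from sending keeps the spec validation testable without workspace access.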

Ingest structured files

Ingest structured files (for example, JSON files) from a SharePoint folder. Each row in the source files becomes a row in the target table.

Declarative Automation Bundles

YAML
resources:
  pipelines:
    sharepoint_json_pipeline:
      name: sharepoint_json_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <sharepoint-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: json_files
              connector_options:
                sharepoint_options:
                  entity_type: FILE
                  url: https://<tenant>.sharepoint.com/sites/<site>/<json_folder>
                  file_ingestion_options:
                    format: JSON
                    schema_evolution_mode: NONE

Databricks notebook

Python
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<sharepoint-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "json_files",
          "connector_options": {
            "sharepoint_options": {
              "entity_type": "FILE",
              "url": "https://<tenant>.sharepoint.com/sites/<site>/<json_folder>",
              "file_ingestion_options": {
                "format": "JSON",
                "schema_evolution_mode": "NONE"
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
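# create_pipeline is not defined in this snippet; it is assumed to be a helper defined elsewhere in your notebook.
create_pipeline(pipeline_spec)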

Ingest file metadata only

Ingest file metadata (name, size, timestamps, and path) without downloading file contents. This approach is useful when you need an inventory of files without the overhead of ingesting their contents.

Declarative Automation Bundles

YAML
resources:
  pipelines:
    sharepoint_metadata_pipeline:
      name: sharepoint_metadata_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <sharepoint-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: file_metadata
              connector_options:
                sharepoint_options:
                  entity_type: FILE_METADATA
                  url: https://<tenant>.sharepoint.com/sites/<site>/<library>
                  file_ingestion_options:
                    format: BINARYFILE
                    schema_evolution_mode: NONE

Databricks notebook

Python
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<sharepoint-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "file_metadata",
          "connector_options": {
            "sharepoint_options": {
              "entity_type": "FILE_METADATA",
              "url": "https://<tenant>.sharepoint.com/sites/<site>/<library>",
              "file_ingestion_options": {
                "format": "BINARYFILE",
                "schema_evolution_mode": "NONE"
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
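# create_pipeline is not defined in this snippet; it is assumed to be a helper defined elsewhere in your notebook.
create_pipeline(pipeline_spec)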
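
The three notebook specs in this section differ only in the pipeline name, destination table, URL, `entity_type`, and `format`. A minimal sketch of generating them from one function follows; `build_sharepoint_spec` is a hypothetical helper introduced here for illustration, and the catalog, schema, and placeholder values are copied from the examples above.

```python
import json


def build_sharepoint_spec(name, connection, url, table,
                          entity_type="FILE", file_format="BINARYFILE"):
    """Return a pipeline spec dict shaped like the examples on this page."""
    catalog = "main"
    schema = "ingest_destination_schema"
    return {
        "name": name,
        "catalog": catalog,
        "schema": schema,
        "ingestion_definition": {
            "connection_name": connection,
            "objects": [
                {
                    "table": {
                        "destination_catalog": catalog,
                        "destination_schema": schema,
                        "destination_table": table,
                        "connector_options": {
                            "sharepoint_options": {
                                "entity_type": entity_type,
                                "url": url,
                                "file_ingestion_options": {
                                    "format": file_format,
                                    "schema_evolution_mode": "NONE",
                                },
                            }
                        },
                    }
                }
            ],
        },
        "channel": "PREVIEW",
    }


# Reproduce the metadata-only example; the angle-bracket values are placeholders.
metadata_spec = build_sharepoint_spec(
    name="<pipeline-name>",
    connection="<sharepoint-connection>",
    url="https://<tenant>.sharepoint.com/sites/<site>/<library>",
    table="file_metadata",
    entity_type="FILE_METADATA",
)
pipeline_spec = json.dumps(metadata_spec, indent=2)
```

Building the spec as a dictionary and serializing it with `json.dumps` avoids hand-editing nested JSON strings.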

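Declarative Automation Bundles job definition file

The following is an example job definition file to use with Declarative Automation Bundles. The job runs every day, exactly one day after the previous run. The pipeline_task references the sharepoint_binary_pipeline resource from the binary example; change the resource key to match your own pipeline.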

Declarative Automation Bundles

YAML
resources:
  jobs:
    sharepoint_job:
      name: sharepoint_job

      trigger:
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - <email-address>

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.sharepoint_binary_pipeline.id}

Configure file ingestion options

The file_ingestion_options block controls how files are processed. All options are set in the sharepoint_options.file_ingestion_options block of the pipeline definition.

File filters

Use file_filters to restrict which files are ingested from the source URL.

JSON
"file_ingestion_options": {
  "format": "CSV",
  "file_filters": [
    { "path_filter": "invoices/*.csv" },
    { "modified_after": "2026-01-01T00:00:00" }
  ]
}

For the full file_ingestion_options parameter reference, see Microsoft SharePoint connector reference.
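
When the filter values come from code rather than hand-written JSON, a small builder can keep the shape consistent. The sketch below is illustrative only: `build_file_filters` is a hypothetical helper, and the timestamp format is inferred from the single example above rather than from a documented specification.

```python
from datetime import datetime


def build_file_filters(path_glob=None, modified_after=None):
    """Return a file_filters list shaped like the JSON example above."""
    filters = []
    if path_glob is not None:
        filters.append({"path_filter": path_glob})
    if modified_after is not None:
        # Assumption: second-precision ISO 8601 without a time zone, as in the example.
        filters.append({"modified_after": modified_after.isoformat(timespec="seconds")})
    return filters


file_ingestion_options = {
    "format": "CSV",
    "file_filters": build_file_filters("invoices/*.csv", datetime(2026, 1, 1)),
}
```

How multiple filters combine is not stated on this page, so check the connector reference before relying on a particular behavior.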

Schema evolution

Set schema_evolution_mode to control how new columns in incoming files are handled. The modes match the Auto Loader schema evolution modes. For details, see Microsoft SharePoint connector reference.

Schema hints

Use schema_hints to override inferred column types:

JSON
"file_ingestion_options": {
  "format": "CSV",
  "schema_hints": "order_id INT, amount DOUBLE, ts TIMESTAMP"
}

For usage details, see Override schema inference with schema hints.
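
The schema_hints value is a single comma-separated string of column name and type pairs. As a small sketch, it can be assembled from a mapping so that column names and types stay paired; `build_schema_hints` is a hypothetical helper, and the columns are the ones from the example above.

```python
def build_schema_hints(column_types):
    """Join a {column: SQL type} mapping into a schema_hints string."""
    # Dictionaries preserve insertion order, so the hints appear in the order given.
    return ", ".join(f"{column} {sql_type}" for column, sql_type in column_types.items())


hints = build_schema_hints({"order_id": "INT", "amount": "DOUBLE", "ts": "TIMESTAMP"})
file_ingestion_options = {"format": "CSV", "schema_hints": hints}
```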

Format-specific options

Use format_options to specify format-specific options:

JSON
"file_ingestion_options": {
  "format": "CSV",
  "format_options": {
    "header": "true",
    "sep": ","
  }
}

The supported keys are the standard Auto Loader format options. See Format options.
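
The example above passes every option value as a string, including the boolean `"true"`. The following sketch normalizes native Python values into that form. `to_format_options` is a hypothetical helper, and whether the connector also accepts non-string values is not stated here, so the sketch simply mirrors the example.

```python
def to_format_options(options):
    """Convert option values to strings, writing booleans as "true" or "false"."""
    normalized = {}
    for key, value in options.items():
        if isinstance(value, bool):
            # str(True) would give "True"; the example uses lowercase.
            normalized[key] = "true" if value else "false"
        else:
            normalized[key] = str(value)
    return normalized


file_ingestion_options = {
    "format": "CSV",
    "format_options": to_format_options({"header": True, "sep": ","}),
}
```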

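Common patterns

For advanced pipeline configurations, see Common patterns for managed ingestion pipelines.

Next steps

- Start and schedule your pipeline, and set alerts on it. See Common pipeline maintenance tasks.
- Parse raw documents into text, chunk the parsed data, and create embeddings from the chunks. You can then use readStream on the output table directly in a downstream pipeline. See Downstream RAG use case.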
