Ingest data from Google Drive

Note

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

This page describes how to create a managed Google Drive ingestion pipeline using LakeFlow Connect.

Before you begin

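To create an ingestion pipeline, you must first meet the following requirements:
- Your workspace must be enabled for Unity Catalog.
- Serverless compute must be enabled for your workspace. See Serverless compute requirements.
- To create a new connection, you must have the CREATE CONNECTION privilege on the metastore. See Manage privileges in Unity Catalog.
  
  If the connector supports UI-based pipeline authoring, an admin can create the connection and the pipeline at the same time by completing the steps on this page. However, if the users who create pipelines use API-based pipeline authoring or are not admins, an admin must first create the connection in Catalog Explorer. See Connect to managed ingestion sources.
- To use an existing connection, you must have the USE CONNECTION privilege or ALL PRIVILEGES on the connection object.
- You must have the USE CATALOG privilege on the target catalog.
- You must have the USE SCHEMA and CREATE TABLE privileges on an existing schema, or the CREATE SCHEMA privilege on the target catalog.

To ingest from Google Drive, you must first configure OAuth 2.0 and create a Unity Catalog connection. See Set up Google Drive for managed ingestion.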
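
The privileges listed above for the existing-connection case can be written as Unity Catalog GRANT statements. The following is a minimal sketch rather than an official procedure: the helper grant_statements, the group name data-engineers, and the connection name gdrive_connection are hypothetical. It only builds the SQL text, so someone with sufficient privileges would still need to run the statements.

```python
def grant_statements(principal, connection, catalog, schema):
    """Return SQL GRANT statements for using an existing connection and schema."""
    who = f"`{principal}`"  # backticks allow names that contain special characters
    return [
        f"GRANT USE CONNECTION ON CONNECTION {connection} TO {who}",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {who}",
        f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA {catalog}.{schema} TO {who}",
    ]


statements = grant_statements(
    principal="data-engineers",      # hypothetical group name
    connection="gdrive_connection",  # hypothetical connection name
    catalog="main",
    schema="ingest_destination_schema",
)
for statement in statements:
    print(statement)
```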

Create an ingestion pipeline

For the list of supported file formats and connector-specific limitations, see Google Drive connector limitations.

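To manage your Google Drive pipeline as code, use Declarative Automation Bundles. A bundle can contain YAML definitions of jobs and tasks, is managed using the Databricks CLI, and can be shared and run in different target workspaces (such as development, staging, and production). For more information, see What are Declarative Automation Bundles?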

Create a new bundle using the Databricks CLI:
Bash
```
databricks bundle init
```
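Add two new resource files to the bundle:
- A pipeline definition file (for example, resources/gdrive_pipeline.yml). See pipeline.ingestion_definition and Examples.
- A job definition file that controls the frequency of data ingestion (for example, resources/gdrive_job.yml).
Deploy the pipeline using the Databricks CLI: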
Bash
```
databricks bundle deploy
```

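Examples

Use these examples to configure your pipeline.

Ingest files as binary (unstructured)

Ingest all files in a Google Drive folder as binary content. Use this approach for PDFs, Office documents, and other files intended for downstream processing, such as in RAG applications.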

YAML (Declarative Automation Bundles)
resources:
  pipelines:
    gdrive_binary_pipeline:
      name: gdrive_binary_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <gdrive-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: drive_folder
              connector_options:
                gdrive_options:
                  entity_type: FILE
                  url: https://drive.google.com/drive/folders/<folder_id>
                  file_ingestion_options:
                    format: BINARYFILE
                    schema_evolution_mode: NONE

Python (Databricks notebook)
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<gdrive-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "drive_folder",
          "connector_options": {
            "gdrive_options": {
              "entity_type": "FILE",
              "url": "https://drive.google.com/drive/folders/<folder_id>",
              "file_ingestion_options": {
                "format": "BINARYFILE",
                "schema_evolution_mode": "NONE"
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
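# create_pipeline is not defined on this page; it is assumed to be a helper defined earlier in the notebook.
create_pipeline(pipeline_spec)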

Ingest structured files with a path filter

Ingest CSV files from a Google Drive folder. Use file_filters to restrict ingestion to files that match a glob pattern.
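
To get a feel for which file names a glob such as *.csv selects, you can try the pattern locally. This is only an illustration that uses Python's fnmatch module as a stand-in. The connector's own glob rules (for example, case sensitivity or how subfolders are handled) are not described on this page and may differ. The file names are made up.

```python
from fnmatch import fnmatchcase

pattern = "*.csv"
names = ["orders.csv", "notes.txt", "summary.CSV"]  # made-up file names

# fnmatchcase compares case-sensitively on every platform.
matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)
```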

YAML (Declarative Automation Bundles)
resources:
  pipelines:
    gdrive_csv_pipeline:
      name: gdrive_csv_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <gdrive-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: csv_files
              connector_options:
                gdrive_options:
                  entity_type: FILE
                  url: https://drive.google.com/drive/folders/<folder_id>
                  file_ingestion_options:
                    format: CSV
                    schema_evolution_mode: NONE
                    file_filters:
                      - path_filter: '*.csv'

Python (Databricks notebook)
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<gdrive-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "csv_files",
          "connector_options": {
            "gdrive_options": {
              "entity_type": "FILE",
              "url": "https://drive.google.com/drive/folders/<folder_id>",
              "file_ingestion_options": {
                "format": "CSV",
                "schema_evolution_mode": "NONE",
                "file_filters": [
                  { "path_filter": "*.csv" }
                ]
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
create_pipeline(pipeline_spec)

Ingest file metadata only

Ingest file metadata (name, size, timestamps, and path) without downloading file content. Use this approach when you need an inventory of files without the overhead of ingesting their content.

YAML (Declarative Automation Bundles)
resources:
  pipelines:
    gdrive_metadata_pipeline:
      name: gdrive_metadata_pipeline
      catalog: main
      schema: ingest_destination_schema
      channel: PREVIEW
      ingestion_definition:
        connection_name: <gdrive-connection>
        objects:
          - table:
              destination_catalog: main
              destination_schema: ingest_destination_schema
              destination_table: file_metadata
              connector_options:
                gdrive_options:
                  entity_type: FILE_METADATA
                  url: https://drive.google.com/drive/folders/<folder_id>
                  file_ingestion_options:
                    format: BINARYFILE
                    schema_evolution_mode: NONE

Python (Databricks notebook)
pipeline_spec = """
{
  "name": "<pipeline-name>",
  "catalog": "main",
  "schema": "ingest_destination_schema",
  "ingestion_definition": {
    "connection_name": "<gdrive-connection>",
    "objects": [
      {
        "table": {
          "destination_catalog": "main",
          "destination_schema": "ingest_destination_schema",
          "destination_table": "file_metadata",
          "connector_options": {
            "gdrive_options": {
              "entity_type": "FILE_METADATA",
              "url": "https://drive.google.com/drive/folders/<folder_id>",
              "file_ingestion_options": {
                "format": "BINARYFILE",
                "schema_evolution_mode": "NONE"
              }
            }
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""
create_pipeline(pipeline_spec)
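
The three notebook specs above differ only in the destination table, entity_type, format, and optional file_filters. As a sketch, the spec can be built as a Python dict and serialized with json.dumps instead of maintaining separate JSON strings. The helper build_gdrive_spec and its parameter names are hypothetical and not part of any Databricks API. The resulting string has the same shape as pipeline_spec in the examples above.

```python
import json


def build_gdrive_spec(name, connection_name, folder_url, destination_table,
                      entity_type="FILE", file_format="BINARYFILE",
                      path_filters=None, catalog="main",
                      schema="ingest_destination_schema"):
    """Build a pipeline spec dict in the same shape as the examples above."""
    file_options = {"format": file_format, "schema_evolution_mode": "NONE"}
    if path_filters:
        # One path_filter entry per glob pattern.
        file_options["file_filters"] = [{"path_filter": p} for p in path_filters]
    return {
        "name": name,
        "catalog": catalog,
        "schema": schema,
        "ingestion_definition": {
            "connection_name": connection_name,
            "objects": [{
                "table": {
                    "destination_catalog": catalog,
                    "destination_schema": schema,
                    "destination_table": destination_table,
                    "connector_options": {
                        "gdrive_options": {
                            "entity_type": entity_type,
                            "url": folder_url,
                            "file_ingestion_options": file_options,
                        }
                    },
                }
            }],
        },
        "channel": "PREVIEW",
    }


# Same values as the CSV example, placeholders included.
csv_spec = build_gdrive_spec(
    name="<pipeline-name>",
    connection_name="<gdrive-connection>",
    folder_url="https://drive.google.com/drive/folders/<folder_id>",
    destination_table="csv_files",
    file_format="CSV",
    path_filters=["*.csv"],
)
pipeline_spec = json.dumps(csv_spec, indent=2)
print(pipeline_spec)
```

Building the dict first lets Python catch structural mistakes, such as an unbalanced brace, before the spec is submitted.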

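Declarative Automation Bundles job definition file

The following is an example job definition file to use with Declarative Automation Bundles. The job runs every day, exactly one day after the previous run.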

YAML (Declarative Automation Bundles)
resources:
  jobs:
    gdrive_job:
      name: gdrive_job

      trigger:
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - <email-address>

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.gdrive_binary_pipeline.id}

Configure file options for ingestion

The file_ingestion_options block controls how files are processed. All options are set inside the gdrive_options.file_ingestion_options block of the pipeline definition.

File filters

Use file_filters to restrict which files are ingested from the source URL.

JSON
"file_ingestion_options": {
  "format": "CSV",
  "file_filters": [
    { "path_filter": "invoices/*.csv" },
    { "modified_after": "2026-01-01T00:00:00" }
  ]
}
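
When the spec is generated in a notebook, the filter list can be assembled from Python values so that the timestamp is always formatted consistently. This is a sketch under assumptions: make_file_filters is a hypothetical helper, and it covers only the two filter keys shown above (path_filter and modified_after).

```python
from datetime import datetime


def make_file_filters(path_globs=(), modified_after=None):
    """Build a file_filters list in the shape shown above."""
    filters = [{"path_filter": glob} for glob in path_globs]
    if modified_after is not None:
        # A naive datetime renders as second-precision ISO 8601 text,
        # matching the format used in the example.
        filters.append(
            {"modified_after": modified_after.isoformat(timespec="seconds")}
        )
    return filters


file_ingestion_options = {
    "format": "CSV",
    "file_filters": make_file_filters(
        path_globs=["invoices/*.csv"],
        modified_after=datetime(2026, 1, 1),
    ),
}
print(file_ingestion_options)
```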

For the complete file_ingestion_options parameter reference, see the Google Drive connector reference.

Schema evolution

Set schema_evolution_mode to control how new columns in input files are handled. The modes match the Auto Loader schema evolution modes. For details, see the Google Drive connector reference.

Schema hints

Override inferred column data types with schema_hints:

JSON
"file_ingestion_options": {
  "format": "CSV",
  "schema_hints": "order_id INT, amount DOUBLE, ts TIMESTAMP"
}
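
If the column types already live in a Python mapping, the hint string can be generated instead of typed by hand. A minimal sketch, assuming the comma-separated "name TYPE" form shown above. The helper to_schema_hints is hypothetical.

```python
def to_schema_hints(column_types):
    """Join {column: type} pairs into a comma-separated schema_hints string."""
    return ", ".join(f"{column} {dtype}" for column, dtype in column_types.items())


hints = to_schema_hints({"order_id": "INT", "amount": "DOUBLE", "ts": "TIMESTAMP"})
file_ingestion_options = {"format": "CSV", "schema_hints": hints}
print(file_ingestion_options["schema_hints"])
```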

For usage details, see Override schema inference with schema hints.

Format-specific options

Specify format-specific options using format_options:

JSON
"file_ingestion_options": {
  "format": "CSV",
  "format_options": {
    "header": "true",
    "sep": ","
  }
}
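
In the example above, the option values are strings, including the boolean "true". The sketch below converts Python values into that form. It assumes that all option values are passed as strings, which this page suggests but does not state. The helper to_format_options is hypothetical.

```python
def to_format_options(**options):
    """Render Python values as the string values used in format_options."""
    rendered = {}
    for key, value in options.items():
        if isinstance(value, bool):
            # Lowercase "true"/"false", as in the example above.
            rendered[key] = "true" if value else "false"
        else:
            rendered[key] = str(value)
    return rendered


format_options = to_format_options(header=True, sep=",")
print(format_options)
```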

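The supported keys are the standard Auto Loader format options. See Format options.

Common patterns

For advanced pipeline configurations, see Common patterns for managed ingestion pipelines.

Next steps

Start, schedule, and set alerts on your pipeline. See Common pipeline maintenance tasks.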
