GitHub connector limitations

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

This page contains information about known limitations of the managed GitHub connector in Lakeflow Connect.

General limitations

  • When you run a scheduled pipeline, alerts don't trigger immediately. Instead, they trigger when the next update runs.
  • When a source table is deleted, the destination table is not deleted automatically. You must delete the destination table manually. This behavior differs from Lakeflow Spark Declarative Pipelines behavior.
  • During source maintenance periods, Databricks might not be able to access your data.
  • If a source table name conflicts with an existing destination table name, the pipeline update fails.
  • Multi-destination pipeline support is API-only.
  • You can optionally rename a table that you ingest. If you rename a table in your pipeline, it becomes an API-only pipeline, and you can no longer edit the pipeline in the UI.
  • Column-level selection and deselection are API-only.
  • If you select a column after a pipeline has already started, the connector does not automatically backfill data for the new column. To ingest historical data, manually run a full refresh on the table.
  • Databricks can't ingest two or more tables with the same name in the same pipeline, even if they come from different source schemas.
  • The connector assumes that cursor columns in the source are monotonically increasing.
  • The connector ingests raw data without transformations. Use downstream Lakeflow Spark Declarative Pipelines for transformations.
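Several of the limitations above (multi-destination pipelines, table renaming, column selection) apply only to pipelines defined through the API. The sketch below builds a pipeline spec that renames a destination table and selects columns. It is a minimal illustration, assuming the spec follows the `ingestion_definition` shape used by other Lakeflow Connect managed connectors; the connection, catalog, schema, and table names are hypothetical.

```python
# Hypothetical sketch of an API-only pipeline spec for a managed connector.
# Field names (ingestion_definition, objects, table, table_configuration)
# follow the shape of other Lakeflow Connect managed connectors and are
# assumptions here, not a confirmed GitHub-connector contract.
import json


def build_pipeline_spec(connection_name: str) -> dict:
    """Build a pipeline spec that renames a table and selects columns.

    Both renaming (destination_table) and column selection
    (include_columns) make the pipeline API-only: it can no longer be
    edited in the UI.
    """
    return {
        "name": "github-ingestion-pipeline",  # illustrative name
        "ingestion_definition": {
            "connection_name": connection_name,
            "objects": [
                {
                    "table": {
                        "source_schema": "my_org",      # illustrative
                        "source_table": "pull_requests",
                        "destination_catalog": "main",  # illustrative
                        "destination_schema": "github",
                        # Renaming the destination table is API-only.
                        "destination_table": "prs_raw",
                        "table_configuration": {
                            # Column selection is API-only; a column added
                            # after the pipeline starts is not backfilled
                            # automatically -- run a full refresh instead.
                            "include_columns": ["id", "state", "created_at"],
                        },
                    }
                }
            ],
        },
    }


spec = build_pipeline_spec("my_github_connection")
print(json.dumps(spec, indent=2))
```

Because the `destination_table` differs from the source table name, a pipeline created from this spec could no longer be edited in the UI.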

Deletes not supported

The GitHub connector doesn't support fetching deletes. This is a GitHub API limitation.

Limited incremental support

Most tables don't support incremental updates because the GitHub API doesn't provide a way to filter records based on a cursor. These tables are fully refreshed on each pipeline update. For a list of tables and their update patterns, see Supported data.

Performance guidance for large organizations

Tables such as commits, pull_requests, and issues can contain millions of records in large organizations. Because these tables are fully refreshed on every pipeline run, ingestion cost scales with organization size and pipeline frequency.

To reduce per-run volume:

  • Use column selection to limit the columns ingested for these tables.
  • Use a lower pipeline frequency for pipelines that include high-volume tables.
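As a sketch of the second recommendation, the helper below picks a daily trigger for pipelines that include fully refreshed high-volume tables and an hourly trigger otherwise. The `quartz_cron_expression` and `timezone_id` fields follow the Databricks Jobs API schedule shape; the pipeline names and the specific cron choices are illustrative assumptions.

```python
# A sketch of lowering pipeline frequency for high-volume tables, assuming
# the Databricks Jobs API schedule shape (quartz_cron_expression,
# timezone_id). Pipeline names and cron choices are illustrative.


def schedule_for(pipeline_name: str, high_volume: bool) -> dict:
    """Return a schedule: daily for pipelines with fully refreshed
    high-volume tables (commits, pull_requests, issues), hourly otherwise."""
    # Daily at 02:00 for high-volume full refreshes, hourly on the hour
    # for everything else.
    cron = "0 0 2 * * ?" if high_volume else "0 0 * * * ?"
    return {
        "name": pipeline_name,
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        },
    }


print(schedule_for("github-high-volume", high_volume=True))
print(schedule_for("github-incremental", high_volume=False))
```

Running high-volume tables in a separate, less frequent pipeline keeps the cost of repeated full refreshes from scaling with the frequency you want for the cheaper incremental tables.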

Supported data

Tables with incremental updates

The following tables support incremental updates:

  • repositories
  • audit_logs: Organization accounts only. On the github.com Free plan, audit log history is limited to 90 days.

Tables with batch updates only

The following tables are fully refreshed on each pipeline update (non-incremental):

  • branches
  • collaborators
  • commits
  • deployments
  • deployment_statuses
  • discussions
  • issues
  • labels
  • milestones
  • org_members
  • pull_request_commits
  • pull_request_review_comments
  • pull_request_reviews
  • pull_requests
  • releases
  • tags
  • team_members
  • teams
  • workflows