GitHub connector limitations
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
This page contains information about known limitations of the managed GitHub connector in Lakeflow Connect.
General limitations
- When you run a scheduled pipeline, alerts don't trigger immediately. Instead, they trigger when the next update runs.
- When a source table is deleted, the destination table is not automatically deleted. You must delete the destination table manually. This differs from Lakeflow Spark Declarative Pipelines behavior.
- During source maintenance periods, Databricks might not be able to access your data.
- If a source table name conflicts with an existing destination table name, the pipeline update fails.
- Multi-destination pipeline support is API-only.
- You can optionally rename a table that you ingest. If you rename a table in your pipeline, it becomes an API-only pipeline, and you can no longer edit the pipeline in the UI.
- Column-level selection and deselection are API-only.
- If you select a column after a pipeline has already started, the connector does not automatically backfill data for the new column. To ingest historical data, manually run a full refresh on the table.
- Databricks can't ingest two or more tables with the same name in the same pipeline, even if they come from different source schemas.
- The connector assumes that cursor columns in the source are monotonically increasing.
- The connector ingests raw data without transformations. Use downstream Lakeflow Spark Declarative Pipelines for transformations.
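Because table renaming and column selection are API-only, they are expressed in the pipeline definition rather than the UI. The following is a minimal sketch of such a definition, assuming the payload shape used by other Lakeflow Connect managed connectors; the connection name, schema, and column names are illustrative placeholders, not values from this page.

```python
import json

# Hypothetical sketch of an ingestion pipeline spec using the two API-only
# features above: renaming a destination table and selecting columns.
# Field names mirror other Lakeflow Connect managed connectors; all values
# here are placeholders.
pipeline_spec = {
    "name": "github-ingestion-pipeline",  # illustrative name
    "ingestion_definition": {
        "connection_name": "my_github_connection",  # assumed connection
        "objects": [
            {
                "table": {
                    "source_schema": "my_org/my_repo",  # illustrative source
                    "source_table": "issues",
                    "destination_catalog": "main",
                    "destination_schema": "github_raw",
                    # Renaming the destination table makes this an
                    # API-only pipeline; it can no longer be edited in the UI.
                    "destination_table": "gh_issues",
                    "table_configuration": {
                        # Column selection is likewise API-only.
                        "include_columns": ["id", "title", "state", "updated_at"],
                    },
                }
            }
        ],
    },
}

body = json.dumps(pipeline_spec, indent=2)
print(body)
```

A spec like this would be submitted through the Databricks pipelines API; check the current API reference for the exact field names before relying on this shape.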
Deletes not supported
The GitHub connector doesn't support fetching deletes. This is a GitHub API limitation.
Limited incremental support
Most tables don't support incremental updates because the GitHub API doesn't provide a way to filter records based on a cursor. These tables are fully refreshed on each pipeline update. For a list of tables and their update patterns, see Supported data.
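When you do need to force a fresh load of a single table, for example to backfill history for a newly selected column, you can request it through a pipeline update. The sketch below assumes the `full_refresh_selection` field of the Databricks pipeline updates endpoint; the pipeline ID and table name are placeholders.

```python
import json

# Hedged sketch: requesting a manual full refresh of one table via the
# pipeline updates endpoint. Only the named tables are reloaded from
# scratch; the IDs and table name below are placeholders.
PIPELINE_ID = "0123-example-pipeline-id"  # placeholder
update_request = {
    "full_refresh_selection": ["issues"],
}
endpoint = f"/api/2.0/pipelines/{PIPELINE_ID}/updates"
payload = json.dumps(update_request)
print(endpoint, payload)
```

Note that a full refresh of a large table re-ingests all of its data, so use this sparingly for high-volume tables.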
Performance guidance for large organizations
Tables such as commits, pull_requests, and issues can contain millions of records in large organizations. Because these tables are fully refreshed on every pipeline run, ingestion cost scales with organization size and pipeline frequency.
To reduce per-run volume:
- Use column selection to limit the columns ingested for these tables.
- Use a lower pipeline frequency for pipelines that include high-volume tables.
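Lowering pipeline frequency is typically done by scheduling the pipeline from a job. The sketch below assumes the `schedule` and `pipeline_task` fields of the Databricks Jobs API; the cron expression, job name, and pipeline ID are illustrative.

```python
import json

# Hedged sketch: a job that triggers the ingestion pipeline once per day
# instead of hourly, reducing how often high-volume tables are fully
# refreshed. Field names follow the Databricks Jobs API; values are
# placeholders.
job_spec = {
    "name": "github-ingestion-daily",  # illustrative name
    "schedule": {
        # Quartz cron: run once daily at 02:00 UTC.
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_ingestion",
            "pipeline_task": {"pipeline_id": "0123-example-pipeline-id"},
        }
    ],
}
print(json.dumps(job_spec, indent=2))
```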
Supported data
Tables with incremental updates
The following tables support incremental updates:
- repositories
- audit_logs: Organization accounts only. On the github.com free plan, audit log history is limited to 90 days.
Tables with batch updates only
The following tables are fully refreshed on each pipeline update (non-incremental):
- branches
- collaborators
- commits
- deployments
- deployment_statuses
- discussions
- issues
- labels
- milestones
- org_members
- pull_request_commits
- pull_request_review_comments
- pull_request_reviews
- pull_requests
- releases
- tags
- team_members
- teams
- workflows