Use Git with Lakeflow Jobs
Job tasks can check out source code directly from a remote Git repository.
The following task types support remote Git repositories:
- Notebooks
- Python scripts
- SQL files
- data build tool (dbt) projects
All tasks in a job must reference the same commit in the remote repository. When a job run begins, Databricks takes a snapshot of the specified branch or commit, so that all tasks in that run use the same version of the code.
When you view the run history of a task that runs code stored in a remote Git repository, the Task run details pane includes Git details, including the commit SHA associated with the run. See View task run history.
Tasks configured to use a remote Git repository cannot write to workspace files. These tasks must write temporary data to ephemeral storage attached to the driver node and persistent data to a volume or table.
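The split between ephemeral and persistent writes can be sketched as follows. This is an illustrative sketch: the volume path is a hypothetical Unity Catalog volume, not a path from this document.

```python
import os
import tempfile

# Ephemeral scratch data: write to storage attached to the driver node.
# This data is discarded when the cluster terminates.
scratch_dir = tempfile.mkdtemp()
scratch_path = os.path.join(scratch_dir, "intermediate.csv")
with open(scratch_path, "w") as f:
    f.write("id,value\n1,42\n")

# Persistent data: write to a volume or table instead of workspace files.
# The path below is a hypothetical Unity Catalog volume path.
persistent_path = "/Volumes/main/default/my_volume/output.csv"
```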
Using a Git repository source vs. using Git folders
This page discusses tasks that can pull source code directly from a remote Git repository. Workspaces also support a feature called Git folders, where a folder in your workspace is synced with a Git repository. A task can use a Git folder as its source. However, you must manage syncing with the repository. Using a remote Git repository as described here automatically pulls new source, if available, at job run time.
Databricks recommends referencing workspace paths in Git folders only for rapid iteration and testing during development. For staging and production jobs, configure tasks to reference a remote Git repository instead.
Configure a Git provider for a job
The jobs UI has a dialog to configure a remote Git repository. This dialog is accessible from the Job details pane under the Git heading, or in any task configured to use a Git provider. To access the dialog, click Add Git settings in the Job details pane.
In the Git dialog (labeled Git information if accessed during task configuration), enter the following details:
- The Git repository URL.
- Select your Git provider from the drop-down list.
- In the Git reference field, enter the identifier for a branch, tag, or commit that corresponds to the version of the source code you want to run.
- Select branch, tag, or commit from the drop-down.
You must specify only one of the following:
- branch: The name of the branch, for example, main.
- tag: The tag's name, for example, release-1.0.0.
- commit: The hash of a specific commit, for example, e0056d01.
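In the Jobs API, the same choice appears as mutually exclusive git_branch, git_tag, and git_commit fields on the git_source object. A minimal sketch in Python, using the example values above (the repository URL is a placeholder):

```python
# Sketch of a git_source object for the Jobs API. Exactly one of
# git_branch, git_tag, or git_commit may be set per job.
git_source = {
    "git_url": "https://github.com/example/my-repo",
    "git_provider": "gitHub",
    "git_branch": "main",
    # ...or "git_tag": "release-1.0.0"
    # ...or "git_commit": "e0056d01"
}

# Validate that exactly one Git reference is present.
refs = [k for k in git_source if k in ("git_branch", "git_tag", "git_commit")]
assert len(refs) == 1, "Specify exactly one Git reference"
```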
The dialog might prompt you with the following: Git credentials for this account are missing. Add credentials. You must configure a remote Git repository before using it as a reference. See Configure Git integration for Git folders.
Sparse checkout for large repositories
For large repositories, you can use sparse checkout to import only specific directories rather than the full repository. Sparse checkout reduces checkout time and resource usage per job run.
However, improper configuration can cause cache fragmentation, which degrades execution times across your entire workspace. This section describes trade-offs and issues that might arise when using sparse checkout.
How Databricks caches repository checkouts
Databricks caches each Git checkout based on four values:
- Workspace
- Repository URL
- Exact commit hash
- Fingerprint of the sparse checkout pattern (the exact set of folder paths)
Any job run that matches all four criteria reuses a cache entry, which remains valid for up to one week. For example, if three different jobs share the same workspace, repository URL, commit, and sparse checkout pattern, they all reuse the same cache entry until a new commit changes the hash or the entry expires after one week.
Every unique sparse checkout pattern creates a separate fingerprint, and therefore a separate cache entry. If 20 users each add a custom folder to their pattern, the system creates 20 distinct cache keys and imports the shared folder tree 20 times, multiplying load on your workspace. Instead, create a single sparse checkout pattern that covers all 20 folders (for example, a shared parent folder) so that runs hit one cache entry more often and your jobs perform better. The trade-off is a larger number of files in each checkout.
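The cache-key behavior described above can be sketched as follows. This is an illustrative model, not the actual Databricks implementation; the hashing scheme and function names are assumptions made for the example.

```python
import hashlib

def pattern_fingerprint(patterns):
    # Canonicalize by sorting so the same set of folder paths always
    # produces the same fingerprint, regardless of order.
    canonical = "\n".join(sorted(patterns))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def cache_key(workspace, repo_url, commit, patterns):
    # The four values that identify a cached checkout.
    return (workspace, repo_url, commit, pattern_fingerprint(patterns))

# Two jobs with the same four values share one cache entry:
a = cache_key("ws-1", "https://github.com/example/my-repo", "e0056d01",
              ["src/models", "src/utils"])
b = cache_key("ws-1", "https://github.com/example/my-repo", "e0056d01",
              ["src/utils", "src/models"])

# Adding even one folder changes the pattern fingerprint, creating a
# separate cache entry and a separate full import:
c = cache_key("ws-1", "https://github.com/example/my-repo", "e0056d01",
              ["src/models", "src/utils", "src/team-x"])
```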
Decide whether to use sparse checkout
Only enable sparse checkout if your use case meets both of the following criteria:
- Size: Your repository is large (for example, it exceeds 2,500 files).
- Stable targeting: The target branch is updated infrequently (for example, about one commit per hour or less). Avoid branches that change rapidly due to automated CI/CD workflows.
If you use sparse checkout, your organization should also adopt one or both of the following pattern strategies:
- Standardization: Use three or fewer shared checkout patterns across the organization to maximize cache hits.
- Micro-targeting: Structure patterns so that each targets a small number of files. For best performance, target fewer than 200 files.
These strategies help minimize your import rate.
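To check a candidate pattern against the micro-targeting guideline, you can count the files it would import from a local clone. This is a hypothetical helper, not part of any Databricks tooling:

```python
from pathlib import Path

def files_in_patterns(repo_root, patterns):
    # Count every file under each pattern directory in a local clone.
    # Each pattern is a directory path relative to the repository root.
    root = Path(repo_root)
    return sum(
        1
        for pattern in patterns
        for f in (root / pattern).rglob("*")
        if f.is_file()
    )
```

Run this against a local clone before standardizing a pattern; if the count is well above 200 files, consider narrowing the pattern.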
Calculate your import rate
Before enabling sparse checkout, estimate your projected Files Per Hour import rate. Limits apply at the workspace level across all jobs and users.
Files Per Hour = Job Runs Per Hour × Cache Miss Rate × Files Imported Per Miss
| Factor | What drives it |
|---|---|
| Job Runs Per Hour | Trigger frequency across all users |
| Cache Miss Rate | Commit frequency on the target branch and the number of unique sparse patterns |
| Files Imported Per Miss | Total repository size or sparse checkout subset size |
Example: 180 runs/hour × 10% miss rate × 6,000 files/miss = 108,000 files/hour
Compare your result against these thresholds:
| Files imported per hour | Expected workspace impact |
|---|---|
| Below 150,000 | Normal operation |
| 150,000 – 300,000 | Degraded performance. Some jobs may experience delays or failures. |
| Above 300,000 | Jobs do not complete reliably. |
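The calculation and threshold check can be expressed directly, using the worked example from above:

```python
def files_per_hour(runs_per_hour, cache_miss_rate, files_per_miss):
    # Files Per Hour = Job Runs Per Hour x Cache Miss Rate x Files Imported Per Miss
    return runs_per_hour * cache_miss_rate * files_per_miss

def workspace_impact(fph):
    # Thresholds from the table above.
    if fph < 150_000:
        return "Normal operation"
    if fph <= 300_000:
        return "Degraded performance"
    return "Jobs do not complete reliably"

# The worked example: 180 runs/hour at a 10% miss rate, 6,000 files per miss.
rate = files_per_hour(180, 0.10, 6_000)  # 108,000 files/hour
```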
Best practices
Standardize patterns
- Do: Publish three or fewer approved sparse patterns per repository. Shared patterns consolidate load and maximize cache hits.
- Don't: Allow custom per-team patterns. Even one extra folder creates a new cache entry and triggers a full re-import.
Manage commit churn
- Do: Point jobs at a stable release branch. Batch merges into scheduled release windows so multiple runs share the same cached commit.
- Don't: Use sparse checkouts with frequently updated branches like master or main. Because the cache is based on the exact commit hash, every new commit invalidates the cache and causes a full re-import for each job run.
Manage load
- Do: Remove large binaries, generated artifacts, and data files from source control to reduce repository size, regardless of whether you use sparse checkout.
- Don't: Leave redundant jobs running at high frequency. Lower trigger frequency for jobs that don't require continuous execution, stagger schedules, or consolidate jobs that share the same checkout.
Manage commit churn with a release branch
When jobs target a fast-moving branch like master or main, the commit hash changes frequently, causing cache misses on nearly every run. Using a dedicated release branch that updates on a fixed schedule improves cache hit rates.
By pointing all jobs to an hourly release branch, all runs within that hour resolve to the same commit hash and share the same cache entry.
To configure a release branch:
1. Create a long-lived branch (for example, release-candidate) in your Git repository.
2. Automate updating this branch to match master on a fixed schedule, such as the top of every hour.
3. Configure your Git-backed jobs to use release-candidate as their target Git reference.
Review these trade-offs before implementing:
| Consideration | Description |
|---|---|
| Commit lag | Jobs run against code up to one hour behind |
| Failure window | If the release cut job fails, the branch is not updated for that hour and jobs continue running against the previous commit. Databricks recommends alerting on the cut job. |
Example: automate with GitHub Actions
The following GitHub Actions workflow automates an hourly release branch cut.
Step 1: Commit a .github/workflows/cut-release-branch.yml file to your repository:
```yaml
name: Cut Hourly Release Candidate
on:
  schedule:
    - cron: '0 * * * *'
  workflow_dispatch:
jobs:
  update-branch:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout main branch
        uses: actions/checkout@v4
        with:
          ref: main
          fetch-depth: 0
      - name: Update release-candidate branch
        run: |
          git push origin HEAD:release-candidate --force
```
Step 2: Manually trigger the GitHub Action to verify that the release-candidate branch is created.
Step 3: Update your existing jobs to use release-candidate as the target Git reference.
Enable sparse checkout using the Jobs API
To enable sparse checkout, include a sparse_checkout block inside git_source when creating or updating a job:
```json
{
  "git_source": {
    "git_url": "https://github.com/example/my-repo",
    "git_provider": "gitHub",
    "git_branch": "release-candidate",
    "sparse_checkout": {
      "patterns": ["src/models", "src/utils"]
    }
  }
}
```
Each string in patterns is a directory path relative to the repository root. All files within each specified directory are included in the checkout.
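Putting the pieces together, a job-creation payload can be built in Python and submitted to the Jobs API (POST /api/2.1/jobs/create). This sketch only constructs the payload; the job name is a hypothetical placeholder, and you would send the body with your usual authenticated HTTP client:

```python
import json

# Hypothetical job-creation payload combining a release branch with
# sparse checkout. The job name and repository URL are placeholders.
payload = {
    "name": "nightly-models",
    "git_source": {
        "git_url": "https://github.com/example/my-repo",
        "git_provider": "gitHub",
        "git_branch": "release-candidate",
        "sparse_checkout": {
            # Directory paths relative to the repository root.
            "patterns": ["src/models", "src/utils"],
        },
    },
}

body = json.dumps(payload)
# Send `body` as the JSON body of an authenticated POST to
# https://<your-workspace-host>/api/2.1/jobs/create
```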