Use Git with Lakeflow Jobs

Job tasks can check out source code directly from a remote Git repository.

The following task types support remote Git repositories:

  • Notebooks
  • Python scripts
  • SQL files
  • data build tool (dbt) projects

All tasks in a job must reference the same commit in the remote repository. When a job run begins, Databricks takes a snapshot of the specified branch or commit, so that all tasks in that run use the same version of the code.

When you view the run history of a task that runs code stored in a remote Git repository, the Task run details pane includes Git details, including the commit SHA associated with the run. See View task run history.

note

Tasks configured to use a remote Git repository cannot write to workspace files. These tasks must write temporary data to ephemeral storage attached to the driver node and persistent data to a volume or table.

Using a Git repository source vs. using Git folders

This page discusses tasks that pull source code directly from a remote Git repository. Workspaces also support a feature called Git folders, where a folder in your workspace is synced with a Git repository. A task can use a Git folder as its source, but you must then manage syncing the folder with the repository yourself. Using a remote Git repository as described on this page automatically pulls new source, if available, at job run time.

Databricks recommends referencing workspace paths in Git folders only for rapid iteration and testing during development. For staging and production jobs, configure tasks to reference a remote Git repository instead.

Configure a Git provider for a job

The jobs UI has a dialog to configure a remote Git repository. This dialog is accessible from the Job details pane under the Git heading, or in any task configured to use a Git provider. To access the dialog, click Add Git settings in the Job details pane.

In the Git dialog (labeled Git information if accessed during task configuration), enter the following details:

  • Enter the Git repository URL.
  • Select your Git provider from the drop-down list.
  • In the Git reference field, enter the branch name, tag, or commit hash that corresponds to the version of the source code you want to run.
  • Select branch, tag, or commit from the drop-down to indicate the reference type.

You must specify only one of the following:

  • branch: The name of the branch, for example, main.
  • tag: The tag's name, for example, release-1.0.0.
  • commit: The hash of a specific commit, for example, e0056d01.
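If you configure the same settings through the Jobs API, the Git reference maps to exactly one of the git_branch, git_tag, or git_commit fields of git_source. A minimal sketch for the tag case (the repository URL is a placeholder):

JSON
{
  "git_source": {
    "git_url": "https://github.com/example/my-repo",
    "git_provider": "gitHub",
    "git_tag": "release-1.0.0"
  }
}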
note

The dialog might prompt you with the following: Git credentials for this account are missing. Add credentials. You must configure Git credentials before referencing a remote Git repository. See Configure Git integration for Git folders.


Sparse checkout for large repositories

For large repositories, you can use sparse checkout to import only specific directories rather than the full repository. Sparse checkout reduces checkout time and resource usage per job run.

However, improper configuration can cause cache fragmentation, which degrades execution times across your entire workspace. This section describes trade-offs and issues that might arise when using sparse checkout.

How Databricks caches repository checkouts

Databricks caches each Git checkout based on four values:

  • Workspace
  • Repository URL
  • Exact commit hash
  • Fingerprint of the sparse checkout pattern (the exact set of folder paths)

Any job run that matches all four criteria reuses the cache entry, which remains valid for up to one week. For example, if three different jobs all match the same four criteria, they share one cache entry until a new commit lands on the reference (or the week expires).

Every unique sparse checkout pattern creates a separate fingerprint, and therefore a separate cache entry. If 20 users each add a custom folder to their pattern, the system creates 20 distinct cache keys and imports the shared folder tree 20 times, multiplying load on your workspace. Instead, create a single sparse checkout pattern that includes all 20 folders (for example, their common parent folder) so that every job resolves to the same cache entry. The trade-off is a larger number of files in each checkout.
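As an illustration only (this is not Databricks' actual implementation), you can think of the cache key as a hash over the four components, with the sparse checkout pattern set canonicalized by sorting. Any change to any component, including a single extra folder in the pattern, produces a different key:

```python
import hashlib

def checkout_cache_key(workspace, repo_url, commit_sha, patterns):
    """Illustrative cache-key fingerprint over the four components.

    Patterns are sorted so that the same set of folders always maps
    to the same key, regardless of the order they were listed in.
    """
    canonical = "\n".join([workspace, repo_url, commit_sha, *sorted(patterns)])
    return hashlib.sha256(canonical.encode()).hexdigest()

shared = checkout_cache_key(
    "ws-1", "https://github.com/example/my-repo", "e0056d01", ["src"]
)
custom = checkout_cache_key(
    "ws-1", "https://github.com/example/my-repo", "e0056d01", ["src", "teams/alice"]
)
# The extra folder yields a different key, and therefore a separate
# cache entry and a separate full import.
```
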

Decide whether to use sparse checkout

Only enable sparse checkout if your use case meets both of the following criteria:

  • Size: Your repository is large (for example, it exceeds 2,500 files).
  • Stable targeting: The target branch is updated infrequently (for example, about one commit per hour or less). Avoid branches that change rapidly due to automated CI/CD workflows.

If you use sparse checkout, your organization should also adopt one or both of the following pattern strategies:

  • Standardization: Use three or fewer shared checkout patterns across the organization to maximize cache hits.
  • Micro-targeting: Structure patterns so that each targets a small number of files. For best performance, target fewer than 200 files.

Both strategies help minimize your import rate.

Calculate your import rate

Before enabling sparse checkout, estimate your projected Files Per Hour import rate. Limits apply at the workspace level across all jobs and users.

Files Per Hour = Job Runs Per Hour × Cache Miss Rate × Files Imported Per Miss

Each factor is driven by the following:

  • Job Runs Per Hour: trigger frequency across all users.
  • Cache Miss Rate: commit frequency on the target branch and the number of unique sparse patterns.
  • Files Imported Per Miss: total repository size, or the sparse checkout subset size.

Example: 180 runs/hour × 10% miss rate × 6,000 files/miss = 108,000 files/hour

Compare your result against these thresholds:

  • Below 150,000 files per hour: Normal operation.
  • 150,000–300,000 files per hour: Degraded performance. Some jobs may experience delays or failures.
  • Above 300,000 files per hour: Jobs do not complete reliably.
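The formula and thresholds above can be sketched as a quick back-of-the-envelope check. The threshold boundaries and the example numbers come from this page; the function and label names are arbitrary:

```python
def files_per_hour(job_runs_per_hour, cache_miss_rate, files_per_miss):
    """Projected workspace-level import rate: runs/hour x miss rate x files/miss."""
    return job_runs_per_hour * cache_miss_rate * files_per_miss

def workspace_impact(rate):
    """Map a files-per-hour import rate to the expected workspace impact."""
    if rate < 150_000:
        return "Normal operation"
    if rate <= 300_000:
        return "Degraded performance"
    return "Jobs do not complete reliably"

# The example above: 180 runs/hour x 10% miss rate x 6,000 files/miss
rate = files_per_hour(180, 0.10, 6_000)
print(rate, workspace_impact(rate))  # 108000.0 Normal operation
```
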

Best practices

Standardize patterns

  • Do: Publish three or fewer approved sparse patterns per repository. Shared patterns consolidate load and maximize cache hits.
  • Don't: Allow custom per-team patterns. Even one extra folder creates a new cache entry and triggers a full re-import.

Manage commit churn

  • Do: Point jobs at a stable release branch. Batch merges into scheduled release windows so multiple runs share the same cached commit.
  • Don't: Use sparse checkouts with frequently updated branches like master or main. Because the cache is based on the exact commit hash, every new commit invalidates the cache and causes a full re-import for each job run.

Manage load

  • Do: Remove large binaries, generated artifacts, and data files from source control. A smaller repository reduces the number of files imported on every cache miss.
  • Don't: Leave redundant jobs running at high frequency. Lower the trigger frequency for jobs that don't require continuous execution, stagger schedules, or consolidate jobs that share the same checkout.

Manage commit churn with a release branch

When jobs target a fast-moving branch like master or main, the commit hash changes frequently, causing cache misses on nearly every run. Using a dedicated release branch that updates on a fixed schedule improves cache hit rates.

By pointing all jobs to an hourly release branch, all runs within that hour resolve to the same commit hash and share the same cache entry.

To configure a release branch:

  1. Create a long-lived branch (for example, release-candidate) in your Git repository.
  2. Automate updating this branch to match master on a fixed schedule, such as the top of every hour.
  3. Configure your Git-backed jobs to use release-candidate as their target Git reference.

Review these trade-offs before implementing:

  • Commit lag: Jobs run against code up to one hour behind master. This is acceptable for most batch workloads, but might not suit jobs that require the latest commit.
  • Failure window: If the release cut job fails, the branch is not updated for that hour, and jobs continue running against the previous commit. Databricks recommends alerting on the cut job.

Example: automate with GitHub Actions

The following GitHub Actions workflow automates an hourly release branch cut.

Step 1: Commit a .github/workflows/cut-release-branch.yml file to your repository:

YAML
name: Cut Hourly Release Candidate

on:
  schedule:
    - cron: '0 * * * *'
  workflow_dispatch:

jobs:
  update-branch:
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - name: Checkout main branch
        uses: actions/checkout@v4
        with:
          ref: main
          fetch-depth: 0

      - name: Update release-candidate branch
        run: |
          git push origin HEAD:release-candidate --force

Step 2: Manually trigger the GitHub Action to verify that the release-candidate branch is created.

Step 3: Update your existing jobs to use release-candidate as the target Git reference.

Enable sparse checkout using the Jobs API

To enable sparse checkout, include a sparse_checkout block inside git_source when creating or updating a job:

JSON
{
  "git_source": {
    "git_url": "https://github.com/example/my-repo",
    "git_provider": "gitHub",
    "git_branch": "release-candidate",
    "sparse_checkout": {
      "patterns": ["src/models", "src/utils"]
    }
  }
}

Each string in patterns is a directory path relative to the repository root. All files within each specified directory are included in the checkout.