Load test your Databricks Apps agent

Load testing finds the maximum queries per second (QPS) your Databricks Apps agent can sustain before performance degrades. This page shows you how to do the following:

  1. Deploy a mock version of your agent to isolate infrastructure throughput from LLM latency.
  2. Run a ramp-to-saturation load test with Locust.
  3. Analyze results with an interactive dashboard.

You can follow the AI-assisted path using a Claude Code skill, or set up each step manually.

Animated preview of the load testing dashboard showing QPS, latency, and ramp progression charts across compute configurations.

Requirements

  • A Databricks workspace with Databricks Apps enabled.
  • An agent app deployed (or ready to deploy) on Databricks Apps using the OpenAI Agents SDK, LangGraph, or a custom framework. See Author an AI agent and deploy it on Databricks Apps.
  • The Databricks CLI installed and authenticated. See Install or update the Databricks CLI.
  • Python 3.10+ with uv package manager.
  • (For the AI-assisted path) Claude Code installed.
  • (For load tests longer than ~1 hour) A service principal with M2M OAuth credentials (client_id and client_secret). See Authorize service principal access to Databricks with OAuth.
    • For short load tests (less than ~1 hour), your existing user (U2M) OAuth credentials from databricks auth login work fine. For longer tests, use M2M OAuth with a Databricks service principal — U2M tokens expire during long runs and cause mid-test failures. Creating a Databricks service principal requires workspace admin access.

AI-assisted setup

If you use Claude Code, the /load-testing skill automates the workflow. It reads your agent code, generates a mock, creates load testing scripts, and walks you through deployment.

Tell Claude Code to do it for you:

Prompt
Clone https://github.com/databricks/app-templates and run the /load-testing skill against the {your-template} template.

Or follow the steps below.

Step 1: Clone an agent template

The /load-testing skill is included in the databricks/app-templates repository, both as the top-level agent-load-testing skill and pre-synced into every individual agent template. If you already have a project from app-templates, you already have the skill.

Clone the repo and change into the template directory for the agent you want to load test:

Bash
git clone https://github.com/databricks/app-templates.git
cd app-templates/{your-template}

Step 2: Run the load testing skill

In Claude Code, run:

Text
/load-testing

The skill interactively walks you through the following steps. You can skip mocking to test your real agent, or skip deployment if your apps are already running.

  1. Gathering parameters: asks about your deployment status, compute sizes, worker configurations, and OAuth credentials.
  2. Creating load test scripts: generates locustfile.py, run_load_test.py, and dashboard_template.py tailored to your project.
  3. Mocking your LLM: creates a mock client specific to your SDK (OpenAI Agents SDK, LangGraph, or custom) that replaces real LLM calls with configurable streaming delays.
  4. Deploying test apps: guides you through deploying multiple app configurations with different compute sizes and worker counts.
  5. Running tests: executes the load test with M2M OAuth authentication and ramp-to-saturation.
  6. Generating results: produces an interactive HTML dashboard with QPS, latency, and failure metrics.

Manual setup

Follow these steps to set up and run load tests without AI assistance.

Step 1: Mock your agent's LLM calls (optional)

Skip this step if you want end-to-end results that include real LLM latency. To measure Databricks Apps infrastructure throughput in isolation, mock the LLM so its per-request latency (typically 1-30 seconds) doesn't become the bottleneck.

A mock returns canned responses with a configurable streaming delay, preserving the full request/response pipeline (SSE streaming, tool dispatch, SDK runner) and swapping out only the LLM. This surfaces the maximum QPS the Databricks Apps platform can deliver and avoids Foundation Model API token costs during load tests.

The mock timing is controlled by two environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| MOCK_CHUNK_DELAY_MS | 10 | Delay in milliseconds between streamed text chunks |
| MOCK_CHUNK_COUNT | 80 | Number of text chunks per response |

With the defaults, each mock response takes approximately 800 ms (10 ms x 80 chunks), significantly faster than a real LLM response (3-15 seconds). Throughput numbers then reflect the platform, not the model.

Create a mock client that replaces the real LLM client. The rest of your agent code stays unchanged, and the approach depends on your SDK. For OpenAI, see the mock_openai_client.py reference implementation in databricks/app-templates. The same pattern adapts to other SDKs.

Create agent_server/mock_openai_client.py — a MockAsyncOpenAI class that implements chat.completions.create() with streaming. It returns tool call chunks instantly (simulating the LLM deciding to call a tool) and text response chunks with configurable delay from MOCK_CHUNK_DELAY_MS and MOCK_CHUNK_COUNT environment variables.
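A minimal sketch of such a mock is shown below. It is simplified for illustration: it yields plain dicts rather than the SDK's ChatCompletionChunk objects, and it omits the instant tool-call chunks described above. The class and attribute names mirror the AsyncOpenAI client surface (client.chat.completions.create) so it can be dropped in via set_default_openai_client.

```python
import asyncio
import os


class _MockCompletions:
    """Mimics client.chat.completions.create(..., stream=True)."""

    async def create(self, *args, **kwargs):
        delay_s = int(os.getenv("MOCK_CHUNK_DELAY_MS", "10")) / 1000
        count = int(os.getenv("MOCK_CHUNK_COUNT", "80"))

        async def stream():
            for i in range(count):
                await asyncio.sleep(delay_s)  # simulated per-chunk latency
                yield {"choices": [{"delta": {"content": f"token {i} "}}]}
            # final chunk signals completion, like a real streamed response
            yield {"choices": [{"delta": {}, "finish_reason": "stop"}]}

        return stream()


class _MockChat:
    def __init__(self):
        self.completions = _MockCompletions()


class MockAsyncOpenAI:
    """Drop-in stand-in for AsyncOpenAI exposing only streaming chat completions."""

    def __init__(self, *args, **kwargs):
        self.chat = _MockChat()
```

A production version should also emit chunks shaped exactly like the SDK's types so downstream parsing code is exercised unchanged; see the reference implementation linked above.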

Swap it into your agent:

Python
from agent_server.mock_openai_client import MockAsyncOpenAI
from agents import set_default_openai_client, set_default_openai_api

set_default_openai_client(MockAsyncOpenAI())
set_default_openai_api("chat_completions")

The rest of your agent code (handlers, tools, streaming logic) stays unchanged.

Step 2: Set up load testing scripts

Create a load-test-scripts/ directory in your project. The load testing framework consists of three scripts that are framework-agnostic and work with any Databricks Apps agent.

Text
<project-root>/
    agent_server/               # Your existing agent code
    load-test-scripts/          # Load testing scripts (create this)
        run_load_test.py        # CLI orchestrator
        locustfile.py           # Locust test with SSE streaming + TTFT tracking
        dashboard_template.py   # Interactive HTML dashboard generator
    load-test-runs/             # Results (auto-created per run)
        <run-name>/
            dashboard.html      # Interactive dashboard
            test_config.json    # Test parameters for reproducibility
            <label>/            # Per-config Locust CSV output

The framework includes the following files:

  • locustfile.py: A Locust load test that sends POST /invocations requests with stream: true, parses SSE streams, tracks time to first token (TTFT) as a custom metric, uses M2M OAuth token exchange with auto-refresh, and implements a StepRampShape that ramps users from step_size to max_users while holding each level for step_duration seconds.
  • run_load_test.py: A CLI orchestrator that tests each app URL sequentially with isolated metrics per configuration. It handles OAuth token refresh, runs a healthcheck and warmup before each test, and saves results to load-test-runs/<run-name>/<label>/.
  • dashboard_template.py: Generates a self-contained HTML dashboard using Chart.js with KPI cards, bar charts (QPS, latency, TTFT by config), QPS ramp progression line charts, and a full results table. Can be run standalone: uv run dashboard_template.py ../load-test-runs/<run-name>/.
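The step-ramp behavior is simple enough to sketch outside Locust. The helper below is a hypothetical stand-in for the logic inside StepRampShape.tick() (which in Locust returns a user count, or None to stop the test): it computes the target concurrent users for a given elapsed time, holds max_users for one final step, then ends.

```python
def ramp_users(run_time_s, step_size=20, step_duration=30, max_users=300):
    """Target concurrent users at elapsed time, or None once the ramp is done."""
    total_ramp_s = (max_users / step_size) * step_duration
    if run_time_s >= total_ramp_s + step_duration:  # hold max one step, then stop
        return None
    step = int(run_time_s // step_duration)
    return min((step + 1) * step_size, max_users)
```

With the defaults this produces 20 users immediately, 40 users after 30 seconds, and so on up to 300.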

Install dependencies

The load testing scripts use their own pyproject.toml inside load-test-scripts/ to avoid polluting your agent's production dependencies. Create load-test-scripts/pyproject.toml:

Toml
[project]
name = "load-test-scripts"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "locust>=2.32,<2.40",
    "urllib3<2.3",
    "requests",
]
note

Pin locust to <2.40. Newer versions (>=2.43) have a known RecursionError that breaks long load tests.

Install from within the load-test-scripts/ directory:

Bash
cd load-test-scripts/
uv sync

Step 3: Deploy test apps with varying configurations

Deploy multiple Databricks Apps with different compute sizes and worker counts to find the optimal configuration for your workload.

The configurations below focus on the sweet spot identified from prior testing. If you want broader coverage, add one config on either side (for example, medium-w1 or large-w12), but the six below are usually enough.

| Compute size | Workers | Suggested app name |
| --- | --- | --- |
| Medium | 2 | <your-app>-medium-w2 |
| Medium | 3 | <your-app>-medium-w3 |
| Medium | 4 | <your-app>-medium-w4 |
| Large | 6 | <your-app>-large-w6 |
| Large | 8 | <your-app>-large-w8 |
| Large | 10 | <your-app>-large-w10 |

Configure compute size

Use the Databricks CLI to set compute size when creating or updating an app:

Bash
# Create a new app with Medium compute
databricks apps create <app-name> --compute-size MEDIUM

# Update an existing app to Large compute
databricks apps update <app-name> --compute-size LARGE

Configure worker count with Declarative Automation Bundles

start-server (via AgentServer.run()) accepts a --workers flag directly. Pass the worker count in the command array using a DAB variable:

YAML
variables:
  app_name:
    default: 'my-agent-medium-w2'
  workers:
    default: '2'

resources:
  apps:
    load_test_app:
      name: ${var.app_name}
      source_code_path: .
      config:
        command: ['uv', 'run', 'start-server', '--workers', '${var.workers}']
        env:
          - name: MOCK_CHUNK_DELAY_MS
            value: '10'
          - name: MOCK_CHUNK_COUNT
            value: '80'

targets:
  medium-w2:
    default: true
    variables:
      app_name: 'my-agent-medium-w2'
      workers: '2'
  large-w8:
    variables:
      app_name: 'my-agent-large-w8'
      workers: '8'

Deploy and verify

Deploy each target with the Databricks CLI:

Bash
databricks bundle deploy --target medium-w2
databricks bundle run load_test_app --target medium-w2

Verify that apps are active before running load tests:

Bash
databricks apps get <app-name> --output json | jq '{app_status, compute_status, url}'
note

Wait for all apps to reach ACTIVE status before proceeding. Apps that are still starting produce misleading results.

Step 4: Run load tests

Set up authentication

Select your authentication based on how long you plan to run:

  • Short tests (less than ~1 hour): use your existing user credentials from databricks auth login. No extra setup required.
  • Long tests (more than ~1 hour, such as overnight runs): use M2M OAuth with a Databricks service principal. U2M tokens expire and break your test mid-run. Creating a Databricks service principal requires workspace admin access.

For M2M OAuth, export the Databricks service principal credentials before running tests:

Bash
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_CLIENT_ID=<your-client-id>
export DATABRICKS_CLIENT_SECRET=<your-client-secret>

Parameters reference

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| --app-url | Yes | | App URL(s) to test (repeatable) |
| --client-id | For long tests | DATABRICKS_CLIENT_ID env | Service principal client ID (M2M OAuth) |
| --client-secret | For long tests | DATABRICKS_CLIENT_SECRET env | Service principal client secret (M2M OAuth) |
| --label | No | Auto-derived from URL | Human-readable label per app (repeatable) |
| --compute-size | No | Auto-detected or medium | Compute size tag per app: medium, large (repeatable) |
| --max-users | No | 300 | Maximum concurrent simulated users |
| --step-size | No | 20 | Users added per ramp step |
| --step-duration | No | 30 | Seconds per ramp step |
| --spawn-rate | No | 20 | User spawn rate (users/sec) |
| --run-name | No | <timestamp> | Name for this run; results saved to load-test-runs/<run-name>/ |
| --dashboard | No | Off | Generate interactive HTML dashboard after tests complete |

Example commands

Quick single-app test (short run — uses your databricks auth login session):

Bash
cd load-test-scripts/

uv run run_load_test.py \
  --app-url https://my-app.aws.databricksapps.com \
  --dashboard --run-name quick-test

Full matrix across the recommended 6 configurations (long run — pass M2M credentials). Pass --compute-size flags in the same order as --app-url:

Bash
uv run run_load_test.py \
  --app-url https://my-app-medium-w2.aws.databricksapps.com \
  --app-url https://my-app-medium-w3.aws.databricksapps.com \
  --app-url https://my-app-medium-w4.aws.databricksapps.com \
  --app-url https://my-app-large-w6.aws.databricksapps.com \
  --app-url https://my-app-large-w8.aws.databricksapps.com \
  --app-url https://my-app-large-w10.aws.databricksapps.com \
  --compute-size medium --compute-size medium --compute-size medium \
  --compute-size large --compute-size large --compute-size large \
  --client-id $DATABRICKS_CLIENT_ID \
  --client-secret $DATABRICKS_CLIENT_SECRET \
  --dashboard --run-name overnight-sweep

Multiple runs for statistical consistency:

Bash
for RUN in r1 r2 r3 r4 r5; do
  uv run run_load_test.py \
    --app-url https://my-app.aws.databricksapps.com \
    --client-id $DATABRICKS_CLIENT_ID \
    --client-secret $DATABRICKS_CLIENT_SECRET \
    --max-users 1000 --step-size 20 --step-duration 10 \
    --run-name my_test_${RUN} --dashboard || break
done

What happens during a run

  1. Healthcheck: verifies the app streams correctly (receives [DONE]).
  2. Warmup: sends sequential requests to warm up the app.
  3. Ramp-to-saturation: steps up concurrent users every step_duration seconds.
  4. Saturation detection: when QPS plateaus despite adding more users, you've hit the throughput ceiling.
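As an illustration of what the Locust test measures per request, a stream parser that records time to first token might look like the following. This is a hypothetical helper, not the shipped locustfile.py; it assumes the server terminates the stream with a data: [DONE] event as described in the healthcheck step.

```python
import time


def parse_sse(lines, start=None):
    """Collect SSE data events and time to first token (TTFT).

    `lines` is any iterable of decoded text lines from the response body;
    '[DONE]' ends the stream. Returns (events, ttft_seconds).
    """
    t0 = start if start is not None else time.monotonic()
    events, ttft = [], None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, other SSE fields
        if ttft is None:
            ttft = time.monotonic() - t0  # first payload observed
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(payload)
    return events, ttft
```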

Estimated duration

Each app under test runs through its own ramp, so total run time scales with the number of configurations in your matrix. Use the formula below to plan your run window.

Duration per app: (max_users / step_size) * step_duration seconds.

With defaults (--max-users 300 --step-size 20 --step-duration 30):

  • 15 steps x 30 seconds = approximately 7.5 minutes per app
  • For the recommended 6-configuration matrix: approximately 45 minutes per run
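The same arithmetic as a small helper (a convenience function of my own naming, not part of the shipped scripts):

```python
def estimated_duration_s(max_users=300, step_size=20, step_duration=30, n_apps=1):
    """Ramp time per app times the number of app configurations, in seconds."""
    steps = max_users / step_size          # 15 steps with the defaults
    return steps * step_duration * n_apps  # 450 s per app with the defaults
```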

Step 5: View and interpret results

  1. Open the dashboard:

    Bash
    open load-test-runs/<run-name>/dashboard.html
  2. (Optional) Regenerate the dashboard from existing data, for example after updating the template:

    Bash
    cd load-test-scripts/
    uv run dashboard_template.py ../load-test-runs/<run-name>/

Dashboard sections

The interactive dashboard includes:

  • KPI cards: best configuration (by peak successful QPS), overall peak QPS, lowest latency, and total requests served.
  • QPS by Config: grouped bar chart showing median QPS, peak QPS excluding failures, and peak QPS side-by-side for each configuration.
  • Latency by Config: grouped bars showing p50 and p95 latency.
  • TTFT by Config: time to first token (p50 and p95).
  • Total Requests Served: request count per configuration.
  • QPS Ramp Progression: line charts with tabs for QPS, QPS (excluding failures), Latency, and Failures. Includes a max-users slider to zoom into lower concurrency ranges. Charts are grouped by compute size (medium and large side-by-side).
  • Full Results Table: all configurations with peak QPS, users at peak, latency percentiles, and failure rate.
  • Test Parameters: configuration summary for reproducibility.

How to interpret results

  • Peak QPS: the maximum QPS achieved at any ramp step. This is the throughput ceiling for that configuration.
  • Users at Peak: the number of concurrent users when peak QPS was achieved. Adding more users beyond this point does not increase throughput.
  • Failure Rate: should be 0% or very low. A high failure rate means the app is overloaded at that concurrency level.
  • QPS Ramp Chart: look for where the line flattens. That's the saturation point: adding more users won't increase throughput.
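Reading the plateau off the ramp chart can also be done programmatically. The helper below is a hypothetical sketch (not part of the shipped dashboard): given per-step QPS values, it reports the first step where the next step's gain falls under a tolerance, i.e. where adding users stops paying off.

```python
def saturation_step(qps_by_step, tolerance=0.05):
    """Index of the first ramp step where QPS stops growing meaningfully.

    A step counts as the saturation point when the next step improves QPS
    by less than `tolerance` (5% by default) despite more users.
    """
    for i in range(len(qps_by_step) - 1):
        if qps_by_step[i] > 0:
            gain = (qps_by_step[i + 1] - qps_by_step[i]) / qps_by_step[i]
            if gain < tolerance:
                return i
    return len(qps_by_step) - 1  # never plateaued within the ramp
```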

Troubleshooting

| Issue | Solution |
| --- | --- |
| Auth token expired mid-test | For tests longer than ~1 hour, switch from U2M to M2M OAuth by passing --client-id and --client-secret |
| Healthcheck fails | Verify the app is ACTIVE: databricks apps get <name> --output json |
| 0 QPS or no results | Check load-test-runs/<run-name>/<label>/locust_output.log for errors |
| Low QPS despite high user count | The app is saturated. Try more workers or larger compute. |
| High failure rate | The app is overloaded. Reduce --max-users or increase workers/compute. |
| Dashboard shows no ramp data | Verify results_stats_history.csv exists in each result subdirectory |

Next steps

  • Test with real LLM calls: skip the mocking step and deploy your actual agent to measure end-to-end latency including LLM response time.
  • Tune worker count: use the test matrix results to find the optimal worker count for your compute size.
  • Tutorial: Evaluate and improve a GenAI application to measure accuracy, relevance, and safety alongside throughput.