
Configure a load test for vector search endpoints

This page provides guidance, example code, and an example notebook for load testing vector search endpoints. Load testing helps you understand the performance and production readiness of a vector search endpoint before it's deployed to production. Load testing can tell you about:

  • Latency at different scaling levels
  • Throughput limits and bottlenecks (requests per second, latency breakdown)
  • Error rates under sustained load
  • Resource utilization and capacity planning

For more information about load testing and related concepts, see Load testing for serving endpoints.

Requirements

Before starting these steps, you must have a deployed vector search endpoint and a service principal with Can Query permissions on the endpoint. See Step 1: Set up service principal authentication.

Download and import a copy of the following files and example notebook to your Databricks workspace:

  • input.json. This is an example of the input.json file that specifies the payload that is sent by all concurrent connections to your endpoint. You can have multiple files if needed. If you use the example notebook, this file is generated automatically from the provided input table.
  • fast_vs_load_test_async_load.py. Upload this script to your workspace (for example, /Workspace/Users/<your-username>/fast_vs_load_test_async_load.py) and set the locust_script_path notebook parameter to its path. This script handles authentication, payload delivery, and debug metrics collection.
  • The following example notebook, which runs the load tests. For best performance, run this notebook on a single-node cluster with a large number of cores (Locust scales across all available CPUs). High memory is recommended for queries with pre-generated embeddings.

Example notebook and quickstart

Use the following example notebook to get started. It supports two exploration modes: a gradual sweep that tests specific concurrency levels you define, and a binary search mode that automatically finds the maximum sustainable QPS (breaking point) in a few steps. All parameters are configured using widgets, so the notebook can run interactively or as a Databricks Job without code edits.

Locust load test notebook


Load testing framework: Locust

Locust is an open-source load testing framework that allows you to do the following:

  • Vary the number of concurrent client connections
  • Control how fast connections spawn
  • Measure endpoint performance throughout the test
  • Auto-detect and use all available CPU cores

The example notebook uses the --processes -1 flag to auto-detect CPU cores and fully utilize them.

If Locust is bottlenecked by the CPU, a message appears in the output.

Step 1: Set up service principal authentication

important

For production-like performance testing, always use OAuth service principal authentication. Service principals provide up to 100ms faster response time and higher request rate limits compared to Personal Access Tokens (PATs).

Create and configure service principal

  1. Create a Databricks service principal. For instructions, see Add service principals to your account.

  2. Grant permissions:

    • Navigate to your vector search endpoint page.
    • Click Permissions.
    • Give the service principal Can Query permissions.
  3. Create an OAuth secret.

    • Go to the service principal details page.
    • Click the Secrets tab.
    • Click Generate secret.
    • Set the lifetime (365 days is recommended for long-term testing).
    • Copy both the Client ID and Secret immediately.
  4. Store credentials securely.

    • Create a Databricks secret scope. For instructions, see Tutorial: Create and use a Databricks secret.
    • As shown in the following code example, store the service principal Client ID as service_principal_client_id and store the OAuth secret as service_principal_client_secret.
    Python
    # In a Databricks notebook. dbutils.secrets has no command for writing
    # secrets, so use the Databricks SDK (or CLI) to store them.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    w.secrets.put_secret(scope="load-test-auth", key="service_principal_client_id", string_value="<CLIENT_ID>")
    w.secrets.put_secret(scope="load-test-auth", key="service_principal_client_secret", string_value="<SECRET>")

Step 2: Configure your load test

Notebook configuration

Configure the notebook parameters using the widgets at the top of the notebook. When running the notebook as a Databricks Job, pass these values as Job parameters. No code edits are needed.

| Parameter | Description | Recommended Value |
| --- | --- | --- |
| endpoint_name | Name of your vector search endpoint | Your endpoint name |
| index_name | Full index name (catalog.schema.index) | Your index name |
| test_table | Source table to sample queries from (catalog.schema.table) | Your index input table |
| query_column | Text column to use for managed embeddings | Leave as text or set to your column name |
| embedding_column | Column containing precomputed embedding vectors. Only used for self-managed embeddings. | Leave blank for managed embeddings |
| sample_size | Number of queries to sample for the test | 1000 |
| target_concurrencies | Comma-separated list of concurrent client counts to test | 5,10,20,50 |
| step_duration_seconds | Duration in seconds per concurrency level. One value applies to all levels, or provide one per level as a comma-separated list. | 300 (5 minutes) |
| secret_scope_name | Name of your Databricks secret scope | Your scope name |
| locust_script_path | Workspace path to the fast_vs_load_test_async_load.py script | /Workspace/Users/<your-username>/fast_vs_load_test_async_load.py |
| output_table | (Optional) Delta table to store results in (catalog.schema.table). Created automatically on first run. | catalog.schema.load_test_results |
| run_name | Name or comment to tag this run for later analysis | A descriptive label |
| exploration_mode | gradual sweeps through target_concurrencies in order. binary_search finds the breaking point automatically (see Breaking point exploration). | gradual |
| max_target_qps | (binary_search only) Upper bound for the QPS search | 500 |
| exploration_steps | (binary_search only) Maximum number of binary search iterations | 8 |
| error_rate_threshold | (binary_search only) Maximum acceptable error rate (%) for a step to be counted as a success | 1.0 |
| num_results | Number of results to return per query | 10 |
| columns_to_return | Comma-separated list of columns to return in query results (for example, id,text). Leave blank to return all columns. | Leave blank for default |

Managed vs. self-managed embeddings

The notebook supports both managed embeddings (where Databricks generates embeddings at query time) and self-managed embeddings (where you pass precomputed vectors directly). Configure the appropriate parameters based on your index type.

| Index type | Parameter to set | Leave unset |
| --- | --- | --- |
| Managed embeddings (Delta Sync index with Databricks-managed embedding model) | query_column — the text column name to use as the query | embedding_column (leave blank) |
| Self-managed embeddings (Delta Sync or Direct Vector Access index with precomputed vectors) | embedding_column — the column containing precomputed embedding vectors | query_column |

note

For managed embedding indexes, the load test measures end-to-end latency including embedding generation time. If your embedding endpoint scales to zero, cold-start overhead will appear in the first test run. See Identify the embedding model bottleneck for how to isolate embedding latency from search latency.

Why 5-10 minutes?

A minimum test duration of 5 minutes is critical.

  • Initial queries may include cold-start overhead.
  • Endpoints need time to reach steady-state performance.
  • Auto-scaling of the model serving endpoints (if enabled) takes time to activate.
  • Short tests miss throttling behaviors under sustained load.

The following table shows recommended test durations depending on your test goal.

| Test type | Test duration | Goals of test |
| --- | --- | --- |
| Quick smoke test | 2-3 minutes | Verify basic functionality |
| Performance baseline | 5-10 minutes | Reliable steady-state metrics |
| Stress testing | 15-30 minutes | Identify resource exhaustion |
| Endurance testing | 1-4 hours | Degradation, latency stability |

Breaking point exploration (binary search mode)

In addition to the gradual sweep (exploration_mode=gradual), the notebook supports an automatic binary search mode that locates the maximum sustainable QPS without requiring you to specify the concurrency levels manually.

How it works

Set exploration_mode=binary_search and specify max_target_qps (for example, 500). The notebook uses Little's Law (concurrency = QPS × avg_latency_sec) to convert each QPS target to an estimated concurrency level, then runs a binary search as follows:

  1. Start at max_target_qps / 2 (250 in the example).
  2. If the error rate is below error_rate_threshold (success), raise the lower bound and try a higher QPS (375, then 500, and so on).
  3. If the error rate exceeds the threshold (failure), lower the upper bound and try halfway between the last success and failure.
  4. Repeat for up to exploration_steps steps (default 8) or until the search range narrows to within 5% of max_target_qps.
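The notebook implements this search internally. The following minimal sketch illustrates the same logic, with a hypothetical run_step function standing in for an actual load test step and a fixed average latency assumed for the Little's Law conversion.

Python
# A minimal sketch of the binary search over target QPS. run_step is a
# hypothetical stand-in that runs one load test step and returns the observed
# error rate (%); in the notebook, each step is a real Locust run.
def find_breaking_point(run_step, max_target_qps=500, avg_latency_sec=0.2,
                        exploration_steps=8, error_rate_threshold=1.0):
    low, high = 0.0, float(max_target_qps)
    target_qps = max_target_qps / 2  # start at half the upper bound
    best_qps = None

    for _ in range(exploration_steps):
        # Little's Law: concurrency = QPS x average latency (seconds)
        concurrency = max(1, round(target_qps * avg_latency_sec))
        error_rate = run_step(target_qps, concurrency)

        if error_rate <= error_rate_threshold:
            best_qps, low = target_qps, target_qps  # success: raise the lower bound
        else:
            high = target_qps                       # failure: lower the upper bound

        # Stop when the remaining range is within 5% of max_target_qps.
        if (high - low) <= 0.05 * max_target_qps:
            break
        target_qps = (low + high) / 2               # try halfway between the bounds

    return best_qps

# Example: simulate an endpoint whose error rate jumps above ~430 QPS.
simulated = lambda qps, concurrency: 0.3 if qps <= 430 else 4.5
print(find_breaking_point(simulated))  # prints 406.25 for this simulated endpoint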

The following table shows how the search converges for a hypothetical endpoint with a breaking point around 430 QPS:

| Step | Target QPS | Error rate | Outcome | New range |
| --- | --- | --- | --- | --- |
| 1 | 250 | 0.1% | SUCCESS | [250, 500] |
| 2 | 375 | 0.3% | SUCCESS | [375, 500] |
| 3 | 437 | 4.5% | FAILURE | [375, 437] |
| 4 | 406 | 0.8% | SUCCESS | [406, 437] |
| 5 | 421 | 2.1% | FAILURE | [406, 421] |

After 5–8 steps the search converges on the breaking point — in this example, roughly 406–421 QPS — with far fewer test runs than an exhaustive sweep.

When to use each mode

| Mode | When to use |
| --- | --- |
| gradual | You already know the expected operating range and want to characterize performance at specific concurrency levels. |
| binary_search | You want to find the maximum sustainable QPS quickly, without knowing the concurrency levels in advance. |

Step 3: Design your query set

The query set should reflect expected production traffic as closely as possible. Specifically, try to match the expected distribution of queries in terms of content, complexity, and diversity.

  • Use realistic queries. Don't use random text such as "test query 1234".

  • Match the expected production traffic distribution. If you expect 80% common queries, 15% medium-frequency queries, and 5% infrequent queries, your query set should reflect that distribution.

  • Match the type of query you expect to see in production. For example, if you expect production queries to use hybrid search or filters, you should also use those in your query set.

    Example query using filters:

    JSON
    {
      "query_text": "wireless headphones",
      "num_results": 10,
      "filters": { "brand": "Sony", "noise_canceling": true }
    }

    Example query using hybrid search:

    JSON
    {
      "query_text": "best noise canceling headphones for travel",
      "query_type": "hybrid",
      "num_results": 10
    }

Query diversity and caching

Vector search endpoints cache several types of query results to improve performance. This caching can affect load test results. For this reason, it's important to pay attention to the diversity of the query set. For example, if you repeatedly send the same set of queries, you're testing the cache, not the actual search performance.

Use identical or few queries when:

  • Your production traffic has high query repetition (for example, "popular products")
  • You're testing cache effectiveness specifically
  • Your application benefits from caching (for example, dashboards with fixed queries)
  • You want to measure best-case cached performance

Example: A product recommendation widget that shows "trending items" - the same query runs thousands of times per hour.

Use diverse queries when:

  • Your production traffic has unique user queries (for example, search engines or chatbots)
  • You want to measure worst-case uncached performance
  • You want to test index scan performance, not cache performance
  • Queries have high cardinality (millions of possible variations)

Example: An e-commerce search where every user types different product searches.

For additional recommendations, see Summary of best practices.

Options for creating a query set

There are three options for creating a diverse query set. There is no one-size-fits-all approach; pick the one that works best for you.

  • (Recommended) Random sampling from the index input table. This is a good general starting point.
  • Sampling from production logs. This is a good start if you have production logs. Keep in mind that queries typically change over time, so refresh the test set regularly to keep it up to date.
  • Generating synthetic queries. This is useful if you don't have production logs or if you are using complex filters.

The following code samples random queries from your index input table.

Python
import pandas as pd

# Read the index input table
input_table = spark.table("catalog.schema.index_input_table").toPandas()

# Sample random rows
n_samples = 1000
if len(input_table) < n_samples:
    print(f"Warning: Only {len(input_table)} rows available, using all")
    sample_queries = input_table
else:
    sample_queries = input_table.sample(n=n_samples, random_state=42)

# Extract the text column (adjust column name as needed)
queries = sample_queries['text_column'].tolist()

# Create query payloads
query_payloads = [{"query_text": q, "num_results": 10} for q in queries]

# Save to input.json (one JSON object per line)
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)

print(f"Created {len(query_payloads)} diverse queries from index input table")

Step 4: Test your payload

Before running the full load test, validate your payload:

  1. In the Databricks workspace, navigate to your vector search endpoint.
  2. In the left sidebar, click Serving.
  3. Select your endpoint.
  4. Click Use, then click Query.
  5. Paste your input.json content into the query box.
  6. Verify the endpoint returns expected results.

This ensures your load test will measure realistic queries, not error responses.

Step 5: Run the load test

Connectivity check and warmup

Before the load test begins, the notebook performs two setup steps:

  1. Connectivity check: Sends a single probe query using the service principal credentials. If the endpoint returns a 401 or 403 error, the notebook fails immediately with a clear PermissionError instead of running a full load test that produces only error data. This saves time when credentials or permissions are misconfigured.

  2. Warmup test (1 minute): Runs a short low-concurrency test that warms up endpoint caches and validates end-to-end request flow. The warmup results are not used for performance metrics. In binary search mode, the warmup latency is also used as the baseline for Little's Law concurrency estimation.
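The notebook performs this check for you. The following sketch shows an equivalent probe using the databricks-vectorsearch Python SDK; the workspace URL, endpoint, index, and column names are placeholders, and the query shape assumes a managed-embeddings index.

Python
from databricks.vector_search.client import VectorSearchClient

# Placeholder values; use the same secret scope and keys configured in Step 1.
client = VectorSearchClient(
    workspace_url="https://<your-workspace-url>",
    service_principal_client_id=dbutils.secrets.get("load-test-auth", "service_principal_client_id"),
    service_principal_client_secret=dbutils.secrets.get("load-test-auth", "service_principal_client_secret"),
)

index = client.get_index(
    endpoint_name="<endpoint_name>",
    index_name="catalog.schema.index",
)

# Send a single probe query. A 401/403 here usually means the service principal
# is missing Can Query permission or the stored credentials are wrong.
try:
    index.similarity_search(
        query_text="connectivity probe",  # for self-managed embeddings, pass query_vector=[...] instead
        columns=["id"],                   # placeholder column name
        num_results=1,
    )
    print("Connectivity check passed")
except Exception as e:
    raise PermissionError(f"Probe query failed: {e}")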

Main load test series

The notebook runs a series of tests with increasing client concurrency:

  • Start: Low concurrency (for example, 5 concurrent clients)
  • Middle: Medium concurrency (for example, 10, 20, or 50 clients)
  • End: High concurrency (for example, over 100 clients)

Each test runs for the duration configured in step_duration_seconds (5-10 minutes recommended).

What the notebook measures

The notebook measures and reports the following:

Latency metrics:

  • P50 (median): Half of queries are faster than this.
  • P95: 95% of queries are faster than this. This is a key SLA metric.
  • P99: 99% of queries are faster than this.
  • Max: Worst-case latency.

Throughput metrics:

  • RPS (requests per second): Successful queries per second.
  • Total queries: Number of completed queries.
  • Success rate: Percentage of successful queries.

Errors:

  • Query failures by type
  • Exception messages
  • Timeout counts

Results storage

If the output_table parameter is set, the notebook stores one row per concurrency level (or per binary search step) into a Unity Catalog Delta table. The table is created automatically on the first run and appended to on subsequent runs. Each row includes run_name, exploration_mode, concurrency, success/failure rates, latency percentiles, RPS, and binary-search-specific fields (bs_step, bs_target_qps, bs_outcome). This lets you compare runs over time using SQL or BI tools.
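For example, once results exist, you can inspect a tagged run from any notebook attached to a cluster. The table name below matches the output_table example above; replace it and the run_name placeholder with your own values.

Python
# Inspect the stored results for one tagged run. The table name matches the
# output_table example above; replace <run_name> with the run_name you set.
results = spark.sql("""
    SELECT *
    FROM catalog.schema.load_test_results
    WHERE run_name = '<run_name>'
    ORDER BY concurrency
""")
display(results)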

Running as a Databricks Job

All notebook parameters are defined as dbutils.widgets, which map directly to Databricks Job parameters. To schedule or automate load tests:

  1. Create a Job with the notebook as the task.
  2. Set the widget values as Job parameters. No code edits are needed.
  3. Attach the Job to a single-node cluster with many CPU cores (Locust benefits from parallel workers).
  4. Run on demand or on a schedule for recurring baseline tests.

Step 6: Interpret results

The following table shows targets for good performance:

| Metric | Target | Comment |
| --- | --- | --- |
| P95 latency | < 500ms | Most queries are fast |
| P99 latency | < 1s | Reasonable performance on long-tail queries |
| Success rate | > 99.5% | Low failure rate |
| Latency over time | Stable | No degradation observed during test |
| Queries per second | Meets target | Endpoint can handle expected traffic |

The following results indicate poor performance:

  • P95 > 1s. Indicates queries are too slow for real-time use.
  • P99 > 3s. Latency on long-tail queries will hurt user experience.
  • Success rate < 99%. Too many failures.
  • Increasing latency. Indicates resource exhaustion or memory leak.
  • Rate limiting errors (429). Indicates that higher endpoint capacity is required.

Tradeoff between RPS and latency

The maximum RPS is not the optimal point for production throughput. Latency increases non-linearly as you approach maximum throughput. Operating at maximum RPS often results in 2-5x higher latency compared to operating at 60-70% of maximum capacity.

The following example shows how to analyze the results to find the optimal operating point.

  • The maximum RPS is 480 at 150 concurrent clients.
  • The optimal operating point is 310 RPS at 50 concurrent clients (65% capacity).
  • The latency penalty at the maximum: P95 is 4.3x higher (1.5s vs. 350ms).
  • In this example, the recommendation is to size the endpoint for 480 RPS capacity and operate at ~310 RPS.

| Concurrency | P50 | P95 | P99 | RPS | Success | Capacity |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 80ms | 120ms | 150ms | 45 | 100% | 10% |
| 10 | 85ms | 140ms | 180ms | 88 | 100% | 20% |
| 20 | 95ms | 180ms | 250ms | 165 | 99.8% | 35% |
| 50 | 150ms | 350ms | 500ms | 310 | 99.2% | 65% ← Sweet spot |
| 100 | 250ms | 800ms | 1.2s | 420 | 97.5% | 90% ⚠️ Approaching max |
| 150 | 450ms | 1.5s | 2.5s | 480 | 95.0% | 100% ❌ Maximum RPS |

Operating at the maximum RPS can lead to the following issues:

  • Latency degradation. In the example, P95 is 350ms at 65% capacity but is 1.5s at 100% capacity.
  • No room to accommodate traffic bursts or spikes. At 100% capacity, any spike causes a timeout. At 65% capacity, a 50% spike in traffic can be handled without a problem.
  • Increased error rates. In the example, the success rate is 99.2% at 65% capacity but 95.0% — a 5% failure rate — at 100% capacity.
  • Risk of resource exhaustion. At maximum load, queues increase, memory pressure increases, connection pools start to saturate, and the recovery time after incidents increases.

The following table shows recommended operating points for different use cases.

| Use case | Target capacity | Rationale |
| --- | --- | --- |
| Latency-sensitive (search, chat) | 50-60% of max | Prioritize low P95/P99 latency |
| Balanced (recommendations) | 60-70% of max | Good balance of cost and latency |
| Cost-optimized (batch jobs) | 70-80% of max | Acceptable higher latency |
| Not recommended | > 85% of max | Latency spikes, no burst capacity |

Helper functions for calculating operating point and endpoint size

The following code plots QPS vs. P95 latency. In the plot, look for the point where the curve starts to bend sharply upward; the optimal operating point is just before that knee.

Python
import matplotlib.pyplot as plt

# Plot QPS vs. P95 latency
qps_values = [45, 88, 165, 310, 420, 480]
p95_latency = [120, 140, 180, 350, 800, 1500]

plt.plot(qps_values, p95_latency, marker='o')
plt.axvline(x=310, color='green', linestyle='--', label='Optimal (65% capacity)')
plt.axvline(x=480, color='red', linestyle='--', label='Maximum (100% capacity)')
plt.xlabel('Queries Per Second (QPS)')
plt.ylabel('P95 Latency (ms)')
plt.title('QPS vs. Latency: Finding the Sweet Spot')
plt.legend()
plt.grid(True)
plt.show()
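The notebook performs its own sizing calculation (see Step 7). The following standalone sketch shows one way to pick an operating point and a rough capacity target from results like the example table above; the 500 ms SLA, the 65% operating fraction, and the expected peak RPS are placeholder assumptions based on the guidance on this page.

Python
# A minimal sketch of operating-point selection and rough sizing, using the
# example results above. The SLA, operating fraction, and expected peak RPS
# are placeholder assumptions; substitute your own measurements and targets.
results = [
    {"concurrency": 5,   "rps": 45,  "p95_ms": 120},
    {"concurrency": 10,  "rps": 88,  "p95_ms": 140},
    {"concurrency": 20,  "rps": 165, "p95_ms": 180},
    {"concurrency": 50,  "rps": 310, "p95_ms": 350},
    {"concurrency": 100, "rps": 420, "p95_ms": 800},
    {"concurrency": 150, "rps": 480, "p95_ms": 1500},
]

p95_sla_ms = 500          # latency SLA for P95
operating_fraction = 0.65 # operate at roughly 65% of maximum capacity
expected_peak_rps = 200   # expected peak production traffic (placeholder)

max_rps = max(r["rps"] for r in results)

# Operating point: the highest-throughput level that still meets the SLA.
operating_point = max(
    (r for r in results if r["p95_ms"] <= p95_sla_ms),
    key=lambda r: r["rps"],
)

# Size the endpoint so expected peak traffic lands near the operating fraction.
required_capacity_rps = expected_peak_rps / operating_fraction

print(f"Maximum observed RPS: {max_rps}")
print(f"Operating point: {operating_point['rps']} RPS at "
      f"{operating_point['concurrency']} clients (P95 {operating_point['p95_ms']} ms, "
      f"{operating_point['rps'] / max_rps:.0%} of max)")
print(f"Target capacity for ~{expected_peak_rps} RPS peak: ~{required_capacity_rps:.0f} RPS")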

Identify the embedding model bottleneck

If your index uses managed embeddings, the load test notebook captures per-component timing through the debug_level=1 parameter on each query. The results table includes:

  • ann_time — time spent on approximate nearest neighbor search
  • embedding_gen_time — time spent generating the query embedding on the model serving endpoint
  • reranker_time — time spent on reranking (if enabled)
  • response_time — total end-to-end response time

If embedding_gen_time is consistently large relative to ann_time, the embedding endpoint is the bottleneck, not the vector search endpoint. Common causes:

  • The embedding model serving endpoint has Scale to zero enabled. Disable it for production load testing. See Avoid scale-to-zero for production.
  • The embedding endpoint does not have enough provisioned concurrency for the query rate you are testing.
  • The embedding model endpoint is shared with other workloads. Use a dedicated endpoint for load testing.

tip

To isolate vector search performance from embedding model performance, switch to self-managed embeddings for load testing. Set the embedding_column parameter and pass precomputed vectors instead of text queries. This removes embedding latency from the measurement entirely.
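As a quick check, the following sketch (assuming the per-query debug timings have been collected into a pandas DataFrame with the component columns listed above) estimates how much of the end-to-end latency is spent on embedding generation versus ANN search.

Python
import pandas as pd

# Placeholder timings in seconds; in practice, load the per-query debug
# metrics collected by the load test into this DataFrame.
timings = pd.DataFrame({
    "ann_time": [0.04, 0.05, 0.04],
    "embedding_gen_time": [0.30, 0.28, 0.35],
    "response_time": [0.38, 0.36, 0.42],
})

embedding_share = (timings["embedding_gen_time"] / timings["response_time"]).mean()
ann_share = (timings["ann_time"] / timings["response_time"]).mean()

print(f"Embedding generation: {embedding_share:.0%} of end-to-end latency")
print(f"ANN search: {ann_share:.0%} of end-to-end latency")

# A consistently high embedding share (for example, above ~50%) points to the
# embedding endpoint, not vector search, as the bottleneck.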

Step 7: Size your endpoint

Use the notebook's recommendation

After analyzing results, the notebook asks you to:

  1. Select the row that best meets your latency requirements.
  2. Input your application's desired RPS.

The notebook then displays a recommended endpoint size. It calculates the required capacity based on the following:

  • Your target RPS
  • Observed latency at different concurrency levels
  • Success rate thresholds
  • Safety margin (typically 2x expected peak load)

Scaling considerations

Standard endpoints:

  • Scale up automatically to support index size
  • Scale up manually to support throughput
  • Scale down automatically when indexes are deleted
  • Scale down manually to reduce capacity

Storage-optimized endpoints:

  • Scale up automatically to support index size
  • Scale down automatically when indexes are deleted

Step 8: Validate the final configuration

After updating your endpoint configuration:

  1. Wait for the endpoint to be ready. This can take several minutes.
  2. Run the final validation test in the notebook.
  3. Confirm performance meets your requirements:
    • RPS ≥ target throughput
    • P95 latency meets SLA
    • Success rate > 99.5%
    • No sustained errors

If validation fails, try the following:

  • Increase endpoint capacity
  • Optimize query complexity
  • Review filter performance
  • Check embedding endpoint configuration

When to re-test

To maintain performance visibility, it's a good idea to run baseline load tests quarterly. You should also re-test when any of the following applies:

  • You change query patterns or complexity
  • You update the vector search index
  • You modify filter configurations
  • You expect significant traffic increases
  • You deploy new features or optimizations
  • You change from standard to storage-optimized endpoint types

Troubleshooting

All requests fail with ~10ms latency and 240-byte responses

This indicates the service principal is receiving a 401/403 response. Verify:

  1. The service principal has Can Query permissions on the vector search endpoint (not just the index).
  2. The secret scope contains valid service_principal_client_id and service_principal_client_secret keys.
  3. The OAuth secret has not expired.

The notebook includes a connectivity check that catches this before running the full load test.

Running multiple load test jobs on the same cluster

If you run two load test jobs concurrently on the same cluster, one job might receive stale OAuth tokens or experience CPU contention with the other job's Locust workers. For reliable results, run load test jobs one at a time on a dedicated cluster.

Component timing graphs are empty

The component timing graphs (ann_time, embedding_gen_time, reranker_time) require the endpoint to return debug_info in query responses. If these graphs are empty:

  • Verify you are using the fast_vs_load_test_async_load.py script (which parses debug_info from responses) as the locust_script_path.
  • Some endpoint configurations may not return debug_info. Self-managed embedding indexes typically return ann_time and response_time but not embedding_gen_time or reranker_time.

Results table not queryable from a SQL warehouse

The notebook writes results from the cluster's Spark session. If a SQL warehouse shows 0 rows for a table that the notebook reports as populated, the problem might be a Unity Catalog metadata sync delay. Wait a few minutes and retry, or query the table directly from a notebook attached to the same cluster.

Summary of best practices

Test configuration

  • Run tests for at least 5 minutes at peak load.

  • Use OAuth service principals for authentication.

  • Create realistic query payloads that match expected production queries.

  • Test with production-like filters and parameters.

  • Include a warmup period before measuring.

  • Test at multiple concurrency levels.

  • Track P95/P99 latencies, not just averages.

  • Test both cached and uncached performance.

    Python
    # Conservative approach: size the endpoint for UNCACHED performance.
    # run_load_test and calculate_capacity are illustrative placeholders for
    # your own test and sizing helpers.
    uncached_results = run_load_test(diverse_queries, duration=600)
    endpoint_size = calculate_capacity(uncached_results, target_rps=500)

    # Then verify that cached performance is even better
    cached_results = run_load_test(repetitive_queries, duration=300)
    print(f"Cached P95: {cached_results['p95']}ms (bonus performance)")

Query set design

  • Match your test query diversity to real traffic distribution (frequent and rare queries).
  • Use actual queries from logs (anonymized).
  • Include different query complexities.
  • Test both cached and uncached scenarios and track the results separately.
  • Test with expected filter combinations.
  • Use the same parameters that you will use in production. For example, if you use hybrid search in production, include hybrid search queries. Use a similar num_results parameter as in production.
  • Don't use queries that will never occur in production.

Performance optimization

If latencies are too high, try the following:

  1. Use OAuth service principals (not PATs) - up to 100ms improvement
  2. Reduce num_results - Fetching 100 results is slower than fetching 10
  3. Optimize filters - Complex or overly restrictive filters slow down queries
  4. Check the embedding endpoint - Ensure it's not scaled to zero and has enough capacity

If you are hitting rate limits, try the following:

  1. Increase endpoint capacity - Scale up your endpoint
  2. Implement client-side rate limiting - Spread queries out over time
  3. Use connection pooling - Reuse connections
  4. Add retry logic - Use exponential backoff (already built into the Python SDK)

Additional resources