Configure a load test for vector search endpoints
This page provides guidance, example code, and an example notebook for load testing vector search endpoints. Load testing helps you understand the performance and production readiness of a vector search endpoint before it's deployed to production. Load testing can tell you about:
- Latency at different scaling levels
- Throughput limits and bottlenecks (requests per second, latency breakdown)
- Error rates under sustained load
- Resource utilization and capacity planning
For more information about load testing and related concepts, see Load testing for serving endpoints.
Requirements
Before starting these steps, you must have a deployed vector search endpoint and a service principal with Can Query permissions on the endpoint. See Step 1: Set up service principal authentication.
Download and import a copy of the following files and example notebook to your Databricks workspace:
- `input.json`. This is an example of the `input.json` file that specifies the payload that is sent by all concurrent connections to your endpoint. You can have multiple files if needed. If you use the example notebook, this file is generated automatically from the provided input table.
- `fast_vs_load_test_async_load.py`. This script is used by the example notebook for authentication and payload handling.
- The following example notebook, which runs the load tests. For best performance, run this notebook on a cluster with a large number of cores and high memory. The memory is required for queries with pre-generated embeddings, as embeddings are often memory-intensive.
Example notebook and quickstart
Use the following example notebook to get started. It includes all of the steps to run a load test. You must enter a few parameters, such as Databricks secrets, the endpoint name, and so on.
Locust load test notebook
Load testing framework: Locust
Locust is an open-source load testing framework that allows you to do the following:
- Vary the number of concurrent client connections
- Control how fast connections spawn
- Measure endpoint performance throughout the test
- Auto-detect and use all available CPU cores
The example notebook uses the `--processes -1` flag to auto-detect CPU cores and fully utilize them.
If Locust is bottlenecked by the CPU, a message appears in the output.
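To give a sense of what a Locust test looks like, the following is a minimal, hypothetical user class. It is not the script used by the example notebook (`fast_vs_load_test_async_load.py` handles authentication and payload handling for you), and the host, REST path, and token handling shown are assumptions you would need to adapt.

```python
# Minimal, hypothetical Locust user class for illustration only.
# The example notebook's fast_vs_load_test_async_load.py handles auth and payloads for you.
import json
import os
import random

from locust import HttpUser, task, between


class VectorSearchUser(HttpUser):
    host = "https://<workspace-url>"   # placeholder workspace URL
    wait_time = between(0.1, 0.5)      # small pause between requests per simulated client

    def on_start(self):
        # Load the shared query payloads once per simulated client
        with open("input.json") as f:
            self.payloads = [json.loads(line) for line in f]

    @task
    def query_index(self):
        payload = random.choice(self.payloads)
        # Assumed REST path for querying a vector search index; adapt to your workspace
        self.client.post(
            f"/api/2.0/vector-search/indexes/{os.environ['VS_INDEX_NAME']}/query",
            json=payload,
            headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        )
```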
Step 1: Set up service principal authentication
For production-like performance testing, always use OAuth service principal authentication. Service principals provide up to 100ms faster response time and higher request rate limits compared to Personal Access Tokens (PATs).
Create and configure service principal
- Create a Databricks service principal. For instructions, see Add service principals to your account.
- Grant permissions:
  - Navigate to your vector search endpoint page.
  - Click Permissions.
  - Give the service principal Can Query permissions.
- Create an OAuth secret:
  - Go to the service principal details page.
  - Click the Secrets tab.
  - Click Generate secret.
  - Set the lifetime (365 days is recommended for long-term testing).
  - Copy both the Client ID and Secret immediately.
- Store the credentials securely:
  - Create a Databricks secret scope. For instructions, see Tutorial: Create and use a Databricks secret.
  - As shown in the following code example, store the service principal Client ID as `service_principal_client_id` and the OAuth secret as `service_principal_client_secret`.
```python
# In a Databricks notebook: store the secrets with the Databricks SDK.
# (dbutils.secrets can read secrets but cannot write them.)
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.secrets.put_secret("load-test-auth", "service_principal_client_id", string_value="<CLIENT_ID>")
w.secrets.put_secret("load-test-auth", "service_principal_client_secret", string_value="<SECRET>")
```
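To confirm that the secrets are stored and readable, you can fetch them back in a notebook; secret values are redacted in notebook output.

```python
# Verify that the secrets can be read back (values are redacted when printed in a notebook)
client_id = dbutils.secrets.get(scope="load-test-auth", key="service_principal_client_id")
client_secret = dbutils.secrets.get(scope="load-test-auth", key="service_principal_client_secret")
print("Secrets loaded:", bool(client_id and client_secret))
```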
Step 2: Configure your load test
Notebook configuration
In your copy of the example notebook, configure these parameters:
| Parameter | Description | Recommended Value |
|---|---|---|
| | Name of your vector search endpoint | Your endpoint name |
| | Full index name (`catalog.schema.index`) | Your index name |
| `locust_run_time` | Duration for each individual load test | 300-600 seconds (5-10 minutes) |
| | Prefix for CSV output files | |
| | Name of your Databricks secret scope | Your scope name |
Why 5-10 minutes?
A minimum test duration of 5 minutes is critical for the following reasons:
- Initial queries may include cold-start overhead.
- Endpoints need time to reach steady-state performance.
- Auto-scaling of the model serving endpoints (if enabled) takes time to activate.
- Short tests miss throttling behaviors under sustained load.
The following table shows recommended test durations depending on your test goal.
Test type | Test duration | Goals of test |
|---|---|---|
Quick smoke test | 2-3 minutes | Verify basic functionality |
Performance baseline | 5-10 minutes | Reliable steady-state metrics |
Stress testing | 15-30 minutes | Identify resource exhaustion |
Endurance testing | 1-4 hours | Degradation, latency stability |
Step 3: Design your query set
The query set should reflect expected production traffic as closely as possible. Specifically, try to match the expected distribution of queries in terms of content, complexity, and diversity.
- Use realistic queries. Don't use random text such as "test query 1234".
- Match the expected production traffic distribution. If you expect 80% common queries, 15% medium-frequency queries, and 5% infrequent queries, your query set should reflect that distribution.
- Match the type of query you expect to see in production. For example, if you expect production queries to use hybrid search or filters, use those in your query set as well.
Example query using filters:
```json
{
  "query_text": "wireless headphones",
  "num_results": 10,
  "filters": { "brand": "Sony", "noise_canceling": true }
}
```

Example query using hybrid search:

```json
{
  "query_text": "best noise canceling headphones for travel",
  "query_type": "hybrid",
  "num_results": 10
}
```
Query diversity and caching
Vector search endpoints cache several types of query results to improve performance. This caching can affect load test results. For this reason, it's important to pay attention to the diversity of the query set. For example, if you repeatedly send the same set of queries, you're testing the cache, not the actual search performance.
| Use | When | Example |
|---|---|---|
| Identical or few queries | Production traffic repeats the same queries, so cached performance is representative of what users experience. | A product recommendation widget that shows "trending items" - the same query runs thousands of times per hour. |
| Diverse queries | Production traffic varies from user to user, so you need to measure uncached search performance. | An e-commerce search where every user types different product searches. |
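To see how much caching affects your measurements, you can time the same query twice; the repeated call typically benefits from the cache. The following sketch uses the `databricks-vectorsearch` Python client, and the endpoint, index, and column names are placeholders.

```python
# Rough illustration of cache effects: time the same query twice.
# Endpoint, index, and column names are placeholders.
import time
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(endpoint_name="my-vs-endpoint", index_name="catalog.schema.my_index")

def timed_query_ms(text):
    start = time.perf_counter()
    index.similarity_search(query_text=text, columns=["id", "text_column"], num_results=10)
    return (time.perf_counter() - start) * 1000

first = timed_query_ms("wireless headphones")
repeated = timed_query_ms("wireless headphones")  # repeated query may be served from cache
print(f"First call: {first:.0f}ms, repeated call: {repeated:.0f}ms")
```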
For additional recommendations, see Summary of best practices.
Options for creating a query set
The following code examples show three options for creating a diverse query set. There is no one-size-fits-all approach; pick the one that works best for you.
- (Recommended) Random sampling from the index input table. This is a good general starting point.
- Sampling from production logs. This is a good start if you have production logs. Keep in mind that queries typically change over time, so refresh the test set regularly to keep it up to date.
- Generating synthetic queries. This is useful if you don't have production logs or if you are using complex filters.
The following code samples random queries from your index input table.
```python
import pandas as pd

# Read the index input table
input_table = spark.table("catalog.schema.index_input_table").toPandas()

# Sample random rows
n_samples = 1000
if len(input_table) < n_samples:
    print(f"Warning: Only {len(input_table)} rows available, using all")
    sample_queries = input_table
else:
    sample_queries = input_table.sample(n=n_samples, random_state=42)

# Extract the text column (adjust column name as needed)
queries = sample_queries['text_column'].tolist()

# Create query payloads
query_payloads = [{"query_text": q, "num_results": 10} for q in queries]

# Save to input.json
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
print(f"Created {len(query_payloads)} diverse queries from index input table")
```
The following code samples proportionally from production queries.
```python
import pandas as pd

# Sample proportionally from production queries
production_queries = pd.read_csv("queries.csv")

# Take a stratified sample that maintains the frequency distribution
def create_test_set(df):
    # Count how often each query appears in the logs
    df['frequency'] = df.groupby('query_text')['query_text'].transform('count')

    # Stratified sample: 20% high frequency, 30% medium, 50% low
    high_freq = df[df['frequency'] > 100].sample(n=200)
    med_freq = df[df['frequency'].between(10, 100)].sample(n=300)
    low_freq = df[df['frequency'] < 10].sample(n=500)
    return pd.concat([high_freq, med_freq, low_freq])

test_queries = create_test_set(production_queries)

# Convert to query payloads and save to input.json
query_payloads = [{"query_text": q, "num_results": 10} for q in test_queries['query_text']]
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
```
If you don't have production logs yet, you can generate synthetic diverse queries.
```python
# Generate diverse queries programmatically
import random
import pandas as pd

# Define query templates and variations
templates = [
    "find {product} under ${price}",
    "best {product} for {use_case}",
    "{adjective} {product} recommendations",
    "compare {product1} and {product2}",
]
products = ["laptop", "headphones", "monitor", "keyboard", "mouse", "webcam", "speaker"]
prices = ["500", "1000", "1500", "2000"]
use_cases = ["gaming", "work", "travel", "home office", "students"]
adjectives = ["affordable", "premium", "budget", "professional", "portable"]

diverse_queries = []
for _ in range(1000):
    template = random.choice(templates)
    query = template.format(
        product=random.choice(products),
        product1=random.choice(products),
        product2=random.choice(products),
        price=random.choice(prices),
        use_case=random.choice(use_cases),
        adjective=random.choice(adjectives),
    )
    diverse_queries.append(query)

print(f"Generated {len(set(diverse_queries))} unique queries")

# Create query payloads and save to input.json
query_payloads = [{"query_text": q, "num_results": 10} for q in diverse_queries]
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
```
Step 4: Test your payload
Before running the full load test, validate your payload:
- In the Databricks workspace, navigate to your vector search endpoint.
- In the left sidebar, click Serving.
- Select your endpoint.
- Click Use → Query.
- Paste your `input.json` content into the query box.
- Verify that the endpoint returns expected results.
This ensures your load test will measure realistic queries, not error responses.
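If you prefer to validate the payload programmatically instead of through the UI, a minimal sketch using the `databricks-vectorsearch` client might look like the following. The endpoint, index, and column names are placeholders, and the payload fields are assumed to match the `input.json` format used in this guide.

```python
# Sketch: spot-check a few payloads from input.json against the index.
# Endpoint, index, and column names are placeholders.
import json
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(endpoint_name="my-vs-endpoint", index_name="catalog.schema.my_index")

with open("input.json") as f:
    payloads = [json.loads(line) for line in f]

for payload in payloads[:5]:  # check the first few payloads only
    result = index.similarity_search(
        query_text=payload["query_text"],
        columns=["id", "text_column"],
        num_results=payload.get("num_results", 10),
    )
    row_count = result.get("result", {}).get("row_count", 0)
    print(f"{payload['query_text']!r}: {row_count} results")
```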
Step 5: Run the load test
Initial warmup test (30 seconds)
The notebook first runs a 30-second test that does the following:
- Confirms the endpoint is online and responding
- Warms up any caches
- Validates authentication
The results of this warmup test include cold-start overhead, so they shouldn't be used for performance metrics.
Main load test series
The notebook runs a series of tests with increasing client concurrency:
- Start: Low concurrency (for example, 5 concurrent clients)
- Middle: Medium concurrency (for example, 10, 20, or 50 clients)
- End: High concurrency (for example, over 100 clients)
Each test runs for the configured `locust_run_time` (5-10 minutes recommended).
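Conceptually, the series looks like the concurrency sweep sketched below, which shells out to the Locust CLI at increasing user counts. This is only an illustration of the pattern; the example notebook drives Locust for you, and the flag values shown are placeholders.

```python
# Illustrative concurrency sweep using the Locust CLI; the example notebook does this for you.
import subprocess

locust_run_time = "600s"                        # matches the recommended 5-10 minute duration
concurrency_levels = [5, 10, 20, 50, 100, 150]

for users in concurrency_levels:
    subprocess.run(
        [
            "locust",
            "-f", "fast_vs_load_test_async_load.py",  # script provided with the example notebook
            "--headless",                              # run without the web UI
            "--processes", "-1",                       # auto-detect and use all CPU cores
            "-u", str(users),                          # number of concurrent simulated clients
            "-r", str(users),                          # spawn rate: ramp up quickly
            "-t", locust_run_time,                     # per-test duration
            "--csv", f"load_test_{users}_users",       # prefix for CSV result files
        ],
        check=True,
    )
```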
What the notebook measures
The notebook measures and reports the following. (A short example of how the latency percentiles are computed appears after these lists.)
Latency metrics:
- P50 (median): Half of queries are faster than this.
- P95: 95% of queries are faster than this. This is a key SLA metric.
- P99: 99% of queries are faster than this.
- Max: Worst-case latency.
Throughput metrics:
- RPS (requests per second): Successful queries per second.
- Total queries: Number of completed queries.
- Success rate: Percentage of successful queries.
Errors:
- Query failures by type
- Exception messages
- Timeout counts
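The latency percentiles above are standard percentile statistics computed over the per-request latencies. For reference, a minimal example with made-up latency values:

```python
import numpy as np

# Made-up per-request latencies in milliseconds
latencies_ms = np.array([82, 95, 110, 130, 145, 180, 220, 310, 450, 900])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms  Max={latencies_ms.max()}ms")
```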
Step 6: Interpret results
The following table shows targets for good performance:
Metric | Target | Comment |
|---|---|---|
P95 latency | < 500ms | Most queries are fast |
P99 latency | < 1s | Reasonable performance on long-tail queries |
Success rate | > 99.5% | Low failure rate |
Latency over time | Stable | No degradation observed during test |
Queries per second | Meets target | Endpoint can handle expected traffic |
The following results indicate poor performance (a small helper that applies these thresholds is sketched after the list):
- P95 > 1s. Indicates queries are too slow for real-time use.
- P99 > 3s. Latency on long-tail queries will hurt user experience.
- Success rate < 99%. Too many failures.
- Increasing latency. Indicates resource exhaustion or memory leak.
- Rate limiting errors (429). Indicates that higher endpoint capacity is required.
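As a sanity check, you can apply the targets above to each test's summary metrics. The helper below is a hypothetical sketch; the metric names are placeholders for whatever your results table uses.

```python
# Hypothetical helper that applies the performance targets above to one test's results.
def check_results(p95_ms, p99_ms, success_rate, rps, target_rps):
    issues = []
    if p95_ms >= 500:
        issues.append(f"P95 {p95_ms}ms exceeds the 500ms target")
    if p99_ms >= 1000:
        issues.append(f"P99 {p99_ms}ms exceeds the 1s target")
    if success_rate < 0.995:
        issues.append(f"Success rate {success_rate:.1%} is below the 99.5% target")
    if rps < target_rps:
        issues.append(f"RPS {rps} is below the target of {target_rps}")
    return issues or ["All targets met"]

# Example with made-up numbers
print(check_results(p95_ms=350, p99_ms=500, success_rate=0.998, rps=310, target_rps=300))
```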
Tradeoff between RPS and latency
The maximum RPS is not the optimal point for production throughput. Latency increases non-linearly as you approach maximum throughput. Operating at maximum RPS often results in 2-5x higher latency compared to operating at 60-70% of maximum capacity.
The following example shows how to analyze the results to find the optimal operating point.
- The maximum RPS is 480 at 150 concurrent clients.
- The optimal operating point is 310 RPS at 50 concurrent clients (65% capacity).
- The latency penalty at max: P95 is 4.3x higher (1.5s vs. 350ms).
- In this example, the recommendation is to size the endpoint for 480 RPS capacity and operate at ~310 RPS.
Concurrency | P50 | P95 | P99 | RPS | Success | Capacity |
|---|---|---|---|---|---|---|
5 | 80ms | 120ms | 150ms | 45 | 100% | 10% |
10 | 85ms | 140ms | 180ms | 88 | 100% | 20% |
20 | 95ms | 180ms | 250ms | 165 | 99.8% | 35% |
50 | 150ms | 350ms | 500ms | 310 | 99.2% | 65% ← Sweet spot |
100 | 250ms | 800ms | 1.2s | 420 | 97.5% | 90% ⚠️ Approaching max |
150 | 450ms | 1.5s | 2.5s | 480 | 95.0% | 100% ❌ Maximum RPS |
Operating at the maximum RPS can lead to the following issues:
- Latency degradation. In the example, P95 is 350ms at 65% capacity but is 1.5s at 100% capacity.
- No room to accommodate traffic bursts or spikes. At 100% capacity, any spike causes a timeout. At 65% capacity, a 50% spike in traffic can be handled without a problem.
- Increased error rates. In the example, the success rate is 99.2% at 65% capacity but 95.0% — a 5% failure rate — at 100% capacity.
- Risk of resource exhaustion. At maximum load, queues increase, memory pressure increases, connection pools start to saturate, and the recovery time after incidents increases.
The following table shows recommended operating points for different use cases.
Use case | Target capacity | Rationale |
|---|---|---|
Latency-sensitive (search, chat) | 50-60% of max | Prioritize low P95/P99 latency |
Balanced (recommendations) | 60-70% of max | Good balance of cost and latency |
Cost-optimized (batch jobs) | 70-80% of max | Acceptable higher latency |
Not recommended | > 85% of max | Latency spikes, no burst capacity |
Helper functions for calculating operating point and endpoint size
The following code plots QPS vs P95 latency. In the plot, look for the point where the curve starts to bend sharply upward. This is the optimal operating point.
```python
import matplotlib.pyplot as plt

# Plot QPS vs. P95 latency
qps_values = [45, 88, 165, 310, 420, 480]
p95_latency = [120, 140, 180, 350, 800, 1500]

plt.plot(qps_values, p95_latency, marker='o')
plt.axvline(x=310, color='green', linestyle='--', label='Optimal (65% capacity)')
plt.axvline(x=480, color='red', linestyle='--', label='Maximum (100% capacity)')
plt.xlabel('Queries Per Second (QPS)')
plt.ylabel('P95 Latency (ms)')
plt.title('QPS vs. Latency: Finding the Sweet Spot')
plt.legend()
plt.grid(True)
plt.show()
```
The following function estimates how much endpoint capacity to provision from your expected peak production QPS, assuming you operate at the optimal capacity fraction found above.

```python
def calculate_endpoint_size(target_qps, optimal_capacity_percent=0.65):
    """
    Calculate required endpoint capacity.

    Args:
        target_qps: Your expected peak production QPS
        optimal_capacity_percent: Target utilization (default 65%)

    Returns:
        Required maximum endpoint QPS
    """
    required_max_qps = target_qps / optimal_capacity_percent

    # Add 20% safety margin for unexpected bursts
    recommended_max_qps = required_max_qps * 1.2

    return {
        "target_production_qps": target_qps,
        "operate_at_capacity": f"{optimal_capacity_percent*100:.0f}%",
        "required_max_qps": required_max_qps,
        "recommended_max_qps": recommended_max_qps,
        "burst_capacity": f"{(1 - optimal_capacity_percent)*100:.0f}% headroom"
    }

# Example
result = calculate_endpoint_size(target_qps=200)
print(f"Target production QPS: {result['target_production_qps']}")
print(f"Size endpoint for: {result['recommended_max_qps']:.0f} QPS")
print(f"Operate at: {result['operate_at_capacity']}")
print(f"Available burst capacity: {result['burst_capacity']}")

# Output:
# Target production QPS: 200
# Size endpoint for: 369 QPS
# Operate at: 65%
# Available burst capacity: 35% headroom
```
Step 7: Size your endpoint
Use the notebook's recommendation
After analyzing results, the notebook asks you to:
- Select the row that best meets your latency requirements.
- Input your application's desired RPS.
The notebook then displays a recommended endpoint size. It calculates the required capacity based on the following (a simplified sketch of this kind of calculation appears after the list):
- Your target RPS
- Observed latency at different concurrency levels
- Success rate thresholds
- Safety margin (typically 2x expected peak load)
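The following is a simplified, hypothetical version of such a calculation; the notebook's actual logic may differ, and the result format and thresholds shown here are assumptions.

```python
# Hypothetical sizing sketch; the notebook's actual calculation may differ.
def recommend_capacity(observed, target_rps, p95_sla_ms=500, min_success=0.995, safety_factor=2.0):
    """observed: list of dicts with keys concurrency, p95_ms, rps, success_rate."""
    # Keep only concurrency levels that meet the latency and success-rate requirements
    viable = [r for r in observed if r["p95_ms"] <= p95_sla_ms and r["success_rate"] >= min_success]
    if not viable:
        raise ValueError("No tested concurrency level meets the requirements; add capacity and re-test")
    best = max(viable, key=lambda r: r["rps"])  # highest throughput that still meets the SLA
    return {
        "operating_point": best,
        "provision_for_rps": target_rps * safety_factor,  # ~2x safety margin over expected peak load
    }

# Example using rows from the results table in Step 6
observed = [
    {"concurrency": 20, "p95_ms": 180, "rps": 165, "success_rate": 0.998},
    {"concurrency": 50, "p95_ms": 350, "rps": 310, "success_rate": 0.992},
    {"concurrency": 100, "p95_ms": 800, "rps": 420, "success_rate": 0.975},
]
print(recommend_capacity(observed, target_rps=200))
```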
Scaling considerations
Standard endpoints:
- Scale up automatically to support index size
- Scale up manually to support throughput
- Scale down automatically when indexes are deleted
- Scale down manually to reduce capacity
Storage-optimized endpoints:
- Scale up automatically to support index size
- Scale down automatically when indexes are deleted
Step 8: Validate final configuration
After updating your endpoint configuration:
- Wait for the endpoint to be ready. This can take several minutes.
- Run the final validation test in the notebook.
- Confirm performance meets your requirements:
- RPS ≥ target throughput
- P95 latency meets SLA
- Success rate > 99.5%
- No sustained errors
If validation fails, try the following:
- Increase endpoint capacity
- Optimize query complexity
- Review filter performance
- Check embedding endpoint configuration
When to re-test
To maintain performance visibility, it's a good idea to run baseline load tests quarterly. You should also re-test whenever you:
- Change query patterns or complexity
- Update the vector search index
- Modify filter configurations
- Expect significant traffic increases
- Deploy new features or optimizations
- Change from standard to storage-optimized endpoint types
Summary of best practices
Test configuration
- Run tests for at least 5 minutes at peak load.
- Use OAuth service principals for authentication.
- Create realistic query payloads that match expected production queries.
- Test with production-like filters and parameters.
- Include a warmup period before measuring.
- Test at multiple concurrency levels.
- Track P95/P99 latencies, not just averages.
- Test both cached and uncached performance.
```python
# Conservative approach: size the endpoint for UNCACHED performance
uncached_results = run_load_test(diverse_queries, duration=600)
endpoint_size = calculate_capacity(uncached_results, target_rps=500)

# Then verify that cached performance is even better
cached_results = run_load_test(repetitive_queries, duration=300)
print(f"Cached P95: {cached_results['p95']}ms (bonus performance)")
```
Query set design
- Match your test query diversity to real traffic distribution (frequent and rare queries).
- Use actual queries from logs (anonymized).
- Include different query complexities.
- Test both cached and uncached scenarios and track the results separately.
- Test with expected filter combinations.
- Use the same parameters that you will use in production. For example, if you use hybrid search in production, include hybrid search queries. Use a similar `num_results` parameter as in production.
- Don't use queries that will never occur in production.
Performance optimization
If latencies are too high, try the following:
- Use OAuth service principals (not PATs) - up to 100ms improvement
- Reduce `num_results` - fetching 100 results is slower than fetching 10
- Optimize filters - complex or overly restrictive filters slow down queries
- Check the embedding endpoint - ensure that it isn't scaled to zero and that it has enough bandwidth
If you are hitting rate limits, try the following:
- Increase endpoint capacity - Scale up your endpoint
- Implement client-side rate limiting or spread queries out over time (a simple sketch follows this list)
- Use connection pooling - Reuse connections
- Add retry logic - Use exponential backoff (already a part of the Python SDK)
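For example, a simple client-side pacing loop that reuses one HTTP session might look like the following sketch. The URL, token, and rate cap are placeholders.

```python
# Sketch of client-side pacing with a reused HTTP session; URL and token are placeholders.
import time
import requests

MAX_RPS = 50                   # client-side cap, kept below the endpoint's measured capacity
session = requests.Session()   # reuse one connection pool instead of reconnecting per request

def send_query(payload):
    start = time.perf_counter()
    resp = session.post(
        "https://<workspace-url>/api/2.0/vector-search/indexes/<index-name>/query",  # placeholder
        json=payload,
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    # Sleep off any remaining time in this request's 1/MAX_RPS budget
    elapsed = time.perf_counter() - start
    time.sleep(max(0.0, 1.0 / MAX_RPS - elapsed))
    return resp.json()
```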