Configure a load test for vector search endpoints
This page provides guidance, example code, and an example notebook for load testing vector search endpoints. Load testing helps you understand the performance and production readiness of a vector search endpoint before it's deployed to production. Load testing can tell you about:
- Latency at different scaling levels
- Throughput limits and bottlenecks (requests per second, latency breakdown)
- Error rates under sustained load
- Resource utilization and capacity planning
For more information about load testing and related concepts, see Load testing for serving endpoints.
Requirements
Before starting these steps, you must have a deployed vector search endpoint and a service principal with Can Query permissions on the endpoint. See Step 1: Set up service principal authentication.
Download and import a copy of the following files and example notebook to your Databricks workspace:
- `input.json`. This is an example of the `input.json` file that specifies the payload that is sent by all concurrent connections to your endpoint. You can have multiple files if needed. If you use the example notebook, this file is generated automatically from the provided input table.
- `fast_vs_load_test_async_load.py`. This script is used by the example notebook for authentication and payload handling.
- The following example notebook, which runs the load tests. For best performance, run this notebook on a cluster with a large number of cores and high memory. The memory is required for queries with pre-generated embeddings, as embeddings are often memory-intensive.
Example notebook and quickstart
Use the following example notebook to get started. It includes all of the steps to run a load test. You must enter a few parameters, such as Databricks secrets, the endpoint name, and so on.
Locust load test notebook
Load testing framework: Locust
Locust is an open-source load testing framework that allows you to do the following:
- Vary the number of concurrent client connections
- Control how fast connections spawn
- Measure endpoint performance throughout the test
- Auto-detect and use all available CPU cores
The example notebook uses the `--processes -1` flag to auto-detect CPU cores and fully utilize them.
If Locust is bottlenecked by the CPU, a message appears in the output.
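To give a sense of what a Locust test looks like, the following is a minimal, hypothetical user class. It is not the script used by the example notebook (`fast_vs_load_test_async_load.py` handles authentication and payload handling for you), and the host, REST path, and token handling shown are assumptions you would need to adapt.

```python
# Minimal, hypothetical Locust user class for illustration only.
# The example notebook's fast_vs_load_test_async_load.py handles auth and payloads for you.
import json
import os
import random

from locust import HttpUser, task, between


class VectorSearchUser(HttpUser):
    host = "https://<workspace-url>"   # placeholder workspace URL
    wait_time = between(0.1, 0.5)      # small pause between requests per simulated client

    def on_start(self):
        # Load the shared query payloads once per simulated client
        with open("input.json") as f:
            self.payloads = [json.loads(line) for line in f]

    @task
    def query_index(self):
        payload = random.choice(self.payloads)
        # Assumed REST path for querying a vector search index; adapt to your workspace
        self.client.post(
            f"/api/2.0/vector-search/indexes/{os.environ['VS_INDEX_NAME']}/query",
            json=payload,
            headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        )
```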
Step 1: Set up service principal authentication
For production-like performance testing, always use OAuth service principal authentication. Service principals provide up to 100ms faster response time and higher request rate limits compared to Personal Access Tokens (PATs).
Create and configure service principal
- Create a Databricks service principal. For instructions, see Add service principals to your account.
- Grant permissions:
  - Navigate to your vector search endpoint page.
  - Click Permissions.
  - Give the service principal Can Query permissions.
- Create an OAuth secret:
  - Go to the service principal details page.
  - Click the Secrets tab.
  - Click Generate secret.
  - Set the lifetime (365 days is recommended for long-term testing).
  - Copy both the Client ID and Secret immediately.
- Store the credentials securely:
  - Create a Databricks secret scope. For instructions, see Tutorial: Create and use a Databricks secret.
  - As shown in the following code example, store the service principal Client ID as `service_principal_client_id` and the OAuth secret as `service_principal_client_secret`.
```python
# In a Databricks notebook: store the secrets with the Databricks SDK.
# (dbutils.secrets can read secrets but cannot write them.)
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.secrets.put_secret("load-test-auth", "service_principal_client_id", string_value="<CLIENT_ID>")
w.secrets.put_secret("load-test-auth", "service_principal_client_secret", string_value="<SECRET>")
```
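To confirm that the secrets are stored and readable, you can fetch them back in a notebook; secret values are redacted in notebook output.

```python
# Verify that the secrets can be read back (values are redacted when printed in a notebook)
client_id = dbutils.secrets.get(scope="load-test-auth", key="service_principal_client_id")
client_secret = dbutils.secrets.get(scope="load-test-auth", key="service_principal_client_secret")
print("Secrets loaded:", bool(client_id and client_secret))
```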
Step 2: Configure your load test
Notebook configuration
In your copy of the example notebook, configure these parameters:
| Parameter | Description | Recommended Value |
|---|---|---|
| | Name of your vector search endpoint | Your endpoint name |
| | Full index name (`catalog.schema.index`) | Your index name |
| `locust_run_time` | Duration for each individual load test | 300-600 seconds (5-10 minutes) |
| | Prefix for CSV output files | |
| | Name of your Databricks secret scope | Your scope name |
Why 5-10 minutes?
A minimum test duration of 5 minutes is critical for the following reasons:
- Initial queries may include cold-start overhead.
- Endpoints need time to reach steady-state performance.
- Auto-scaling of the model serving endpoints (if enabled) takes time to activate.
- Short tests miss throttling behaviors under sustained load.
The following table shows recommended test durations depending on your test goal.
Test type | Test duration | Goals of test |
|---|---|---|
Quick smoke test | 2-3 minutes | Verify basic functionality |
Performance baseline | 5-10 minutes | Reliable steady-state metrics |
Stress testing | 15-30 minutes | Identify resource exhaustion |
Endurance testing | 1-4 hours | Degradation, latency stability |
Step 3: Design your query set
The query set should reflect expected production traffic as closely as possible. Specifically, try to match the expected distribution of queries in terms of content, complexity, and diversity.
- Use realistic queries. Don't use random text such as "test query 1234".
- Match the expected production traffic distribution. If you expect 80% common queries, 15% medium-frequency queries, and 5% infrequent queries, your query set should reflect that distribution.
- Match the type of query you expect to see in production. For example, if you expect production queries to use hybrid search or filters, use those in your query set as well.
Example query using filters:
```json
{
  "query_text": "wireless headphones",
  "num_results": 10,
  "filters": { "brand": "Sony", "noise_canceling": true }
}
```

Example query using hybrid search:

```json
{
  "query_text": "best noise canceling headphones for travel",
  "query_type": "hybrid",
  "num_results": 10
}
```
Query diversity and caching
Vector search endpoints cache several types of query results to improve performance. This caching can affect load test results. For this reason, it's important to pay attention to the diversity of the query set. For example, if you repeatedly send the same set of queries, you're testing the cache, not the actual search performance.
| Use | When | Example |
|---|---|---|
| Identical or few queries | Production traffic repeats the same queries, so cached performance is representative of what users experience. | A product recommendation widget that shows "trending items" - the same query runs thousands of times per hour. |
| Diverse queries | Production traffic varies from user to user, so you need to measure uncached search performance. | An e-commerce search where every user types different product searches. |
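To see how much caching affects your measurements, you can time the same query twice; the repeated call typically benefits from the cache. The following sketch uses the `databricks-vectorsearch` Python client, and the endpoint, index, and column names are placeholders.

```python
# Rough illustration of cache effects: time the same query twice.
# Endpoint, index, and column names are placeholders.
import time
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(endpoint_name="my-vs-endpoint", index_name="catalog.schema.my_index")

def timed_query_ms(text):
    start = time.perf_counter()
    index.similarity_search(query_text=text, columns=["id", "text_column"], num_results=10)
    return (time.perf_counter() - start) * 1000

first = timed_query_ms("wireless headphones")
repeated = timed_query_ms("wireless headphones")  # repeated query may be served from cache
print(f"First call: {first:.0f}ms, repeated call: {repeated:.0f}ms")
```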
For additional recommendations, see Summary of best practices.
Options for creating a query set
The following code examples show three options for creating a diverse query set. There is no one-size-fits-all approach; pick the one that works best for you.
- (Recommended) Random sampling from the index input table. This is a good general starting point.
- Sampling from production logs. This is a good start if you have production logs. Keep in mind that queries typically change over time, so refresh the test set regularly to keep it up to date.
- Generating synthetic queries. This is useful if you don't have production logs or if you are using complex filters.
The following code samples random queries from your index input table.
```python
import pandas as pd

# Read the index input table
input_table = spark.table("catalog.schema.index_input_table").toPandas()

# Sample random rows
n_samples = 1000
if len(input_table) < n_samples:
    print(f"Warning: Only {len(input_table)} rows available, using all")
    sample_queries = input_table
else:
    sample_queries = input_table.sample(n=n_samples, random_state=42)

# Extract the text column (adjust column name as needed)
queries = sample_queries['text_column'].tolist()

# Create query payloads
query_payloads = [{"query_text": q, "num_results": 10} for q in queries]

# Save to input.json
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
print(f"Created {len(query_payloads)} diverse queries from index input table")
```
The following code samples proportionally from production queries.
```python
import pandas as pd

# Sample proportionally from production queries
production_queries = pd.read_csv("queries.csv")

# Take a stratified sample that maintains the frequency distribution
def create_test_set(df):
    # Count how often each query appears in the logs
    df['frequency'] = df.groupby('query_text')['query_text'].transform('count')

    # Stratified sample: 20% high frequency, 30% medium, 50% low
    high_freq = df[df['frequency'] > 100].sample(n=200)
    med_freq = df[df['frequency'].between(10, 100)].sample(n=300)
    low_freq = df[df['frequency'] < 10].sample(n=500)
    return pd.concat([high_freq, med_freq, low_freq])

test_queries = create_test_set(production_queries)

# Convert to query payloads and save to input.json
query_payloads = [{"query_text": q, "num_results": 10} for q in test_queries['query_text']]
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
```
If you don't have production logs yet, you can generate synthetic diverse queries.
```python
# Generate diverse queries programmatically
import random
import pandas as pd

# Define query templates and variations
templates = [
    "find {product} under ${price}",
    "best {product} for {use_case}",
    "{adjective} {product} recommendations",
    "compare {product1} and {product2}",
]
products = ["laptop", "headphones", "monitor", "keyboard", "mouse", "webcam", "speaker"]
prices = ["500", "1000", "1500", "2000"]
use_cases = ["gaming", "work", "travel", "home office", "students"]
adjectives = ["affordable", "premium", "budget", "professional", "portable"]

diverse_queries = []
for _ in range(1000):
    template = random.choice(templates)
    query = template.format(
        product=random.choice(products),
        product1=random.choice(products),
        product2=random.choice(products),
        price=random.choice(prices),
        use_case=random.choice(use_cases),
        adjective=random.choice(adjectives),
    )
    diverse_queries.append(query)

print(f"Generated {len(set(diverse_queries))} unique queries")

# Create query payloads and save to input.json
query_payloads = [{"query_text": q, "num_results": 10} for q in diverse_queries]
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)
```
Step 4: Test your payload
Before running the full load test, validate your payload:
- In the Databricks workspace, navigate to your vector search endpoint.
- In the left sidebar, click Serving.
- Select your endpoint.
- Click Use → Query.
- Paste your `input.json` content into the query box.
- Verify that the endpoint returns expected results.
This ensures your load test will measure realistic queries, not error responses.
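If you prefer to validate the payload programmatically instead of through the UI, a minimal sketch using the `databricks-vectorsearch` client might look like the following. The endpoint, index, and column names are placeholders, and the payload fields are assumed to match the `input.json` format used in this guide.

```python
# Sketch: spot-check a few payloads from input.json against the index.
# Endpoint, index, and column names are placeholders.
import json
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(endpoint_name="my-vs-endpoint", index_name="catalog.schema.my_index")

with open("input.json") as f:
    payloads = [json.loads(line) for line in f]

for payload in payloads[:5]:  # check the first few payloads only
    result = index.similarity_search(
        query_text=payload["query_text"],
        columns=["id", "text_column"],
        num_results=payload.get("num_results", 10),
    )
    row_count = result.get("result", {}).get("row_count", 0)
    print(f"{payload['query_text']!r}: {row_count} results")
```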
Step 5: Run the load test
Initial warmup test (30 seconds)
The notebook first runs a 30-second test that does the following:
- Confirms the endpoint is online and responding
- Warms up any caches
- Validates authentication
The results of this warmup test include cold-start overhead, so they shouldn't be used for performance metrics.
Main load test series
The notebook runs a series of tests with increasing client concurrency:
- Start: Low concurrency (for example, 5 concurrent clients)
- Middle: Medium concurrency (for example, 10, 20, or 50 clients)
- End: High concurrency (for example, over 100 clients)
Each test runs for the configured `locust_run_time` (5-10 minutes recommended).
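Conceptually, the series looks like the concurrency sweep sketched below, which shells out to the Locust CLI at increasing user counts. This is only an illustration of the pattern; the example notebook drives Locust for you, and the flag values shown are placeholders.

```python
# Illustrative concurrency sweep using the Locust CLI; the example notebook does this for you.
import subprocess

locust_run_time = "600s"                        # matches the recommended 5-10 minute duration
concurrency_levels = [5, 10, 20, 50, 100, 150]

for users in concurrency_levels:
    subprocess.run(
        [
            "locust",
            "-f", "fast_vs_load_test_async_load.py",  # script provided with the example notebook
            "--headless",                              # run without the web UI
            "--processes", "-1",                       # auto-detect and use all CPU cores
            "-u", str(users),                          # number of concurrent simulated clients
            "-r", str(users),                          # spawn rate: ramp up quickly
            "-t", locust_run_time,                     # per-test duration
            "--csv", f"load_test_{users}_users",       # prefix for CSV result files
        ],
        check=True,
    )
```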
What the notebook measures
The notebook measures and reports the following. (A short example of how the latency percentiles are computed appears after these lists.)
Latency metrics:
- P50 (median): Half of queries are faster than this.
- P95: 95% of queries are faster than this. This is a key SLA metric.
- P99: 99% of queries are faster than this.
- Max: Worst-case latency.
Throughput metrics:
- RPS (requests per second): Successful queries per second.
- Total queries: Number of completed queries.
- Success rate: Percentage of successful queries.
Errors:
- Query failures by type
- Exception messages
- Timeout counts
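The latency percentiles above are standard percentile statistics computed over the per-request latencies. For reference, a minimal example with made-up latency values:

```python
import numpy as np

# Made-up per-request latencies in milliseconds
latencies_ms = np.array([82, 95, 110, 130, 145, 180, 220, 310, 450, 900])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms  Max={latencies_ms.max()}ms")
```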
Step 6: Interpret results
The following table shows targets for good performance:
Metric | Target | Comment |
|---|---|---|
P95 latency | < 500ms | Most queries are fast |
P99 latency | < 1s | Reasonable performance on long-tail queries |
Success rate | > 99.5% | Low failure rate |
Latency over time | Stable | No degradation observed during test |
Queries per second | Meets target | Endpoint can handle expected traffic |
The following results indicate poor performance (a small helper that applies these thresholds is sketched after the list):
- P95 > 1s. Indicates queries are too slow for real-time use.
- P99 > 3s. Latency on long-tail queries will hurt user experience.
- Success rate < 99%. Too many failures.
- Increasing latency. Indicates resource exhaustion or memory leak.
- Rate limiting errors (429). Indicates that higher endpoint capacity is required.
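As a sanity check, you can apply the targets above to each test's summary metrics. The helper below is a hypothetical sketch; the metric names are placeholders for whatever your results table uses.

```python
# Hypothetical helper that applies the performance targets above to one test's results.
def check_results(p95_ms, p99_ms, success_rate, rps, target_rps):
    issues = []
    if p95_ms >= 500:
        issues.append(f"P95 {p95_ms}ms exceeds the 500ms target")
    if p99_ms >= 1000:
        issues.append(f"P99 {p99_ms}ms exceeds the 1s target")
    if success_rate < 0.995:
        issues.append(f"Success rate {success_rate:.1%} is below the 99.5% target")
    if rps < target_rps:
        issues.append(f"RPS {rps} is below the target of {target_rps}")
    return issues or ["All targets met"]

# Example with made-up numbers
print(check_results(p95_ms=350, p99_ms=500, success_rate=0.998, rps=310, target_rps=300))
```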
Tradeoff between RPS and latency
The maximum RPS is not the optimal point for production throughput. Latency increases non-linearly as you approach maximum throughput. Operating at maximum RPS often results in 2-5x higher latency compared to operating at 60-70% of maximum capacity.
The following example shows how to analyze the results to find the optimal operating point.
- The maximum RPS is 480 at 150 concurrent clients.
- The optimal operating point is 310 RPS at 50 concurrent clients (65% capacity).
- The latency penalty at max: P95 is 4.3x higher (1.5s vs. 350ms).
- In this example, the recommendation is to size the endpoint for 480 RPS capacity and operate at ~310 RPS.
Concurrency | P50 | P95 | P99 | RPS | Success | Capacity |
|---|---|---|---|---|---|---|
5 | 80ms | 120ms | 150ms | 45 | 100% | 10% |
10 | 85ms | 140ms | 180ms | 88 | 100% | 20% |
20 | 95ms | 180ms | 250ms | 165 | 99.8% | 35% |
50 | 150ms | 350ms | 500ms | 310 | 99.2% | 65% ← Sweet spot |
100 | 250ms | 800ms | 1.2s | 420 | 97.5% | 90% ⚠️ Approaching max |
150 | 450ms | 1.5s | 2.5s | 480 | 95.0% | 100% ❌ Maximum RPS |
Operating at the maximum RPS can lead to the following issues:
- Latency degradation. In the example, P95 is 350ms at 65% capacity but is 1.5s at 100% capacity.
- No room to accommodate traffic bursts or spikes. At 100% capacity, any spike causes a timeout. At 65% capacity, a 50% spike in traffic can be handled without a problem.
- Increased error rates. In the example, the success rate is 99.2% at 65% capacity but 95.0% — a 5% failure rate — at 100% capacity.
- Risk of resource exhaustion. At maximum load, queues increase, memory pressure increases, connection pools start to saturate, and the recovery time after incidents increases.
The following table shows recommended operating points for different use cases.
Use case | Target capacity | Rationale |
|---|---|---|
Latency-sensitive (search, chat) | 50-60% of max | Prioritize low P95/P99 latency |
Balanced (recommendations) | 60-70% of max | Good balance of cost and latency |
Cost-optimized (batch jobs) | 70-80% of max | Acceptable higher latency |
Not recommended | > 85% of max | Latency spikes, no burst capacity |
Helper functions for calculating operating point and endpoint size
The following code plots QPS vs P95 latency. In the plot, look for the point where the curve starts to bend sharply upward. This is the optimal operating point.
```python
import matplotlib.pyplot as plt

# Plot QPS vs. P95 latency
qps_values = [45, 88, 165, 310, 420, 480]
p95_latency = [120, 140, 180, 350, 800, 1500]

plt.plot(qps_values, p95_latency, marker='o')
plt.axvline(x=310, color='green', linestyle='--', label='Optimal (65% capacity)')
plt.axvline(x=480, color='red', linestyle='--', label='Maximum (100% capacity)')
plt.xlabel('Queries Per Second (QPS)')
plt.ylabel('P95 Latency (ms)')
plt.title('QPS vs. Latency: Finding the Sweet Spot')
plt.legend()
plt.grid(True)
plt.show()
```
The following function estimates how much endpoint capacity to provision from your expected peak production QPS, assuming you operate at the optimal capacity fraction found above.

```python
def calculate_endpoint_size(target_qps, optimal_capacity_percent=0.65):
    """
    Calculate required endpoint capacity.

    Args:
        target_qps: Your expected peak production QPS
        optimal_capacity_percent: Target utilization (default 65%)

    Returns:
        Required maximum endpoint QPS
    """
    required_max_qps = target_qps / optimal_capacity_percent

    # Add 20% safety margin for unexpected bursts
    recommended_max_qps = required_max_qps * 1.2

    return {
        "target_production_qps": target_qps,
        "operate_at_capacity": f"{optimal_capacity_percent*100:.0f}%",
        "required_max_qps": required_max_qps,
        "recommended_max_qps": recommended_max_qps,
        "burst_capacity": f"{(1 - optimal_capacity_percent)*100:.0f}% headroom"
    }

# Example
result = calculate_endpoint_size(target_qps=200)
print(f"Target production QPS: {result['target_production_qps']}")
print(f"Size endpoint for: {result['recommended_max_qps']:.0f} QPS")
print(f"Operate at: {result['operate_at_capacity']}")
print(f"Available burst capacity: {result['burst_capacity']}")

# Output:
# Target production QPS: 200
# Size endpoint for: 369 QPS
# Operate at: 65%
# Available burst capacity: 35% headroom
```
Step 7: Size your endpoint
Use the notebook's recommendation
After analyzing results, the notebook asks you to:
- Select the row that best meets your latency requirements.
- Input your application's desired RPS.
The notebook then displays a recommended endpoint size. It calculates the required capacity based on the following (a simplified sketch of this kind of calculation appears after the list):
- Your target RPS
- Observed latency at different concurrency levels
- Success rate thresholds
- Safety margin (typically 2x expected peak load)
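The following is a simplified, hypothetical version of such a calculation; the notebook's actual logic may differ, and the result format and thresholds shown here are assumptions.

```python
# Hypothetical sizing sketch; the notebook's actual calculation may differ.
def recommend_capacity(observed, target_rps, p95_sla_ms=500, min_success=0.995, safety_factor=2.0):
    """observed: list of dicts with keys concurrency, p95_ms, rps, success_rate."""
    # Keep only concurrency levels that meet the latency and success-rate requirements
    viable = [r for r in observed if r["p95_ms"] <= p95_sla_ms and r["success_rate"] >= min_success]
    if not viable:
        raise ValueError("No tested concurrency level meets the requirements; add capacity and re-test")
    best = max(viable, key=lambda r: r["rps"])  # highest throughput that still meets the SLA
    return {
        "operating_point": best,
        "provision_for_rps": target_rps * safety_factor,  # ~2x safety margin over expected peak load
    }

# Example using rows from the results table in Step 6
observed = [
    {"concurrency": 20, "p95_ms": 180, "rps": 165, "success_rate": 0.998},
    {"concurrency": 50, "p95_ms": 350, "rps": 310, "success_rate": 0.992},
    {"concurrency": 100, "p95_ms": 800, "rps": 420, "success_rate": 0.975},
]
print(recommend_capacity(observed, target_rps=200))
```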
Scaling considerations
Standard endpoints:
- Scale up automatically to support index size
- Scale up manually to support throughput
- Scale down automatically when indexes are deleted
- Scale down manually to reduce capacity
Storage-optimized endpoints:
- Scale up automatically to support index size
- Scale down automatically when indexes are deleted
Step 8: Validate final configuration
After updating your endpoint configuration:
- Wait for the endpoint to be ready. This can take several minutes.
- Run the final validation test in the notebook.
- Confirm performance meets your requirements:
- RPS ≥ target throughput
- P95 latency meets SLA
- Success rate > 99.5%
- No sustained errors
If validation fails, try the following:
- Increase endpoint capacity
- Optimize query complexity
- Review filter performance
- Check embedding endpoint configuration
When to re-test
To maintain performance visibility, it's a good idea to run baseline load tests quarterly. You should also re-test whenever you:
- Change query patterns or complexity
- Update the vector search index
- Modify filter configurations
- Expect significant traffic increases
- Deploy new features or optimizations
- Change from standard to storage-optimized endpoint types
Summary of best practices
Test configuration
- Run tests for at least 5 minutes at peak load.
- Use OAuth service principals for authentication.
- Create realistic query payloads that match expected production queries.
- Test with production-like filters and parameters.
- Include a warmup period before measuring.
- Test at multiple concurrency levels.
- Track P95/P99 latencies, not just averages.
- Test both cached and uncached performance.
```python
# Conservative approach: size the endpoint for UNCACHED performance
uncached_results = run_load_test(diverse_queries, duration=600)
endpoint_size = calculate_capacity(uncached_results, target_rps=500)

# Then verify that cached performance is even better
cached_results = run_load_test(repetitive_queries, duration=300)
print(f"Cached P95: {cached_results['p95']}ms (bonus performance)")
```
Query set design
- Match your test query diversity to real traffic distribution (frequent and rare queries).
- Use actual queries from logs (anonymized).
- Include different query complexities.
- Test both cached and uncached scenarios and track the results separately.
- Test with expected filter combinations.
- Use the same parameters that you will use in production. For example, if you use hybrid search in production, include hybrid search queries. Use a similar `num_results` parameter as in production.
- Don't use queries that will never occur in production.
Performance optimization
If latencies are too high, try the following:
- Use OAuth service principals (not PATs) - up to 100ms improvement
- Reduce `num_results` - fetching 100 results is slower than fetching 10
- Optimize filters - complex or overly restrictive filters slow down queries
- Check the embedding endpoint - ensure that it isn't scaled to zero and that it has enough bandwidth
If you are hitting rate limits, try the following:
- Increase endpoint capacity - Scale up your endpoint
- Implement client-side rate limiting or spread queries out over time (a simple sketch follows this list)
- Use connection pooling - Reuse connections
- Add retry logic - Use exponential backoff (already a part of the Python SDK)
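For example, a simple client-side pacing loop that reuses one HTTP session might look like the following sketch. The URL, token, and rate cap are placeholders.

```python
# Sketch of client-side pacing with a reused HTTP session; URL and token are placeholders.
import time
import requests

MAX_RPS = 50                   # client-side cap, kept below the endpoint's measured capacity
session = requests.Session()   # reuse one connection pool instead of reconnecting per request

def send_query(payload):
    start = time.perf_counter()
    resp = session.post(
        "https://<workspace-url>/api/2.0/vector-search/indexes/<index-name>/query",  # placeholder
        json=payload,
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    # Sleep off any remaining time in this request's 1/MAX_RPS budget
    elapsed = time.perf_counter() - start
    time.sleep(max(0.0, 1.0 / MAX_RPS - elapsed))
    return resp.json()
```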