
Configure a load test for vector search endpoints

This page provides guidance, example code, and an example notebook for load testing vector search endpoints. Load testing helps you understand the performance and production readiness of a vector search endpoint before it's deployed to production. Load testing can tell you about:

  • Latency at different scaling levels
  • Throughput limits and bottlenecks (requests per second, latency breakdown)
  • Error rates under sustained load
  • Resource utilization and capacity planning

For more information about load testing and related concepts, see Load testing for serving endpoints.

Requirements

Before starting these steps, you must have a deployed vector search endpoint and a service principal with Can Query permissions on the endpoint. See Step 1: Set up service principal authentication.

Download and import a copy of the following files and example notebook to your Databricks workspace:

  • input.json. This example input.json file specifies the payload that every concurrent connection sends to your endpoint. You can use multiple payload files if needed; a sketch of a possible payload format appears after this list. If you use the example notebook, this file is generated automatically from the provided input table.
  • fast_vs_load_test_async_load.py. This script is used by the example notebook for authentication and payload handling.
  • The following example notebook, which runs the load tests. For best performance, run this notebook on a cluster with many cores and high memory. The extra memory is needed for queries that use pre-generated embeddings, which can be memory-intensive.
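
The following is a minimal sketch of what a payload file might contain, with one query object per line (the format written by the sampling code in Step 3: Design your query set). The queries themselves are placeholders; the fields match the query examples later on this page.

JSON
{"query_text": "wireless headphones", "num_results": 10}
{"query_text": "noise canceling headphones for travel", "num_results": 10}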

Example notebook and quickstart

Use the following example notebook to get started. It includes all of the steps to run a load test. You must enter a few parameters, such as Databricks secrets, the endpoint name, and so on.

Locust load test notebook

Load testing framework: Locust

Locust is an open-source load testing framework that allows you to do the following:

  • Vary the number of concurrent client connections
  • Control how fast connections spawn
  • Measure endpoint performance throughout the test
  • Auto-detect and use all available CPU cores

The example notebook uses the --processes -1 flag to auto-detect CPU cores and fully utilize them.

If Locust is bottlenecked by the CPU, a message appears in the output.
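
The example notebook drives Locust for you, but the following minimal locustfile sketch illustrates the pattern. The REST path, placeholder values, and token handling are illustrative assumptions, not the notebook's exact implementation (authentication and payload handling live in fast_vs_load_test_async_load.py).

Python
# locustfile.py: minimal sketch of a Locust user that queries a vector search endpoint.
import json
import os
import random

from locust import HttpUser, constant, task

# Load the query payloads once (one JSON object per line in input.json).
with open("input.json") as f:
    PAYLOADS = [json.loads(line) for line in f if line.strip()]

class VectorSearchUser(HttpUser):
    host = "https://<workspace-url>"   # placeholder workspace URL
    wait_time = constant(0)            # send requests back to back

    @task
    def query_index(self):
        payload = random.choice(PAYLOADS)
        self.client.post(
            "/api/2.0/vector-search/indexes/<catalog.schema.index>/query",
            json=payload,
            # OAuth access token for the service principal; obtaining and refreshing
            # it is handled by the example script.
            headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
            name="vector_search_query",   # group all requests under one Locust entry
        )

A sketch like this runs headless with, for example, locust -f locustfile.py --headless -u 50 -r 10 --run-time 300s --processes -1 --csv load_test_.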

Step 1: Set up service principal authentication

important

For production-like performance testing, always use OAuth service principal authentication. Service principals provide up to 100ms faster response time and higher request rate limits compared to Personal Access Tokens (PATs).

Create and configure service principal

  1. Create a Databricks service principal. For instructions, see Add service principals to your account.

  2. Grant permissions:

    • Navigate to your vector search endpoint page.
    • Click Permissions.
    • Give the service principal Can Query permissions.
  3. Create OAuth secret.

    • Go to the service principal details page.
    • Click the Secrets tab.
    • Click Generate secret.
    • Set the lifetime (365 days is recommended for long-term testing).
    • Copy both the Client ID and Secret immediately.
  4. Store credentials securely.

    • Create a Databricks secret scope. For instructions, see Tutorial: Create and use a Databricks secret.
    • As shown in the following code example, store the service principal Client ID as service_principal_client_id and store the OAuth secret as service_principal_client_secret.
    Python
    # In a Databricks notebook, using the Databricks SDK
    # (dbutils.secrets does not support writing secret values)
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    w.secrets.put_secret("load-test-auth", "service_principal_client_id", string_value="<CLIENT_ID>")
    w.secrets.put_secret("load-test-auth", "service_principal_client_secret", string_value="<SECRET>")

Step 2: Configure your load test

Notebook configuration

In your copy of the example notebook, configure these parameters:

| Parameter | Description | Recommended value |
| --- | --- | --- |
| endpoint_name | Name of your vector search endpoint | Your endpoint name |
| index_name | Full index name (catalog.schema.index) | Your index name |
| locust_run_time | Duration of each individual load test | 300-600 seconds (5-10 minutes) |
| csv_output_prefix | Prefix for CSV output files | load_test_ |
| secret_scope_name | Name of your Databricks secret scope | Your scope name |

Why 5-10 minutes?

A minimum test duration of 5 minutes is critical.

  • Initial queries may include cold-start overhead.
  • Endpoints need time to reach steady-state performance.
  • Auto-scaling of the model serving endpoints (if enabled) takes time to activate.
  • Short tests miss throttling behaviors under sustained load.

The following table shows recommended test durations depending on your test goal.

| Test type | Test duration | Goals of test |
| --- | --- | --- |
| Quick smoke test | 2-3 minutes | Verify basic functionality |
| Performance baseline | 5-10 minutes | Reliable steady-state metrics |
| Stress testing | 15-30 minutes | Identify resource exhaustion |
| Endurance testing | 1-4 hours | Degradation, latency stability |

Step 3: Design your query set

The query set should reflect the expected production traffic as closely as possible. Specifically, try to match the expected distribution of queries in terms of content, complexity, and diversity.

  • Use realistic queries. Don't use random text such as "test query 1234".

  • Match the expected production traffic distribution. If you expect 80% common queries, 15% medium-frequency queries, and 5% infrequent queries, your query set should reflect that distribution.

  • Match the type of query you expect to see in production. For example, if you expect production queries to use hybrid search or filters, you should also use those in your query set.

    Example query using filters:

    JSON
    {
      "query_text": "wireless headphones",
      "num_results": 10,
      "filters": { "brand": "Sony", "noise_canceling": true }
    }

    Example query using hybrid search:

    JSON
    {
      "query_text": "best noise canceling headphones for travel",
      "query_type": "hybrid",
      "num_results": 10
    }

Query diversity and caching

Vector search endpoints cache several types of query results to improve performance. This caching can affect load test results. For this reason, it's important to pay attention to the diversity of the query set. For example, if you repeatedly send the same set of queries, you're testing the cache, not the actual search performance.

| Use | When | Example |
| --- | --- | --- |
| Identical or few queries | Your production traffic has high query repetition (for example, "popular products"); you're testing cache effectiveness specifically; your application benefits from caching (for example, dashboards with fixed queries); you want to measure best-case cached performance | A product recommendation widget that shows "trending items": the same query runs thousands of times per hour. |
| Diverse queries | Your production traffic has unique user queries (for example, search engines or chatbots); you want to measure worst-case uncached performance; you want to test index scan performance, not cache performance; queries have high cardinality (millions of possible variations) | An e-commerce search where every user types different product searches. |

For additional recommendations, see Summary of best practices.

Options for creating a query set

There are three options for creating a diverse query set. There is no one-size-fits-all approach; pick the one that works best for you.

  • (Recommended) Random sampling from the index input table. This is a good general starting point.
  • Sampling from production logs. This is a good start if you have production logs. Keep in mind that queries typically change over time, so refresh the test set regularly to keep it up to date.
  • Generating synthetic queries. This is useful if you don't have production logs or if you are using complex filters.

The following code samples random queries from your index input table.

Python
import pandas as pd

# Read the index input table
input_table = spark.table("catalog.schema.index_input_table").toPandas()

# Sample random rows
n_samples = 1000
if len(input_table) < n_samples:
    print(f"Warning: Only {len(input_table)} rows available, using all")
    sample_queries = input_table
else:
    sample_queries = input_table.sample(n=n_samples, random_state=42)

# Extract the text column (adjust column name as needed)
queries = sample_queries['text_column'].tolist()

# Create query payloads
query_payloads = [{"query_text": q, "num_results": 10} for q in queries]

# Save to input.json (one JSON object per line)
pd.DataFrame(query_payloads).to_json("input.json", orient="records", lines=True)

print(f"Created {len(query_payloads)} diverse queries from index input table")

Step 4: Test your payload

Before running the full load test, validate your payload:

  1. In the Databricks workspace, navigate to your vector search endpoint.
  2. In the left sidebar, click Serving.
  3. Select your endpoint.
  4. Click Use > Query.
  5. Paste your input.json content into the query box.
  6. Verify the endpoint returns expected results.

This ensures your load test will measure realistic queries, not error responses.
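
You can also validate a payload programmatically with the Vector Search Python client. The following is a minimal sketch; the endpoint, index, and column names are placeholders.

Python
# Minimal sketch: send one payload from input.json through the client before load testing.
import json

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()   # uses notebook authentication by default
index = client.get_index(
    endpoint_name="<endpoint-name>",
    index_name="catalog.schema.index",
)

with open("input.json") as f:
    payload = json.loads(f.readline())

results = index.similarity_search(
    query_text=payload["query_text"],
    columns=["id", "text_column"],   # placeholder columns to return
    num_results=payload.get("num_results", 10),
)
print(results)   # verify that rows come back and look reasonable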

Step 5: Run the load test

Initial warmup test (30 seconds)

The notebook first runs a 30-second test that does the following:

  • Confirms the endpoint is online and responding
  • Warms up any caches
  • Validates authentication

The results of this warmup test include cold-start overhead, so they shouldn't be used for performance metrics.

Main load test series

The notebook runs a series of tests with increasing client concurrency:

  • Start: Low concurrency (for example, 5 concurrent clients)
  • Middle: Medium concurrency (for example, 10, 20, or 50 clients)
  • End: High concurrency (for example, over 100 clients)

Each test runs for the configured locust_run_time (5-10 minutes recommended).
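
Conceptually, the series is a loop over increasing user counts. The following sketch shows one way to run such a sweep with the Locust CLI; the notebook's actual implementation may differ, and locustfile.py refers to the sketch shown earlier on this page.

Python
# Sketch of a concurrency sweep using standard Locust CLI flags.
import subprocess

locust_run_time = 300            # seconds; matches the locust_run_time parameter
csv_output_prefix = "load_test_"

for users in [5, 10, 20, 50, 100, 150]:
    subprocess.run(
        [
            "locust", "-f", "locustfile.py",
            "--headless",
            "--users", str(users),
            "--spawn-rate", str(max(1, users // 10)),
            "--run-time", f"{locust_run_time}s",
            "--processes", "-1",                       # use all CPU cores
            "--csv", f"{csv_output_prefix}{users}",    # one set of CSV files per level
        ],
        check=True,
    )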

What the notebook measures

The notebook measures and reports the following:

Latency metrics:

  • P50 (median): Half of queries are faster than this.
  • P95: 95% of queries are faster than this. This is a key SLA metric.
  • P99: 99% of queries are faster than this.
  • Max: Worst-case latency.

Throughput metrics:

  • RPS (requests per second): Successful queries per second.
  • Total queries: Number of completed queries.
  • Success rate: Percentage of successful queries.

Errors:

  • Query failures by type
  • Exception messages
  • Timeout counts
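
Locust reports these metrics for you. As a reference for how the percentiles relate to raw measurements, the following sketch computes them from a list of per-request latencies (the latency values are made-up example data).

Python
import numpy as np

# Made-up per-request latencies in milliseconds.
latencies_ms = np.array([82, 95, 110, 130, 145, 180, 210, 350, 900, 1500])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms  Max={latencies_ms.max()}ms")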

Step 6: Interpret results

The following table shows targets for good performance:

| Metric | Target | Comment |
| --- | --- | --- |
| P95 latency | < 500ms | Most queries are fast |
| P99 latency | < 1s | Reasonable performance on long-tail queries |
| Success rate | > 99.5% | Low failure rate |
| Latency over time | Stable | No degradation observed during the test |
| Queries per second | Meets target | Endpoint can handle expected traffic |

The following results indicate poor performance:

  • P95 > 1s. Indicates queries are too slow for real-time use.
  • P99 > 3s. Latency on long-tail queries will hurt user experience.
  • Success rate < 99%. Too many failures.
  • Increasing latency. Indicates resource exhaustion or memory leak.
  • Rate limiting errors (429). Indicates that higher endpoint capacity is required.

Tradeoff between RPS and latency

The maximum RPS is not the optimal point for production throughput. Latency increases non-linearly as you approach maximum throughput. Operating at maximum RPS often results in 2-5x higher latency compared to operating at 60-70% of maximum capacity.

The following example shows how to analyze the results to find the optimal operating point.

  • The maximum RPS is 480 at 150 concurrent clients.
  • The optimal operating point is 310 RPS at 50 concurrent clients (65% capacity).
  • The latency penalty at maximum: P95 is 4.3x higher (1.5s vs. 350ms).
  • In this example, the recommendation is to size the endpoint for 480 RPS capacity and operate at ~310 RPS.

| Concurrency | P50 | P95 | P99 | RPS | Success | Capacity |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 80ms | 120ms | 150ms | 45 | 100% | 10% |
| 10 | 85ms | 140ms | 180ms | 88 | 100% | 20% |
| 20 | 95ms | 180ms | 250ms | 165 | 99.8% | 35% |
| 50 | 150ms | 350ms | 500ms | 310 | 99.2% | 65% ← Sweet spot |
| 100 | 250ms | 800ms | 1.2s | 420 | 97.5% | 90% ⚠️ Approaching max |
| 150 | 450ms | 1.5s | 2.5s | 480 | 95.0% | 100% ❌ Maximum RPS |

Operating at the maximum RPS can lead to the following issues:

  • Latency degradation. In the example, P95 is 350ms at 65% capacity but is 1.5s at 100% capacity.
  • No room to accommodate traffic bursts or spikes. At 100% capacity, any spike causes a timeout. At 65% capacity, a 50% spike in traffic can be handled without a problem.
  • Increased error rates. In the example, the success rate is 99.2% at 65% capacity but 95.0% — a 5% failure rate — at 100% capacity.
  • Risk of resource exhaustion. At maximum load, queues increase, memory pressure increases, connection pools start to saturate, and the recovery time after incidents increases.

The following table shows recommended operating points for different use cases.

| Use case | Target capacity | Rationale |
| --- | --- | --- |
| Latency-sensitive (search, chat) | 50-60% of max | Prioritize low P95/P99 latency |
| Balanced (recommendations) | 60-70% of max | Good balance of cost and latency |
| Cost-optimized (batch jobs) | 70-80% of max | Acceptable higher latency |
| Not recommended | > 85% of max | Latency spikes, no burst capacity |

Helper functions for calculating operating point and endpoint size

The following code plots QPS vs P95 latency. In the plot, look for the point where the curve starts to bend sharply upward. This is the optimal operating point.

Python
import matplotlib.pyplot as plt

# Plot QPS vs. P95 latency
qps_values = [45, 88, 165, 310, 420, 480]
p95_latency = [120, 140, 180, 350, 800, 1500]

plt.plot(qps_values, p95_latency, marker='o')
plt.axvline(x=310, color='green', linestyle='--', label='Optimal (65% capacity)')
plt.axvline(x=480, color='red', linestyle='--', label='Maximum (100% capacity)')
plt.xlabel('Queries Per Second (QPS)')
plt.ylabel('P95 Latency (ms)')
plt.title('QPS vs. Latency: Finding the Sweet Spot')
plt.legend()
plt.grid(True)
plt.show()

Step 7: Size your endpoint

Use the notebook's recommendation

After analyzing results, the notebook asks you to:

  1. Select the row that best meets your latency requirements.
  2. Input your application's desired RPS.

The notebook then displays a recommended endpoint size. It calculates the required capacity based on the following (a simplified version of this calculation is sketched after the list):

  • Your target RPS
  • Observed latency at different concurrency levels
  • Success rate thresholds
  • Safety margin (typically 2x expected peak load)
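
The following sketch shows a simplified version of that calculation. The target RPS is an example value, and the notebook's actual formula may differ.

Python
# Simplified sizing sketch; the notebook's actual calculation may differ.
target_rps = 200        # your application's expected peak requests per second
safety_margin = 2.0     # size for ~2x the expected peak load

required_max_rps = target_rps * safety_margin
print(f"Choose a size whose measured maximum RPS is at least {required_max_rps:.0f}")

# Cross-check against Step 6: at the expected peak, the endpoint then runs at
# target_rps / required_max_rps = 50% of capacity, leaving headroom for bursts.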

Scaling considerations

Standard endpoints:

  • Scale up automatically to support index size
  • Scale up manually to support throughput
  • Scale down automatically when indexes are deleted
  • Scale down manually to reduce capacity

Storage-optimized endpoints:

  • Scale up automatically to support index size
  • Scale down automatically when indexes are deleted

Step 8: Validate the final configuration

After updating your endpoint configuration:

  1. Wait for the endpoint to be ready. This can take several minutes.
  2. Run the final validation test in the notebook.
  3. Confirm performance meets your requirements (a simple check is sketched after this list):
    • RPS ≥ target throughput
    • P95 latency meets SLA
    • Success rate > 99.5%
    • No sustained errors
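
A minimal sketch of such a check, assuming the final test's aggregate metrics are available in a dictionary (the variable names and values are illustrative):

Python
# Illustrative check of the final validation run against the targets above.
final_results = {"rps": 310, "p95_ms": 350, "success_rate": 0.998}   # placeholder metrics

target_rps = 250      # example application target
p95_sla_ms = 500      # example SLA

assert final_results["rps"] >= target_rps, "Throughput below target"
assert final_results["p95_ms"] <= p95_sla_ms, "P95 latency exceeds SLA"
assert final_results["success_rate"] > 0.995, "Success rate at or below 99.5%"
print("Endpoint configuration validated")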

If validation fails, try the following:

  • Increase endpoint capacity
  • Optimize query complexity
  • Review filter performance
  • Check embedding endpoint configuration

When to re-test

To maintain performance visibility, run baseline load tests quarterly. You should also re-test when you:

  • Change query patterns or complexity
  • Update the vector search index
  • Modify filter configurations
  • Expect significant traffic increases
  • Deploy new features or optimizations
  • Change from standard to storage-optimized endpoint types

Summary of best practices

Test configuration

  • Run tests for at least 5 minutes at peak load.

  • Use OAuth service principals for authentication.

  • Create realistic query payloads that match expected production queries.

  • Test with production-like filters and parameters.

  • Include a warmup period before measuring.

  • Test at multiple concurrency levels.

  • Track P95/P99 latencies, not just averages.

  • Test both cached and uncached performance.

    Python
    # Conservative approach: size the endpoint for UNCACHED performance.
    # (run_load_test, calculate_capacity, and the query lists are placeholder names.)
    uncached_results = run_load_test(diverse_queries, duration=600)
    endpoint_size = calculate_capacity(uncached_results, target_rps=500)

    # Then verify cached performance is even better
    cached_results = run_load_test(repetitive_queries, duration=300)
    print(f"Cached P95: {cached_results['p95']}ms (bonus performance)")

Query set design

  • Match your test query diversity to real traffic distribution (frequent and rare queries).
  • Use actual queries from logs (anonymized).
  • Include different query complexities.
  • Test both cached and uncached scenarios and track the results separately.
  • Test with expected filter combinations.
  • Use the same parameters that you will use in production. For example, if you use hybrid search in production, include hybrid search queries, and use a num_results value similar to what you use in production.
  • Don't use queries that will never occur in production.

Performance optimization

If latencies are too high, try the following:

  1. Use OAuth service principals (not PATs) - up to 100ms improvement
  2. Reduce num_results - Fetching 100 results is slower than fetching 10
  3. Optimize filters - Complex or overly restrictive filters slow down queries
  4. Check the embedding endpoint - Ensure it isn't scaled to zero and has enough capacity

If you are hitting rate limits, try the following:

  1. Increase endpoint capacity - Scale up your endpoint
  2. Implement client-side rate limiting or spread queries in time
  3. Use connection pooling - Reuse connections
  4. Add retry logic - Use exponential backoff (already a part of the Python SDK)
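
As an example of client-side rate limiting, the following sketch spaces requests out to cap a single client's request rate. The rate value and the send function are placeholders.

Python
# Minimal client-side pacing sketch: cap the outgoing request rate so sustained
# traffic stays under the endpoint's limits.
import time

max_rps = 50                    # per-client request budget (placeholder value)
min_interval = 1.0 / max_rps    # minimum spacing between requests
last_sent = 0.0

def send_paced(send_query, payload):
    """Send a query, sleeping first if needed to respect the per-client rate cap."""
    global last_sent
    wait = min_interval - (time.monotonic() - last_sent)
    if wait > 0:
        time.sleep(wait)
    last_sent = time.monotonic()
    return send_query(payload)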

Additional resources