Configure a load test for vector search endpoints
This page provides an example notebook for load testing and covers setup requirements, authentication, cluster configuration, and step-by-step instructions for running load tests to optimize vector search endpoint performance.
For more information about load testing and related concepts, see Load testing for serving endpoints.
Requirements
Download and import a copy of the following files and example notebook to your Databricks workspace:
- input.json. This file specifies the payload that is sent by all concurrent connections to your endpoint. If you're testing an endpoint that is sensitive to the size of the payload, verify that the input payload reflects how you expect the endpoint to be used. See Test the payload.
- fast_vs_load_test_async_load.py. This script is used by the example notebook to validate your authentication token and read the input.json file contents.
Locust load test notebook
For best performance, select a large number of cores and high memory for the workers on the cluster you use to run the notebook.
Locust
Locust is an open source framework for load testing that is commonly used to evaluate production-grade endpoints. The Locust framework allows you to modify various parameters, like the number of client connections and how fast client connections spawn, while measuring your endpoint’s performance throughout the test. Locust is used for all of the example code as it standardizes and automates the approach.
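To illustrate how a Locust test is structured, the following is a minimal sketch of a Locust user class. It is not the fast_vs_load_test_async_load.py script that ships with the example notebook, and the query path, payload file, and token handling shown here are placeholder assumptions.

```python
# Minimal Locust sketch (not the notebook's script): each simulated user POSTs
# the contents of input.json to a placeholder query path on the endpoint.
import json

from locust import HttpUser, constant, task

# Placeholder values -- the real path, host, and token come from the notebook setup.
QUERY_PATH = "/api/2.0/vector-search/endpoints/<endpoint-name>/query"  # hypothetical path
OAUTH_TOKEN = "<oauth-token-from-service-principal>"

with open("input.json") as f:
    PAYLOAD = json.load(f)


class VectorSearchUser(HttpUser):
    # No think time between requests, so each user drives as much load as it can.
    wait_time = constant(0)

    @task
    def query_endpoint(self):
        self.client.post(
            QUERY_PATH,
            json=PAYLOAD,
            headers={"Authorization": f"Bearer {OAUTH_TOKEN}"},
        )
```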
Locust relies on CPU resources to run its tests. Depending on the payload, each CPU core can drive roughly 4000 requests per second. In the example notebook, the --processes -1 flag is set so that Locust auto-detects the number of CPU cores on your driver and fully uses them.
If Locust is bottlenecked by the CPU, an output message appears.
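For reference, a headless Locust run that uses this flag might be launched roughly as follows. The locustfile name, user count, spawn rate, run time, CSV prefix, and host are placeholders; the example notebook handles the equivalent invocation for you.

```python
# Sketch of a headless Locust invocation similar to what the notebook automates.
# All values below are placeholders.
import subprocess

subprocess.run(
    [
        "locust",
        "-f", "locustfile.py",       # the Locust test file
        "--headless",                # run without the web UI
        "--processes", "-1",         # spawn one worker process per detected CPU core
        "-u", "64",                  # number of concurrent simulated users
        "-r", "64",                  # spawn rate (users started per second)
        "-t", "5m",                  # duration of this load test
        "--csv", "vs_load_test",     # prefix for the CSV result files
        "--host", "https://<workspace-url>",
    ],
    check=True,
)
```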
Set up service principal
Do this outside of the example notebook.
To interact with the route-optimized endpoint, the Locust test needs to be able to generate OAuth tokens with permissions to query the endpoint. Follow these steps to set up authentication:
- Create a Databricks service principal.
- Navigate to the vector search endpoint page. Click Permissions and grant the service principal Can Query permissions.
- Create a Databricks secret scope with two keys:
  - The ID of your Databricks service principal. For example: service_principal_client_id.
  - The client secret for your Databricks service principal. For example: service_principal_client_secret.
- Put the client ID and client secret of your service principal into the Databricks secrets you created.
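For context, the token exchange can be sketched as follows, assuming you run it in a Databricks notebook (so dbutils is available), that your secret keys use the example names above, and that your workspace uses the standard Databricks OAuth client credentials flow. The scope name and workspace URL are placeholders; the example notebook and its supporting script handle token validation for you.

```python
# Sketch: read service principal credentials from the secret scope and request
# an OAuth token with the client credentials flow. Scope, key names, and
# workspace URL are placeholders.
import requests

SECRET_SCOPE = "<your-secret-scope>"            # placeholder
WORKSPACE_URL = "https://<your-workspace-url>"  # placeholder

client_id = dbutils.secrets.get(SECRET_SCOPE, "service_principal_client_id")
client_secret = dbutils.secrets.get(SECRET_SCOPE, "service_principal_client_secret")

response = requests.post(
    f"{WORKSPACE_URL}/oidc/v1/token",
    auth=(client_id, client_secret),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
)
response.raise_for_status()
oauth_token = response.json()["access_token"]
```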
Set up notebook
The following sections describe how to set up your example notebook and the supporting files that you downloaded.
Configure variables
In your copy of the example notebook, configure the following parameters:
| Parameter | Description |
|---|---|
| | The name of your vector search endpoint. |
| | How long to run each individual load test. Many load tests are run, so a duration of 5-10 minutes is a good default. |
| | Locust load tests output CSV files of information and metrics. This string defines a prefix that is prepended to the CSV file names. |
| | The name of your Databricks secret scope that contains the service principal information. |
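The parameter names depend on the copy of the notebook you downloaded, so the assignments below are a hypothetical sketch only; use the names that appear in your notebook.

```python
# Hypothetical parameter values -- the variable names in the example notebook
# may differ from the ones used here.
VS_ENDPOINT_NAME = "my-vector-search-endpoint"  # name of the vector search endpoint
TEST_DURATION = "5m"                            # how long each individual load test runs
CSV_OUTPUT_PREFIX = "vs_load_test"              # prefix prepended to the Locust CSV output files
SECRET_SCOPE = "my-secret-scope"                # secret scope holding the service principal credentials
```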
Specify a payload
Specify your payload in the input.json file alongside the example notebook.
To check the validity of the load test results, it is important to consider the payload that should be sent by the Locust clients. Choose a payload that accurately represents the type of payload that you plan to send in production. For example, if your model is a fraud detection model that assesses credit card transactions in real time, such as one transaction per request, verify your payload represents only one typical transaction.
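As an illustration only, a query payload for a vector search endpoint might be written to input.json as in the sketch below. The field names (query_text, columns, num_results) and values are assumptions; confirm the schema your endpoint expects before using it in a load test.

```python
# Illustrative only: write a sample vector search query payload to input.json.
# Field names are assumptions -- confirm the schema your endpoint expects.
import json

payload = {
    "query_text": "example query representative of production traffic",
    "columns": ["id", "text"],
    "num_results": 10,
}

with open("input.json", "w") as f:
    json.dump(payload, f, indent=2)
```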
Test the payload
Test your payload by copying and pasting the full input.json data into the Query window on your vector search endpoint and confirming that your model responds with the desired outputs.
To open the Query box for your endpoint:
- From the left sidebar in your Databricks workspace, select Serving.
- Select the endpoint to use for load testing.
- In the upper right corner, from the Use dropdown menu, select Query.
The endpoint concurrency required to achieve a certain percentile of latency scales linearly with the number of concurrent connections. This means you can test on a small endpoint and calculate the endpoint size you need before performing a final test.
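As a made-up numerical illustration of that linear relationship (the notebook performs the real calculation from your measured results):

```python
# Made-up numbers illustrating the linear scaling rule: estimate the endpoint
# size needed for production concurrency from a test on a small endpoint.
small_endpoint_size = 4           # capacity of the endpoint used for the test
tested_client_concurrency = 8     # concurrent connections that met the latency target
target_client_concurrency = 64    # concurrent connections expected in production

estimated_size = small_endpoint_size * (target_client_concurrency / tested_client_concurrency)
print(estimated_size)  # 32.0 in this made-up example
```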
Run the load test
After your endpoint, notebook, and payload are configured, you can begin stepping through the notebook.
The notebook runs a 30-second load test against your endpoint to confirm that the endpoint is online and responding.
In the example notebook, you can run a series of load tests using different amounts of client side concurrency. After completing the series of load tests, the notebook results show the content of any request failures or exceptions, and also display a plot of latency percentiles against client concurrency.
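If you want to inspect the raw Locust output outside the notebook, the CSV files written under your configured prefix (for example, <prefix>_stats.csv and <prefix>_stats_history.csv) can be loaded with pandas. This is an optional sketch, not part of the notebook; the prefix shown is a placeholder.

```python
# Optional: load the Locust CSV output for ad hoc inspection.
# Replace the prefix with the CSV prefix you configured in the notebook.
import pandas as pd

prefix = "vs_load_test"  # placeholder

stats = pd.read_csv(f"{prefix}_stats.csv")            # aggregate stats per request, incl. latency percentiles
history = pd.read_csv(f"{prefix}_stats_history.csv")  # stats sampled over the course of the run
failures = pd.read_csv(f"{prefix}_failures.csv")      # failed requests, if any

print(stats.head())
print(failures.head())
```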
The notebook shows a table of results. Select the row that best meets your latency requirements and enter your application's desired RPS. Based on the information you provide, the notebook recommends how to size your endpoint to meet your RPS and latency goals.
After updating your endpoint configuration to match the notebook's recommendations, you can run the notebook's final load test to confirm that the endpoint is meeting both latency and RPS requirements.