Scale endpoint throughput with high QPS
This feature is in Public Preview.
By default, standard endpoints support 20–200 QPS depending on index size. Real-time applications such as search bars, recommendation systems, and entity matching often require 100–1000+ QPS. On standard endpoints only, you can set a target QPS. Databricks provisions the infrastructure to best match that throughput level (best-effort, not guaranteed).
Setting a target QPS provisions additional capacity, which increases the cost of the endpoint. You are charged for this additional capacity regardless of actual query traffic. Throughput scaling is best-effort and not guaranteed during Public Preview.
Use high QPS when:
- Your application requires more than 50 QPS of sustained throughput.
- You receive 429 (Too Many Requests) errors under normal load.
- Latency degrades as traffic ramps up, even when average utilization appears low.
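One way to check the first two criteria above against your own traffic is to summarize a request log. The following is a minimal sketch, assuming you can export `(timestamp, HTTP status)` pairs from your application logs; `summarize_traffic` and `needs_high_qps` are hypothetical helpers, not part of any Databricks library. The 50 QPS threshold comes from the list above.

```python
from collections import Counter


def summarize_traffic(records):
    """Summarize (unix_timestamp_seconds, http_status) pairs from a request log."""
    per_second = Counter()
    throttled = 0
    total = 0
    for timestamp, status in records:
        per_second[int(timestamp)] += 1  # bucket requests by whole second
        total += 1
        if status == 429:
            throttled += 1
    if total == 0:
        return {"peak_qps": 0, "mean_qps": 0.0, "throttle_rate": 0.0}
    # Span covers every second from the first to the last observed request.
    span_seconds = max(per_second) - min(per_second) + 1
    return {
        "peak_qps": max(per_second.values()),
        "mean_qps": total / span_seconds,
        "throttle_rate": throttled / total,
    }


def needs_high_qps(summary, sustained_threshold_qps=50):
    """Return True if mean QPS exceeds the threshold or any request was throttled."""
    return (
        summary["mean_qps"] > sustained_threshold_qps
        or summary["throttle_rate"] > 0
    )
```

Mean QPS over the whole log is a rough proxy for sustained throughput. For bursty traffic, the peak value is the more relevant number.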
Requirements
- High QPS is available for standard endpoints only. Storage-optimized endpoints are not supported.
- Use service principal (OAuth) authentication for high-QPS production workloads. Service principal traffic routes through performance-optimized networks built for high-QPS workloads. Personal access tokens (PATs) route through networks capped at a few tens of QPS, which is adequate for prototyping but not for production. See Use service principals with OAuth tokens.
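For illustration, a service principal typically obtains an OAuth token through a client-credentials request. The sketch below only builds that request and does not send it. The `/oidc/v1/token` path and the `all-apis` scope are assumptions taken from the general Databricks OAuth machine-to-machine flow, not from this page, and `build_token_request` is a hypothetical helper. Confirm the details in Use service principals with OAuth tokens.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(workspace_url, client_id, client_secret):
    """Build (but do not send) an OAuth client-credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic_auth = base64.b64encode(credentials).decode("ascii")
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}  # assumed scope
    ).encode("ascii")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/oidc/v1/token",  # assumed token path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic_auth}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. The response is expected to be JSON containing an access token to use as a bearer token on later calls.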
Configure target QPS
Set a target QPS when creating a new endpoint or updating an existing one. The additional capacity needed to best match the target throughput is provisioned automatically. In Public Preview, throughput scaling is best-effort and not guaranteed: actual QPS depends on your index size, vector dimensionality, query complexity, and filter usage.
You can set target QPS in the Databricks UI, with the Python SDK, or through the REST API.
Databricks UI
When creating a new endpoint:
- In the left sidebar, click Compute.
- Click the AI Search tab and click Create endpoint.
- Under Advanced Settings, enter the Target QPS value.
When updating an existing endpoint:
- Navigate to the endpoint detail page.
- In the right panel, click the pencil icon next to Target QPS.
- Enter the new value and click Save.
Python SDK
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a new endpoint with target QPS
endpoint = client.create_endpoint(
    name="my-high-qps-endpoint",
    endpoint_type="STANDARD",
    target_qps=500,
)

# Update an existing endpoint's target QPS
response = client.update_endpoint(name="my-endpoint", target_qps=500)

# Check scaling status
scaling_info = response.get("endpoint", {}).get("scaling_info", {})
print(f"Requested target QPS: {scaling_info.get('requested_target_qps')}")
print(f"State: {scaling_info.get('state')}")
# State is "SCALING_CHANGE_IN_PROGRESS" while capacity is being provisioned,
# then transitions to "SCALING_CHANGE_APPLIED"
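Because provisioning takes time, a script that updates target QPS may want to wait until the state reaches SCALING_CHANGE_APPLIED before sending production traffic. The following is a minimal polling sketch. `wait_for_scaling` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions. It takes a callable that returns the current `scaling_info` dictionary, so it makes no assumption about how you fetch it.

```python
import time


def wait_for_scaling(
    get_scaling_info,
    timeout_s=1800,        # assumed upper bound; tune for your workload
    poll_interval_s=30,    # assumed polling interval
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until scaling state is SCALING_CHANGE_APPLIED or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        info = get_scaling_info()
        if info.get("state") == "SCALING_CHANGE_APPLIED":
            return info
        if clock() >= deadline:
            raise TimeoutError(
                f"Scaling not applied within {timeout_s}s; last state: {info.get('state')}"
            )
        sleep(poll_interval_s)
```

For example, you might pass `lambda: client.get_endpoint("my-endpoint").get("scaling_info", {})`. The exact location of `scaling_info` in that response is an assumption to verify against your SDK version.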
REST API
Create an endpoint with target QPS:
POST /api/2.0/vector-search/endpoints
{
  "name": "my-high-qps-endpoint",
  "endpoint_type": "STANDARD",
  "target_qps": 500
}
Update target QPS on an existing endpoint:
PATCH /api/2.0/vector-search/endpoints/<ENDPOINT_NAME>
{
  "target_qps": 500
}
Check scaling status:
GET /api/2.0/vector-search/endpoints/<ENDPOINT_NAME>
The scaling_info field in the response shows the requested_target_qps and the scaling state. The state is SCALING_CHANGE_IN_PROGRESS while capacity is being provisioned, then transitions to SCALING_CHANGE_APPLIED.
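As an illustration, the PATCH call above can be assembled with the Python standard library. This sketch only builds the request. `build_update_request` is a hypothetical helper, and the bearer-token header is an assumption about how you authenticate.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(workspace_url, endpoint_name, target_qps, token):
    """Build (but do not send) a PATCH request that sets target_qps on an endpoint."""
    # Percent-encode the endpoint name so it is safe to place in the URL path.
    encoded_name = urllib.parse.quote(endpoint_name, safe="")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/vector-search/endpoints/{encoded_name}",
        data=json.dumps({"target_qps": target_qps}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed bearer-token auth
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send the request. Keeping construction separate from sending makes the payload easy to inspect or log first.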
How scaling applies
After you set a target QPS, the required capacity is provisioned automatically. The new throughput level applies after provisioning completes; you do not need to sync indexes to trigger the change.
Attempting to update target QPS while a scaling operation is in progress returns a RESOURCE_CONFLICT error. Wait for the current operation to complete before retrying.
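One way to handle this in automation is to retry the update after a pause when the conflict error appears. The sketch below is hypothetical. It assumes the raised exception's message contains the string RESOURCE_CONFLICT, which you should verify for your client, and the wait time and attempt count are arbitrary.

```python
import time


def message_has_conflict(exc):
    """Assumption: the error message carries the RESOURCE_CONFLICT code."""
    return "RESOURCE_CONFLICT" in str(exc)


def update_with_conflict_retry(
    update,
    is_conflict=message_has_conflict,
    max_attempts=5,
    wait_s=60,
    sleep=time.sleep,
):
    """Call update(); on a conflict error, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return update()
        except Exception as exc:
            # Re-raise anything that is not a conflict, or the final failed attempt.
            if not is_conflict(exc) or attempt == max_attempts:
                raise
            sleep(wait_s)
```

For example, `update_with_conflict_retry(lambda: client.update_endpoint(name="my-endpoint", target_qps=500))` would retry only while a scaling operation is still in progress.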
Limitations
- No autoscaling: You must set target QPS manually based on expected traffic. If traffic exceeds the provisioned level, 429 errors occur. See Plan for query spikes.
- Standard endpoints only: Storage-optimized endpoints do not support target_qps.
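Because there is no autoscaling, clients may want to absorb brief spikes by retrying throttled queries. The following sketch shows exponential backoff with full jitter. `TooManyRequests` is a hypothetical exception standing in for however your client surfaces a 429 response, and the delay values are arbitrary assumptions.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for a 429 (Too Many Requests) response."""


def query_with_backoff(
    run_query,
    max_attempts=5,
    base_delay_s=0.1,
    max_delay_s=5.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Run run_query(), retrying on TooManyRequests with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the throttling error
            # Delay ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: sleep a random fraction of the ceiling.
            sleep(ceiling * rng())
```

Retries smooth over short bursts only. If 429 errors persist under normal load, raising the target QPS is the more appropriate fix.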