Optimize Model Serving endpoints for production
Learn how to optimize Model Serving endpoints for production workloads that require high throughput, low latency, and reliable performance.
Optimization strategies fall into three categories:
- Infrastructure optimizations: Configure endpoint infrastructure for better performance
- Model optimizations: Improve model efficiency and throughput
- Client-side optimizations: Optimize how clients interact with serving endpoints
When to optimize your endpoint
Consider optimizing your Model Serving endpoint when you encounter any of the following scenarios:
- High query volume: Your application sends more than 200 queries per second (QPS) to a single endpoint
- Latency requirements: Your application requires sub-100ms response times
- Scaling bottlenecks: Endpoints experience queuing or return HTTP 429 errors during traffic spikes
- Cost optimization: You want to reduce serving costs while maintaining performance targets
- Production preparation: You're preparing to move from development to production workloads
Infrastructure optimizations
Infrastructure optimizations improve network routing, scaling behavior, and compute capacity.
Route optimization
Route optimization provides the most significant infrastructure improvement for high-throughput workloads. When you enable route optimization on an endpoint, Databricks Model Serving improves the network path for inference requests, resulting in faster, more direct communication between clients and models.
Performance benefits:
| Feature | Standard endpoint limit | Route-optimized endpoint limit |
|---|---|---|
| Queries per second (QPS) per workspace | 200 | 50,000+ (contact Databricks for higher limits) |
| Client concurrency per workspace | 192-1024 (varies by region) | No explicit limit (limited by provisioned concurrency) |
| Endpoint provisioned concurrency per served entity | 1,024 | 1,024 (contact Databricks for higher limits) |
When to use route optimization:
- Workloads requiring more than 200 QPS
- Applications with strict latency requirements (sub-50ms overhead)
- Production deployments serving multiple concurrent users
Route optimization is only available for custom model serving endpoints. Foundation Model APIs and external models do not support route optimization. OAuth tokens are required for authentication; personal access tokens are not supported.
See Route optimization on serving endpoints for setup instructions and Query route-optimized serving endpoints for querying details.
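The following is a minimal sketch of creating a route-optimized custom model endpoint with the Databricks Python SDK. It assumes the `route_optimized` flag and `ServedEntityInput` fields exposed by the serving endpoints API; the endpoint name and the Unity Catalog model name are placeholders, so follow the linked setup instructions for the authoritative steps.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()  # authenticates from the environment or ~/.databrickscfg

# Route optimization can only be enabled when the endpoint is created.
# The flag name below follows the serving endpoints API; verify it against
# the setup documentation before relying on it.
w.serving_endpoints.create(
    name="my-route-optimized-endpoint",           # placeholder name
    route_optimized=True,
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.my_model",  # hypothetical UC model
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=False,
            )
        ]
    ),
)
```

Remember that queries to a route-optimized endpoint must authenticate with an OAuth token rather than a personal access token.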
Provisioned concurrency
Provisioned concurrency controls how many simultaneous requests your endpoint can process. Configure provisioned concurrency based on your expected QPS and latency requirements.
Configuration guidelines:
- Minimum concurrency: Set high enough to handle baseline traffic without queuing
- Maximum concurrency: Set high enough to accommodate traffic spikes while controlling costs
- Autoscaling: Enable autoscaling to dynamically adjust capacity based on demand
Calculate required concurrency:
Required Concurrency = Target QPS × Average Latency (seconds)
For example, if your target is 100 QPS with 200ms average latency:
Required Concurrency = 100 × 0.2 = 20
Use load testing to measure actual latency and determine optimal concurrency settings.
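As a quick sanity check, the helper below applies the formula above and rounds up, with an optional headroom multiplier for bursts. The 1.5x headroom is an illustrative assumption, not a Databricks recommendation; replace it with a value informed by your load tests.

```python
import math

def required_concurrency(target_qps: float, avg_latency_s: float, headroom: float = 1.5) -> int:
    """Estimate provisioned concurrency: QPS x average latency, padded for bursts."""
    return math.ceil(target_qps * avg_latency_s * headroom)

# 100 QPS at 200 ms average latency -> 20 with the bare formula, 30 with 1.5x headroom
print(required_concurrency(100, 0.2, headroom=1.0))  # 20
print(required_concurrency(100, 0.2))                # 30
```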
Instance types
Choose instance types based on your model's compute requirements:
| Instance type | Best for | Trade-offs |
|---|---|---|
| CPU (Small, Medium, Large) | Lightweight models, simple inference logic | Lower cost, slower for compute-intensive models |
| GPU (Small, Medium, Large) | Large models, complex computations, image/video processing | Higher cost, optimal performance for deep learning |
Start with CPU instances for development and testing. Switch to GPU instances only if you observe high inference latency or your model requires specialized compute (such as deep learning operations).
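If you manage endpoints through the serving API, the served entity's `workload_type` and `workload_size` fields control the compute class. The sketch below assumes those field names and the `GPU_SMALL` value from the serving endpoints API, with a hypothetical endpoint and model name; confirm the workload types available in your region before applying a change like this.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedEntityInput

w = WorkspaceClient()

# Move an existing served model from CPU to a small GPU workload.
w.serving_endpoints.update_config(
    name="my-endpoint",                          # placeholder endpoint name
    served_entities=[
        ServedEntityInput(
            entity_name="catalog.schema.my_model",  # hypothetical UC model
            entity_version="2",
            workload_type="GPU_SMALL",              # assumed API value
            workload_size="Small",
            scale_to_zero_enabled=False,
        )
    ],
)
```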
Model optimizations
Model optimizations improve inference speed and resource efficiency.
Model size and complexity
Smaller, less complex models generally lead to faster inference times and higher QPS. Consider techniques like model quantization or pruning if your model is large.
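For PyTorch models dominated by large linear layers, dynamic quantization is one low-effort technique to try before serving. This is a generic PyTorch sketch rather than a Databricks-specific API, and the accuracy impact should be validated before you log and deploy the quantized model.

```python
import torch
import torch.nn as nn

# Stand-in model; substitute your trained model before logging it to MLflow.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization stores Linear weights as int8, which typically shrinks
# the model and speeds up CPU inference at a small accuracy cost.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    output = quantized(torch.randn(1, 512))
print(output.shape)  # torch.Size([1, 10])
```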
Batching
If your application can group multiple records into a single request, batch them on the client side. Batching significantly reduces the overhead per prediction.
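A hedged sketch of client-side batching against a custom model endpoint: several records travel in one `dataframe_records` payload instead of one HTTP call per record. The workspace host, endpoint name, feature columns, and token handling are placeholders.

```python
import os
import requests

ENDPOINT_URL = (
    "https://<workspace-host>/serving-endpoints/my-endpoint/invocations"  # placeholder
)
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# One request carrying a batch of records amortizes network and scoring overhead.
batch = [
    {"feature_a": 1.0, "feature_b": 0.3},
    {"feature_a": 0.7, "feature_b": 0.9},
    {"feature_a": 0.2, "feature_b": 0.5},
]

response = requests.post(
    ENDPOINT_URL, headers=HEADERS, json={"dataframe_records": batch}, timeout=30
)
response.raise_for_status()
print(response.json())  # one prediction per record in the batch
```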
Pre-processing and post-processing optimization
Offload complex pre-processing and post-processing from serving endpoints to reduce load on inference infrastructure.
Client-side optimizations
Client-side optimizations improve how applications interact with serving endpoints.
Connection pooling
Connection pooling reuses existing connections instead of creating new connections for each request, significantly reducing overhead.
- Use the Databricks SDK, which automatically implements connection pooling best practices
- If you use a custom HTTP client, implement connection pooling yourself, as in the sketch after this list
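For a custom client built on `requests`, a single `Session` with a sized connection pool reuses TCP and TLS connections across calls. The pool sizes below are illustrative and should roughly match your client-side concurrency; the scoring URL is a placeholder.

```python
import os
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep up to 32 connections alive and reuse them across requests,
# avoiding a new TCP/TLS handshake for every inference call.
adapter = HTTPAdapter(pool_connections=32, pool_maxsize=32)
session.mount("https://", adapter)
session.headers.update({"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"})

def score(url: str, payload: dict) -> dict:
    """Send one scoring request over the pooled session."""
    resp = session.post(url, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```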
Error handling and retry strategies
Implement robust error handling to gracefully handle temporary failures, especially during autoscaling events or network disruptions.
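One common pattern is retrying HTTP 429 and transient 5xx responses with exponential backoff, built here on `urllib3`'s `Retry` helper. This is a sketch under assumptions: the retry budget and backoff factor are illustrative, and the workspace host is a placeholder; tune both to your latency budget and idempotency requirements.

```python
import os
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                                   # overall retry budget
    backoff_factor=0.5,                        # 0.5s, 1s, 2s, ... between attempts
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["POST"],                  # invocations are POST requests
    respect_retry_after_header=True,           # honor Retry-After on 429 responses
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers.update({"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"})

resp = session.post(
    "https://<workspace-host>/serving-endpoints/my-endpoint/invocations",  # placeholder
    json={"dataframe_records": [{"feature_a": 1.0, "feature_b": 0.3}]},
    timeout=30,
)
resp.raise_for_status()
```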
Payload size optimization
Minimize request and response payload sizes to reduce network transfer time and improve throughput.
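As a small illustration, sending only the columns the model needs and using the more compact `dataframe_split` format (column names sent once rather than per row) both shrink the request body. The column names are placeholders.

```python
import json

# Verbose: column names repeated for every record.
records_payload = {
    "dataframe_records": [
        {"feature_a": 1.0, "feature_b": 0.3},
        {"feature_a": 0.7, "feature_b": 0.9},
    ]
}

# Compact: column names listed once, rows as plain arrays.
split_payload = {
    "dataframe_split": {
        "columns": ["feature_a", "feature_b"],
        "data": [[1.0, 0.3], [0.7, 0.9]],
    }
}

print(len(json.dumps(records_payload)), len(json.dumps(split_payload)))
```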
Measure and improve performance
Performance monitoring
Monitor endpoint performance using the tools provided by Mosaic AI Model Serving:
| Metric | What it measures | Target | Action if exceeded |
|---|---|---|---|
| Latency (P50, P90, P99) | Response time for requests | Application-dependent (typically <100-500ms) | Check for queuing, optimize model or client |
| Throughput (QPS) | Requests completed per second | Workload-dependent | Enable route optimization, increase provisioned concurrency |
| Error rate | Percentage of failed requests | <1% | Review service logs, check for capacity issues |
| Queue depth | Requests waiting for processing | 0 (no queuing) | Increase provisioned concurrency or enable autoscaling |
| CPU/Memory usage | Resource utilization | <80% | Scale up instance type or increase concurrency |
See Monitor model quality and endpoint health for detailed monitoring guidance and Track and export serving endpoint health metrics to Prometheus and Datadog for exporting metrics to observability tools.
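For ad hoc checks outside a full observability stack, endpoint metrics can be pulled in Prometheus text format from the metrics API covered in the export guide. The workspace host and endpoint name below are placeholders, and the exact path should be confirmed against that guide.

```python
import os
import requests

WORKSPACE = "https://<workspace-host>"  # placeholder
ENDPOINT_NAME = "my-endpoint"           # placeholder

resp = requests.get(
    f"{WORKSPACE}/api/2.0/serving-endpoints/{ENDPOINT_NAME}/metrics",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
# Prometheus/OpenMetrics text: latency, request rate, and resource usage series.
print(resp.text)
```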
Load testing
Load testing measures endpoint performance under realistic traffic conditions and helps you:
- Determine optimal provisioned concurrency settings
- Identify performance bottlenecks
- Validate latency and throughput requirements
- Understand the relationship between client concurrency and server concurrency
See Load testing for serving endpoints.
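As a minimal sketch, the script below drives a fixed client concurrency with a thread pool and reports latency percentiles; for production-grade testing, use a dedicated tool as described in the linked guide. The endpoint URL and payload are placeholders.

```python
import os
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://<workspace-host>/serving-endpoints/my-endpoint/invocations"  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
PAYLOAD = {"dataframe_records": [{"feature_a": 1.0, "feature_b": 0.3}]}

def one_request(_: int) -> float:
    """Send one scoring request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=30).raise_for_status()
    return time.perf_counter() - start

# Drive 16 concurrent clients for 200 requests, then summarize latency.
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(one_request, range(200)))

print(f"p50={statistics.median(latencies) * 1000:.0f} ms")
print(f"p99={latencies[int(len(latencies) * 0.99)] * 1000:.0f} ms")
```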
Troubleshoot common performance issues
Queuing
Model Serving supports autoscaling to adjust capacity based on traffic patterns. However, sudden traffic surges can cause queuing because autoscaling requires time to detect increased load and provision additional capacity. During this period, incoming requests may temporarily exceed available capacity, causing requests to queue.
Queuing occurs when the request rate or concurrency surpasses the endpoint's current processing capacity. This typically happens during sharp traffic spikes, workload bursts, or when the endpoint has insufficient provisioned concurrency. Model Serving endpoints allow temporary queuing to handle bursts, but beyond a defined threshold, the endpoint returns HTTP 429 (Too Many Requests) errors to protect system stability.
Queuing increases latency because queued requests wait before being processed. To minimize queuing:
- Set minimum provisioned concurrency high enough to handle baseline traffic plus typical bursts
- Enable route optimization for higher capacity limits
- Implement retry logic with exponential backoff in your client applications
External API bottlenecks
Models often call external APIs for data enrichment, feature retrieval, or other tasks during inference. These external dependencies can become performance bottlenecks:
- Latency: Measure the response time of each external API call. High latency in these calls directly increases overall serving latency and reduces throughput.
- Throughput limits: External APIs may impose rate limits or capacity constraints. Exceeding these limits can cause throttling, errors, and performance degradation.
- Error rates: Frequent errors from external APIs can trigger retries and increase load on your serving endpoint.
- Caching: Implement caching for frequently accessed data from external APIs to reduce the number of calls and improve response times (see the sketch below)
Monitor these factors to identify bottlenecks and implement targeted optimizations for high-throughput workloads.
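To follow the caching recommendation above, one lightweight option is an in-process TTL cache around the external lookup in the model's predict path. The `cachetools` dependency, the enrichment URL, and the `customer_id` key are illustrative assumptions; size the cache and TTL to match how fresh the enrichment data must be.

```python
import requests
from cachetools import TTLCache, cached

# Cache up to 10,000 enrichment results for 5 minutes each, so repeated
# lookups for the same key skip the external API entirely.
enrichment_cache = TTLCache(maxsize=10_000, ttl=300)

@cached(enrichment_cache)
def fetch_enrichment(customer_id: str) -> dict:
    # Hypothetical external feature/enrichment service.
    resp = requests.get(
        f"https://enrichment.example.com/customers/{customer_id}", timeout=5
    )
    resp.raise_for_status()
    return resp.json()

def predict(record: dict) -> dict:
    """Combine the incoming record with cached enrichment data before scoring."""
    features = {**record, **fetch_enrichment(record["customer_id"])}
    # ... run the actual model on `features` ...
    return {"features_used": list(features)}
```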