Optimize Model Serving endpoints for production

Learn how to optimize Model Serving endpoints for production workloads that require high throughput, low latency, and reliable performance.

Optimization strategies fall into three categories: infrastructure optimizations, model optimizations, and client-side optimizations.

When to optimize your endpoint

Consider optimizing your Model Serving endpoint when you encounter any of the following scenarios:

  • High query volume: Your application sends more than 50k queries per second (QPS) to a single endpoint
  • Latency requirements: Your application requires sub-100ms response times
  • Scaling bottlenecks: Endpoints experience queuing or return HTTP 429 errors during traffic spikes
  • Cost optimization: You want to reduce serving costs while maintaining performance targets
  • Production preparation: You're preparing to move from development to production workloads

Infrastructure optimizations

Infrastructure optimizations improve network routing, scaling behavior, and compute capacity.

Route optimization

Route optimization provides the most significant infrastructure improvement for high-throughput workloads. When you enable route optimization on an endpoint, Databricks Model Serving improves the network path for inference requests, resulting in faster, more direct communication between clients and models.

Performance benefits:

| Feature | Standard endpoint limit | Route-optimized endpoint limit |
| --- | --- | --- |
| Queries per second (QPS) per workspace | 200 | 50,000+ (contact Databricks for higher limits) |
| Client concurrency per workspace | 192-1024 (varies by region) | No explicit limit (limited by provisioned concurrency) |
| Endpoint provisioned concurrency per served entity | 1,024 | 1,024 (contact Databricks for higher limits) |

When to use route optimization:

  • Workloads requiring more than 200 QPS
  • Applications with strict latency requirements (sub-50ms overhead)
  • Production deployments serving multiple concurrent users

Important

Route optimization is only available for custom model serving endpoints. Foundation Model APIs and external models do not support route optimization. OAuth tokens are required for authentication; personal access tokens are not supported.

See Route optimization on serving endpoints for setup instructions and Query route-optimized serving endpoints for querying details.
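
As a minimal querying sketch (see the linked pages for the authoritative setup and query instructions), the example below sends a request to a route-optimized endpoint with an OAuth bearer token. The endpoint URL, token handling, and input schema are placeholders and assumptions to adapt to your workspace.

```python
import requests

# Placeholders: substitute your route-optimized endpoint's invocation URL and an
# OAuth token obtained through your workspace's OAuth flow (personal access
# tokens are not supported on route-optimized endpoints).
ENDPOINT_URL = "https://<route-optimized-endpoint-url>/invocations"
OAUTH_TOKEN = "<oauth-token>"

def query_endpoint(records: list[dict]) -> dict:
    """Send a batch of records to the endpoint and return the JSON response."""
    response = requests.post(
        ENDPOINT_URL,
        headers={
            "Authorization": f"Bearer {OAUTH_TOKEN}",
            "Content-Type": "application/json",
        },
        # Assumes a dataframe-style input schema; adjust to your model's signature.
        json={"dataframe_records": records},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

print(query_endpoint([{"feature_1": 1.0, "feature_2": "a"}]))
```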

Provisioned concurrency

Provisioned concurrency controls how many simultaneous requests your endpoint can process. Configure provisioned concurrency based on your expected QPS and latency requirements.

Configuration guidelines:

  • Minimum concurrency: Set high enough to handle baseline traffic without queuing
  • Maximum concurrency: Set high enough to accommodate traffic spikes while controlling costs
  • Autoscaling: Enable autoscaling to dynamically adjust capacity based on demand

Calculate required concurrency:

Required Concurrency = Target QPS × Average Latency (seconds)

For example, if your target is 100 QPS with 200ms average latency:

Required Concurrency = 100 × 0.2 = 20

Use load testing to measure actual latency and determine optimal concurrency settings.
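
The calculation above is easy to script when planning capacity. A minimal sketch using the example numbers from this section; replace the latency assumption with values measured in your own load tests, and treat the burst buffer as illustrative rather than a Databricks recommendation:

```python
import math

def required_concurrency(target_qps: float, avg_latency_seconds: float) -> int:
    """Estimate provisioned concurrency needed to sustain target_qps without queuing."""
    return math.ceil(target_qps * avg_latency_seconds)

# Example from above: 100 QPS at 200 ms average latency.
baseline = required_concurrency(100, 0.2)
print(baseline)                        # 20

# Add headroom for bursts, e.g. a 50% buffer (illustrative).
print(math.ceil(baseline * 1.5))       # 30
```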

Instance types

Choose instance types based on your model's compute requirements:

| Instance type | Best for | Trade-offs |
| --- | --- | --- |
| CPU (Small, Medium, Large) | Lightweight models, simple inference logic | Lower cost, slower for compute-intensive models |
| GPU (Small, Medium, Large) | Large models, complex computations, image/video processing | Higher cost, optimal performance for deep learning |

Tip

Start with CPU instances for development and testing. Switch to GPU instances only if you observe high inference latency or your model requires specialized compute (such as deep learning operations).
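
If you manage endpoints programmatically, the workload type and size are set on the served entity. A minimal sketch using the Databricks SDK; the class and field names (`ServedEntityInput`, `workload_type`, `workload_size`) follow the SDK's serving API but should be verified against your installed version, and the endpoint and model names are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

w.serving_endpoints.create(
    name="my-endpoint",                              # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.my_model",  # placeholder registered model
                entity_version="1",
                workload_type="CPU",      # switch to a GPU type only if CPU latency is too high
                workload_size="Small",    # Small/Medium/Large controls provisioned concurrency
                scale_to_zero_enabled=False,
            )
        ]
    ),
)
```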

Model optimizations

Model optimizations improve inference speed and resource efficiency.

Model size and complexity

Smaller, less complex models generally lead to faster inference times and higher QPS. Consider techniques like model quantization or pruning if your model is large.

Batching

If your application can send multiple requests in a single call, enable batching on the client side. This can significantly reduce per-prediction overhead.
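
For example, a client can group records and score them in one request instead of issuing one HTTP call per record. A minimal sketch, assuming a dataframe-style input schema and the standard `predictions` response field; the URL and token are placeholders:

```python
import requests

ENDPOINT_URL = "https://<endpoint-url>/invocations"  # placeholder
TOKEN = "<token>"                                    # placeholder

def score_batch(records: list[dict]) -> list:
    """Score many records with one request instead of one request per record."""
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dataframe_records": records},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["predictions"]

# One call for 100 records rather than 100 calls.
records = [{"feature_1": i, "feature_2": "x"} for i in range(100)]
predictions = score_batch(records)
```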

Pre-processing and post-processing optimization

Offload complex pre-processing and post-processing from serving endpoints to reduce load on inference infrastructure.

Client-side optimizations

Client-side optimizations improve how applications interact with serving endpoints.

Connection pooling

Connection pooling reuses existing connections instead of creating new connections for each request, significantly reducing overhead.

  • Use the Databricks SDK, which automatically implements connection pooling best practices.
  • If you use a custom client, implement connection pooling yourself, as sketched below.
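
For a custom client built on `requests`, a shared `Session` with a sized connection pool keeps connections alive across requests instead of re-establishing TCP/TLS for each call. A minimal sketch; the pool sizes are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter

# Reuse one Session for the lifetime of the client process; it keeps
# connections alive and reuses them across requests.
session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=16,   # number of host connection pools to cache
    pool_maxsize=64,       # max connections kept alive per pool
)
session.mount("https://", adapter)

def predict(url: str, token: str, payload: dict) -> dict:
    response = session.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```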

Error handling and retry strategies

Implement robust error handling to gracefully handle temporary failures, especially during autoscaling events or network disruptions.
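
A minimal sketch of such a retry policy, treating HTTP 429, transient 5xx responses, and dropped connections as retryable; the retry count and backoff schedule are illustrative:

```python
import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def post_with_retries(url: str, token: str, payload: dict, max_retries: int = 5) -> dict:
    """POST to a serving endpoint, retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(
                url,
                headers={"Authorization": f"Bearer {token}"},
                json=payload,
                timeout=30,
            )
            if response.status_code not in RETRYABLE_STATUS:
                response.raise_for_status()  # non-retryable errors propagate immediately
                return response.json()
        except requests.ConnectionError:
            pass  # treat dropped connections (e.g. during scaling events) as retryable
        if attempt == max_retries:
            raise RuntimeError(f"Request failed after {max_retries} retries")
        # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of randomness.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError("unreachable")
```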

Payload size optimization

Minimize request and response payload sizes to reduce network transfer time and improve throughput.

Measure and improve performance

Performance monitoring

Monitor endpoint performance using the tools provided by Mosaic AI Model Serving:

| Metric | What it measures | Target | Action if exceeded |
| --- | --- | --- | --- |
| Latency (P50, P90, P99) | Response time for requests | Application-dependent (typically <100-500ms) | Check for queuing, optimize model or client |
| Throughput (QPS) | Requests completed per second | Workload-dependent | Enable route optimization, increase provisioned concurrency |
| Error rate | Percentage of failed requests | <1% | Review service logs, check for capacity issues |
| Queue depth | Requests waiting for processing | 0 (no queuing) | Increase provisioned concurrency or enable autoscaling |
| CPU/Memory usage | Resource utilization | <80% | Scale up instance type or increase concurrency |

See Monitor model quality and endpoint health for detailed monitoring guidance and Track and export serving endpoint health metrics to Prometheus and Datadog for exporting metrics to observability tools.
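
As a rough sketch of pulling these metrics programmatically, the request below assumes the per-endpoint metrics export route described in the linked guide; the exact path and the Prometheus text format of the response are assumptions to verify there, and the workspace URL, endpoint name, and token are placeholders.

```python
import requests

WORKSPACE_URL = "https://<workspace-url>"   # placeholder
ENDPOINT_NAME = "<endpoint-name>"           # placeholder
TOKEN = "<token>"                           # placeholder

metrics = requests.get(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints/{ENDPOINT_NAME}/metrics",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
metrics.raise_for_status()
# Prometheus/OpenMetrics-style text to scrape or forward to your observability tool.
print(metrics.text)
```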

Load testing

Load testing measures endpoint performance under realistic traffic conditions and helps you:

  • Determine optimal provisioned concurrency settings
  • Identify performance bottlenecks
  • Validate latency and throughput requirements
  • Understand the relationship between client concurrency and server concurrency

See Load testing for serving endpoints.
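
A minimal load-testing sketch (the dedicated guide linked above covers this in depth): it issues a fixed number of requests at a chosen client concurrency and reports approximate throughput and latency percentiles. The `query` callable is assumed to be something like the request functions sketched earlier on this page.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def timed_request(query, payload) -> float:
    """Run one request through `query` and return its latency in seconds."""
    start = time.perf_counter()
    query(payload)
    return time.perf_counter() - start

def load_test(query, payload, total_requests: int = 500, concurrency: int = 20) -> None:
    """Fire total_requests requests with `concurrency` parallel workers and print stats."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_request(query, payload), range(total_requests)))
    percentiles = statistics.quantiles(latencies, n=100)
    # Approximate throughput: assumes the workers stayed saturated for the whole run.
    print(f"throughput ~ {total_requests * concurrency / sum(latencies):.1f} req/s")
    print(f"p50={percentiles[49]*1000:.0f} ms  "
          f"p90={percentiles[89]*1000:.0f} ms  "
          f"p99={percentiles[98]*1000:.0f} ms")
```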

Troubleshoot common performance issues

Queuing

Model Serving supports autoscaling to adjust capacity based on traffic patterns. However, sudden traffic surges can cause queuing because autoscaling requires time to detect increased load and provision additional capacity. During this period, incoming requests may temporarily exceed available capacity, causing requests to queue.

Queuing occurs when the request rate or concurrency surpasses the endpoint's current processing capacity. This typically happens during sharp traffic spikes, workload bursts, or when the endpoint has insufficient provisioned concurrency. Model Serving endpoints allow temporary queuing to handle bursts, but beyond a defined threshold, the endpoint returns HTTP 429 (Too Many Requests) errors to protect system stability.

Queuing increases latency because queued requests wait before being processed. To minimize queuing:

  • Set minimum provisioned concurrency high enough to handle baseline traffic plus typical bursts
  • Enable route optimization for higher capacity limits
  • Implement retry logic with exponential backoff in your client applications

External API bottlenecks

Models often call external APIs for data enrichment, feature retrieval, or other tasks during inference. These external dependencies can become performance bottlenecks:

  • Latency: Measure the response time of each external API call. High latency in these calls directly increases overall serving latency and reduces throughput.
  • Throughput limits: External APIs may impose rate limits or capacity constraints. Exceeding these limits can cause throttling, errors, and performance degradation.
  • Error rates: Frequent errors from external APIs can trigger retries and increase load on your serving endpoint.
  • Caching: Implement caching for frequently accessed data from external APIs to reduce the number of calls and improve response times (see the sketch below).

Monitor these factors to identify bottlenecks and implement targeted optimizations for high-throughput workloads.
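
A minimal sketch of the caching approach mentioned above, using a simple in-process TTL cache around an external feature lookup; `fetch_features` and the 60-second TTL are illustrative placeholders:

```python
import time

_CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 60  # how long cached entries stay fresh (illustrative)

def fetch_features(entity_id: str) -> dict:
    """Placeholder for a slow external API call made during inference."""
    raise NotImplementedError

def get_features_cached(entity_id: str) -> dict:
    """Return cached features if still fresh; otherwise call the external API and cache the result."""
    now = time.monotonic()
    hit = _CACHE.get(entity_id)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]
    features = fetch_features(entity_id)
    _CACHE[entity_id] = (now, features)
    return features
```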

Additional resources