Model serving with Serverless Real-Time Inference

Preview

This feature is in Public Preview.

This article describes model serving with Databricks Serverless Real-Time Inference, including its advantages and limitations compared to Classic MLflow model serving.

Serverless Real-Time Inference exposes your MLflow machine learning models as scalable REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account. Usage and storage are currently free of charge, but Databricks will provide notice before charging begins.
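Because the endpoint is a REST API, a scoring request is an ordinary HTTPS POST carrying a JSON payload and a bearer token. The following is a minimal sketch of building such a request; the workspace URL, token, model name, invocation path, and payload orientation shown here are illustrative assumptions, not the documented API — check your endpoint's details before sending.

```python
import json
import urllib.request

WORKSPACE_URL = "https://example.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "dapi-example-token"                            # hypothetical access token

# MLflow's pandas "split" orientation: column names plus rows of values.
payload = {"columns": ["feature_a", "feature_b"], "data": [[1.0, 2.0], [3.0, 4.0]]}

request = urllib.request.Request(
    url=f"{WORKSPACE_URL}/model/my-model/Production/invocations",  # illustrative path
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here since the
# endpoint above is a placeholder.
```

The same request can of course be issued with any HTTP client; only the JSON body, the `Content-Type` header, and the bearer token matter to the endpoint.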

Classic MLflow model serving uses a single-node cluster that runs under your own account within what is now called the Classic data plane. This data plane includes the virtual network and its associated compute resources, such as clusters for notebooks and jobs, classic SQL warehouses, and Classic model serving endpoints.

Why use Serverless Real-Time Inference?

Serverless Real-Time Inference offers:

  • Ability to launch an endpoint with one click: Databricks automatically prepares a production-ready environment for your model and offers serverless configuration options for compute.

  • High availability and scalability: Serverless Real-Time Inference is intended for production use and can support up to 3000 queries per second (QPS). Endpoints automatically scale up and down based on the volume of scoring requests.

  • Dashboards: Use the built-in Serverless Real-Time Inference dashboard to monitor the health of your model endpoints using metrics such as QPS, latency, and error rate.

  • Feature store integration: When your model is trained with features from Databricks Feature Store, the model is packaged with feature metadata. If you configure your online store, these features are incorporated in real-time as scoring requests are received.

Limitations

While this service is in preview, the following limits apply:

  • Payload size limit of 16 MB per request.

  • Default limit of 200 QPS of scoring requests per enrolled workspace. You can increase this limit up to 3000 QPS per workspace by reaching out to your Databricks support contact.

  • Best-effort support for latency overhead of less than 100 milliseconds and for availability.

Serverless Real-Time Inference endpoints are open to the internet for inbound traffic unless an IP allowlist is enabled in the workspace, in which case this list applies to the endpoints as well.
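The 16 MB payload limit above can be checked client-side before a request is sent, which gives a clearer error than a rejected call. A minimal sketch; the helper name and the check itself are ours, only the 16 MB figure comes from the limits list:

```python
import json

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # 16 MB per-request limit noted above


def payload_within_limit(payload: dict) -> bool:
    """Return True if the JSON-serialized payload fits the 16 MB request limit."""
    body = json.dumps(payload).encode("utf-8")
    return len(body) <= MAX_PAYLOAD_BYTES


small = {"columns": ["x"], "data": [[1.0]]}
print(payload_within_limit(small))  # a tiny payload is well under the limit
```

If a batch of records exceeds the limit, splitting it into several smaller requests is the usual workaround.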

Region availability

Serverless Real-Time Inference is available in the following AWS regions:

  • eu-west-1

  • eu-central-1

  • us-east-1

  • us-east-2

  • us-west-2

  • ca-central-1

  • ap-southeast-1

  • ap-southeast-2

Staging and production time expectations

Transitioning a model from staging to production takes time. Deploying a newly registered model version involves building a model container image and provisioning the model endpoint. This process can take ~5 minutes.

Databricks performs a “zero-downtime” update of /staging and /production endpoints by keeping the existing model deployment up until the new one becomes ready. Doing so ensures no interruption for model endpoints that are in use.

If model computation takes longer than 60 seconds, requests will time out. If you believe your model computation will take longer than 60 seconds, please reach out to your Databricks support contact.
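Given the 60-second server-side cutoff, clients often wrap scoring calls with their own timeout handling and a bounded retry, since a timed-out request may succeed on a later attempt. A sketch under stated assumptions: `send_request` is any callable that raises `TimeoutError` on a timed-out call, and the retry policy here is illustrative, not prescribed by the service.

```python
import time

SERVER_TIMEOUT_S = 60  # requests computing longer than this are timed out


def score_with_retry(send_request, max_attempts=3, backoff_s=1.0):
    """Call send_request(), retrying on TimeoutError with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except TimeoutError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)


# Fake sender that times out once, then succeeds, for illustration.
_attempts = iter(["timeout", {"predictions": [0.9]}])


def fake_send():
    result = next(_attempts)
    if result == "timeout":
        raise TimeoutError("scoring request exceeded 60 s")
    return result


print(score_with_retry(fake_send, backoff_s=0))
```

If every attempt times out, the final `TimeoutError` propagates, which is the signal to contact Databricks support about raising the computation limit as noted above.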

Prerequisites

Important

During public preview, you need to reach out to your Databricks support contact to enable Serverless Real-Time Inference on your workspace.

Before you can create Serverless Real-Time Inference endpoints, you must enable them on your workspace. See Enable Serverless Real-Time Inference endpoints for model serving.

After Serverless Real-Time Inference endpoints have been enabled on your workspace, you need the following permissions to create endpoints for model serving: