Add a model serving endpoint resource to a Databricks app

Add model serving endpoints as Databricks Apps resources so your app can query machine learning models for inference. Model serving endpoints handle model predictions and provide a consistent interface to access deployed models.

Add a model serving endpoint resource

Before you add a model serving endpoint as an app resource, check that you meet the prerequisites.

  1. When you create or edit an app, in the App resources section, click + Add resource.
  2. Select Serving endpoint as the resource type.
  3. Choose a model serving endpoint from the available endpoints in your workspace.
  4. Select the appropriate permission level for your app:
    • Can view: View endpoint metadata, including model names, versions, and workload configuration. Cannot send inference requests.
    • Can query: Send inference requests and view metadata. Use this for most apps that need model predictions.
    • Can manage: Full administrative control including view, edit, query, delete, and manage permissions.
  5. (Optional) Specify a custom resource key, which is how you reference the model serving endpoint in your app configuration. The default key is serving-endpoint.
note

The model serving endpoint must be in a READY state to process inference requests from your app.
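
Your app can check the endpoint state before it starts sending requests. The following is a minimal sketch using the Databricks SDK for Python; it assumes the databricks-sdk package is available to the app, that WorkspaceClient() authenticates as the app's service principal, and that the endpoint name is exposed through the SERVING_ENDPOINT environment variable described in the next section.

import os

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointStateReady

# Endpoint name injected for the app resource; the variable name depends
# on how the resource is mapped in your app configuration.
endpoint_name = os.environ["SERVING_ENDPOINT"]

# Inside a Databricks app, WorkspaceClient() picks up the app's
# service principal credentials without extra configuration.
w = WorkspaceClient()

endpoint = w.serving_endpoints.get(name=endpoint_name)
if endpoint.state and endpoint.state.ready == EndpointStateReady.READY:
    print(f"{endpoint_name} is ready to accept inference requests")
else:
    print(f"{endpoint_name} is not ready: {endpoint.state}")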

Environment variables

When you deploy an app with a model serving endpoint resource, Databricks exposes the serving endpoint name to your app through an environment variable that you define in your app configuration by using the valueFrom field.

For example:

SERVING_ENDPOINT=<your-serving-endpoint-name>

For more information, see Use environment variables to access resources.
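
Your app code can then read the variable and use it to query the endpoint. The following is a minimal sketch using the Databricks SDK for Python; the payload format (dataframe_records) and the "text" column are placeholders and must match the signature of the model served by your endpoint.

import os

from databricks.sdk import WorkspaceClient

# Resolve the endpoint name that Databricks injected for the app resource.
endpoint_name = os.environ["SERVING_ENDPOINT"]

w = WorkspaceClient()

# Send a single-record inference request. The input shape must match the
# model's signature; "text" is an illustrative column name.
response = w.serving_endpoints.query(
    name=endpoint_name,
    dataframe_records=[{"text": "example input"}],
)
print(response.predictions)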

Remove a model serving endpoint resource

When you remove a model serving endpoint resource from an app, the app's service principal loses access to the endpoint. The model serving endpoint itself remains unchanged and continues to be available for other users and applications that have appropriate permissions.

Best practices

Consider the following when you work with model serving endpoint resources:

  • Grant minimal permissions. Use Can view if the app only needs endpoint metadata, and Can query for apps that send inference requests. Reserve Can manage for apps that specifically need to perform administrative tasks on the endpoint.
  • Avoid long-running queries when possible, as inference requests can time out.
  • Check the endpoint status before sending requests. Endpoints must be in READY state to process queries.
  • Consider rate limiting your inference requests to avoid overwhelming the endpoint, especially during high-traffic periods (see the sketch after this list).
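
As a sketch of the last two points, a small helper can retry transient query failures with exponential backoff instead of retrying immediately and adding load to a busy endpoint. The function below is illustrative rather than a library API; tune the attempt count, delays, and exception handling for your workload.

import time

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

def query_with_backoff(endpoint_name, records, max_attempts=4, base_delay=1.0):
    # Retry transient query failures with exponential backoff.
    # Narrow the except clause to the SDK or HTTP errors you actually expect.
    for attempt in range(1, max_attempts + 1):
        try:
            return w.serving_endpoints.query(
                name=endpoint_name,
                dataframe_records=records,
            )
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)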