Configure route optimization on serving endpoints

This article describes how to configure route optimization on your model serving or feature serving endpoints and how to query them. Route optimized serving endpoints lower overhead latency and increase the throughput that your endpoint can support.

Route optimization is recommended for high-throughput or latency-sensitive workloads.

Requirements

  • For route optimization on a model serving endpoint, see Requirements.

  • For route optimization on a feature serving endpoint, see Requirements.

Enable route optimization on a model serving endpoint

Specify the route_optimized parameter during model serving endpoint creation to configure your endpoint for route optimization. You can specify this parameter only during endpoint creation. You cannot update an existing endpoint to be route optimized.

POST /api/2.0/serving-endpoints

{
  "name": "my-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "ads1",
      "entity_version": "1",
      "workload_type": "CPU",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }]
  },
  "route_optimized": true
}
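The same request can be sketched in Python. This is a minimal illustration that uses only the standard library, not an official client. The function name build_create_endpoint_request, the host value, and the token are assumptions for illustration. The token is assumed to be one that your workspace accepts for REST API calls.

```python
import json
import urllib.request


def build_create_endpoint_request(host, token, name, entity_name, entity_version):
    """Build (but do not send) the POST request that creates the endpoint."""
    payload = {
        "name": name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,
                    "entity_version": entity_version,
                    "workload_type": "CPU",
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
            ]
        },
        # Must be set at creation time; existing endpoints cannot be converted.
        "route_optimized": True,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/serving-endpoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed workspace API token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To create the endpoint, pass your workspace URL and token to build_create_endpoint_request and hand the result to send. Building the request separately from sending it makes the payload easy to inspect before any network call is made.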

You can enable route optimization for an endpoint in the Serving UI. If you use Python, you can use the following notebook to create a route optimized serving endpoint.

Create a route optimized serving endpoint using Python notebook

Enable route optimization on a feature serving endpoint

To use route optimization for Feature and Function Serving, specify the full name of the feature specification in the entity_name field for serving endpoint creation requests. The entity_version is not needed for FeatureSpecs.

POST /api/2.0/serving-endpoints

{
  "name": "my-endpoint",
  "config": {
    "served_entities": [
      {
        "entity_name": "catalog_name.schema_name.feature_spec_name",
        "workload_type": "CPU",
        "workload_size": "Small",
        "scale_to_zero_enabled": true
      }
    ]
  },
  "route_optimized": true
}
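As a rough sketch, the payload above can be generated in Python. The helper name feature_serving_payload is hypothetical, and the check that the full name has three dot-separated parts is an assumption based on the example value catalog_name.schema_name.feature_spec_name.

```python
import json


def feature_serving_payload(endpoint_name, feature_spec_name):
    """Return the creation payload for a route optimized feature serving endpoint."""
    # Assumption: the full name looks like catalog.schema.feature_spec.
    parts = feature_spec_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected catalog.schema.feature_spec, got {feature_spec_name!r}"
        )
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    # No entity_version here: FeatureSpecs do not need one.
                    "entity_name": feature_spec_name,
                    "workload_type": "CPU",
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
            ]
        },
        "route_optimized": True,
    }


payload = feature_serving_payload(
    "my-endpoint", "catalog_name.schema_name.feature_spec_name"
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the body of the POST /api/2.0/serving-endpoints request.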

Query route optimized model serving endpoints

The following steps show how to send a test query to a route optimized model serving endpoint.

For production use, such as calling your route optimized endpoint from an application, you must create an OAuth token. The following steps show how to fetch a token in the Serving UI. For programmatic workflows, see Fetch an OAuth token programmatically.

  1. Fetch an OAuth token from the Serving UI of your workspace.

    1. Click Serving in the sidebar to display the Serving UI.

    2. On the Serving endpoints page, select your route optimized endpoint to see endpoint details.

    3. On the endpoint details page, click the Query endpoint button.

    4. Select the Fetch Token tab.

    5. Click the Fetch OAuth Token button. This token is valid for 1 hour. Fetch a new token if your current token expires.

  2. Get your model serving endpoint URL from the endpoint details page in the Serving UI.

  3. Use the OAuth token from step 1 and the endpoint URL from step 2 to populate the following example code that queries the route optimized endpoint.

url="your-endpoint-url"
OAUTH_TOKEN=xxxxxxx

curl -X POST -H 'Content-Type: application/json' -H "Authorization: Bearer $OAUTH_TOKEN" -d @data.json "$url"
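The curl call above can also be approximated with plain HTTP from Python. This is a sketch that uses only the standard library and is not a Databricks SDK. The function names are hypothetical, and the request payload is whatever your model expects, for example the contents of data.json.

```python
import json
import urllib.request


def build_query_request(endpoint_url, oauth_token, payload):
    """Build the POST request that queries a route optimized endpoint."""
    return urllib.request.Request(
        url=endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # OAuth token only; personal access tokens are not supported.
            "Authorization": f"Bearer {oauth_token}",
        },
        method="POST",
    )


def query_endpoint(endpoint_url, oauth_token, payload, timeout=30):
    """Send the query and return the decoded JSON response."""
    request = build_query_request(endpoint_url, oauth_token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, load data.json with json.load and pass the result as payload together with the endpoint URL from step 2 and the token from step 1. Because the UI token is valid for 1 hour, a long-running application needs to refresh it before it expires.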

For a Python SDK to query a route optimized endpoint, reach out to your Databricks account team.

Fetch an OAuth token programmatically

The article Use a service principal to authenticate with Databricks (OAuth M2M) explains how to fetch an OAuth token programmatically. In addition to those steps, you must specify authorization_details in the request.

  • Replace <token-endpoint-URL> with the token endpoint URL described in that article.

  • Replace <client-id> with the service principal’s client ID, which is also known as an application ID.

  • Replace <client-secret> with the service principal’s OAuth secret that you created.

  • Replace <endpoint-id> with the endpoint ID of the route optimized endpoint. You can find this in the hostName part of the endpoint URL.

  • Replace <action> with the action permission given to the service principal. The action can be query_inference_endpoint or manage_inference_endpoint.

For example:

      export CLIENT_ID=<client-id>
      export CLIENT_SECRET=<client-secret>
      export ENDPOINT_ID=<endpoint-id>
      export ACTION=<action>

      curl --request POST \
        --url <token-endpoint-URL> \
        --user "$CLIENT_ID:$CLIENT_SECRET" \
        --data 'grant_type=client_credentials&scope=all-apis' \
        --data-urlencode 'authorization_details=[{"type":"workspace_permission","object_type":"serving-endpoints","object_path":"'"/serving-endpoints/$ENDPOINT_ID"'","actions": ["'"$ACTION"'"]}]'
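The token request can be sketched in Python with the standard library. The function names are hypothetical. The code mirrors the curl flags above: --user becomes an HTTP Basic Authorization header, and authorization_details is sent as a URL-encoded JSON string. Reading access_token from the response is an assumption based on the standard OAuth 2.0 token response format.

```python
import base64
import json
import urllib.parse
import urllib.request

ALLOWED_ACTIONS = {"query_inference_endpoint", "manage_inference_endpoint"}


def build_token_request(token_endpoint_url, client_id, client_secret, endpoint_id, action):
    """Build the client-credentials token request scoped to one endpoint."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    authorization_details = [
        {
            "type": "workspace_permission",
            "object_type": "serving-endpoints",
            "object_path": f"/serving-endpoints/{endpoint_id}",
            "actions": [action],
        }
    ]
    form = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "scope": "all-apis",
            "authorization_details": json.dumps(authorization_details),
        }
    ).encode("ascii")
    # Equivalent of curl --user "$CLIENT_ID:$CLIENT_SECRET".
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=token_endpoint_url,
        data=form,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_oauth_token(token_endpoint_url, client_id, client_secret, endpoint_id, action):
    """Send the token request and return the token string."""
    request = build_token_request(
        token_endpoint_url, client_id, client_secret, endpoint_id, action
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # Assumption: standard OAuth 2.0 response containing "access_token".
        return json.load(response)["access_token"]
```

The returned token can then be used as the Bearer token when you query the route optimized endpoint.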

Limitations

  • Route optimization is only available for custom model serving endpoints and feature serving endpoints. Foundation Model APIs and External Models are not supported.

  • Databricks in-house OAuth tokens are the only supported authentication for route optimization. Personal access tokens are not supported.