Configure route optimization on serving endpoints

This article describes how to configure route optimization on your model serving or feature serving endpoints and how to query them. Route optimized serving endpoints dramatically lower overhead latency and support substantially higher throughput on your endpoint.

Route optimization is recommended for high throughput or latency sensitive workloads.

Requirements

  • For route optimization on a model serving endpoint, see Requirements.

  • For route optimization on a feature serving endpoint, see Requirements.

Enable route optimization on a model serving endpoint

Specify the route_optimized parameter during model serving endpoint creation to configure your endpoint for route optimization. You can only specify this parameter during endpoint creation; you cannot update an existing endpoint to be route optimized.

POST /api/2.0/serving-endpoints

{
  "name": "my-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "ads1",
      "entity_version": "1",
      "workload_type": "CPU",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }]
  },
  "route_optimized": true
}
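The same creation request can be sent programmatically. The following sketch uses only the Python standard library; `build_model_payload` and `create_endpoint` are illustrative helper names, and the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables are assumed to hold your workspace URL and API token:

```python
import json
import os
import urllib.request

def build_model_payload(name: str, entity_name: str, entity_version: str) -> dict:
    """Assemble the endpoint-creation body; route_optimized must be set at creation time."""
    return {
        "name": name,
        "config": {
            "served_entities": [{
                "entity_name": entity_name,
                "entity_version": entity_version,
                "workload_type": "CPU",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }]
        },
        "route_optimized": True,  # cannot be changed after the endpoint exists
    }

def create_endpoint(payload: dict) -> bytes:
    # DATABRICKS_HOST and DATABRICKS_TOKEN are assumed environment variables.
    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]
    req = urllib.request.Request(
        f"{host}/api/2.0/serving-endpoints",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because route_optimized cannot be updated later, the helper sets it unconditionally when the payload is built.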

If you prefer to use Python, you can create a route optimized serving endpoint using the following notebook.

Create a route optimized serving endpoint using Python notebook

Open notebook in new tab

Enable route optimization on a feature serving endpoint

To use route optimization for Feature and Function Serving, specify the full name of the feature specification in the entity_name field of the serving endpoint creation request. The entity_version field is not needed for FeatureSpecs.

POST /api/2.0/serving-endpoints

{
  "name": "my-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "catalog_name.schema_name.feature_spec_name",
      "workload_type": "CPU",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }]
  },
  "route_optimized": true
}
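A small Python sketch of the same request body; `build_feature_payload` is a hypothetical helper that checks for the full three-level Unity Catalog name and omits entity_version, which FeatureSpecs do not use:

```python
def build_feature_payload(name: str, feature_spec_name: str) -> dict:
    """Assemble a creation body for a FeatureSpec-backed, route optimized endpoint."""
    # FeatureSpecs are addressed by their full three-level Unity Catalog name.
    if feature_spec_name.count(".") != 2:
        raise ValueError("expected catalog_name.schema_name.feature_spec_name")
    return {
        "name": name,
        "config": {
            "served_entities": [{
                # No entity_version: it is not needed for FeatureSpecs.
                "entity_name": feature_spec_name,
                "workload_type": "CPU",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }]
        },
        "route_optimized": True,
    }
```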

Query route optimized model serving endpoints

The following steps show how to send a test query to a route optimized model serving endpoint.

For production use, like using your route optimized endpoint in an application, you must create an OAuth token. To fetch an OAuth token programmatically, you can follow the guidance in OAuth machine-to-machine (M2M) authentication.
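As a rough sketch of that M2M flow, assuming the workspace token endpoint is `<host>/oidc/v1/token` and that you have a service principal's client ID and secret (see the OAuth machine-to-machine guidance for the authoritative details); `build_token_request` and `fetch_oauth_token` are illustrative names:

```python
import base64
import json
import urllib.parse
import urllib.request

def build_token_request(host: str, client_id: str, client_secret: str):
    """Assemble the client-credentials exchange (assumed endpoint: <host>/oidc/v1/token)."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "scope": "all-apis",  # request a token usable against workspace APIs
    }).encode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return f"{host}/oidc/v1/token", body, headers

def fetch_oauth_token(host: str, client_id: str, client_secret: str) -> str:
    """POST the exchange and return the short-lived access token."""
    url, body, headers = build_token_request(host, client_id, client_secret)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["access_token"]
```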

  1. Fetch an OAuth token from the Serving UI of your workspace.

    1. Click Serving in the sidebar to display the Serving UI.

    2. On the Serving endpoints page, select your route optimized endpoint to see endpoint details.

    3. On the endpoint details page, click the Query endpoint button.

    4. Select the Fetch Token tab.

    5. Click the Fetch OAuth Token button. This token is valid for one hour; fetch a new token if your current token expires.

  2. Get your model serving endpoint URL from the endpoint details page in the Serving UI.

  3. Use the OAuth token from step 1 and the endpoint URL from step 2 to populate the following example code that queries the route optimized endpoint.

url="your-endpoint-url"
OAUTH_TOKEN="xxxxxxx"

curl -X POST -H 'Content-Type: application/json' -H "Authorization: Bearer $OAUTH_TOKEN" -d @data.json "$url"
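The same request can be made from Python with only the standard library; `build_query_request` and `query_endpoint` are illustrative helpers, and the payload argument stands in for the contents of `data.json`:

```python
import json
import urllib.request

def build_query_request(url: str, oauth_token: str, payload: dict) -> urllib.request.Request:
    """Mirror the curl call above: a JSON body plus a Bearer authorization header."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {oauth_token}",
        },
        method="POST",
    )

def query_endpoint(url: str, oauth_token: str, payload: dict) -> dict:
    """Send the query and decode the JSON response."""
    with urllib.request.urlopen(build_query_request(url, oauth_token, payload)) as resp:
        return json.loads(resp.read())
```

Remember that the OAuth token expires after one hour, so a long-running application should refresh it rather than reuse one token indefinitely.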

For a Python SDK to query a route optimized endpoint, reach out to your Databricks account team.

Limitations

  • OAuth tokens are the only supported authentication for route optimization. Personal access tokens are not supported.

  • Route optimization does not enforce any network restrictions you might have configured in your Databricks workspace, such as IP access control lists or PrivateLink. Do not enable route optimization if you require that model serving traffic be bound by those controls. If you have such network requirements and still want to try route optimized model serving, reach out to your Databricks account team.