Configure AI Gateway on model serving endpoints

In this article, you learn how to configure Mosaic AI Gateway on a model serving endpoint.

Requirements

A Databricks workspace in a region where model serving is supported. See Model serving features availability.
A model serving endpoint. You can use one of the preconfigured pay-per-token endpoints on your workspace or do the following:
- To create an endpoint for external models, complete steps 1 and 2 of Create an external model serving endpoint.
- To create an endpoint for provisioned throughput, see Provisioned throughput Foundation Model APIs.
- To create an endpoint for a custom model, see Create an endpoint.

Configure AI Gateway using the UI

In the AI Gateway section of the endpoint creation page, you can individually configure AI Gateway features. See Supported features for which features are available on external model serving endpoints and provisioned throughput endpoints.

Configure AI Gateway features

The following table summarizes how to configure AI Gateway during endpoint creation using the Serving UI. If you prefer to do this programmatically, see the Notebook example.

Feature	How to enable	Details
Usage tracking	Select Enable usage tracking to enable tracking and monitoring of data usage metrics.	You must have Unity Catalog enabled. Account admins must enable the serving system table schema before using the system tables: `system.serving.endpoint_usage`, which captures token counts for each request to the endpoint. `system.serving.served_entities`, which stores metadata for each foundation model. See Usage tracking table schemas Only account admins have permission to view or query the `served_entities` table or `endpoint_usage` table, even though the user that manages the endpoint must enable usage tracking. See Grant access to system tables. The input and output token count are estimated as (`text_length`+1)/4 if the token count is not returned by the model.
Payload logging	Select Enable inference tables to automatically log requests and responses from your endpoint into Delta tables managed by Unity Catalog.	You must have Unity Catalog enabled and `CREATE TABLE` access in the specified catalog schema. Inference tables enabled by AI Gateway have a different schema than legacy inference tables which are only supported on endpoints that serve custom models. See AI Gateway-enabled inference table schema. Payload logging data populates these tables less than an hour after querying the endpoint. See Limitations for latency expectations for custom model serving endpoints. Payloads larger than 1 MiB are not logged. The response payload aggregates the response of all of the returned chunks. Streaming is supported. In streaming scenarios, the response payload aggregates the response of returned chunks. Inference tables for route optimized model serving endpoints are in Public Preview.
AI Guardrails	See Configure AI Guardrails in the UI.	Guardrails prevent the model from interacting with unsafe and harmful content that is detected in model inputs and outputs. Output guardrails are not supported for embeddings models or for streaming.
Rate limits	Select Rate limits to manage and specify the number of queries per minute (QPM) or tokens per minute (TPM) that your endpoint can support. Rate limits only apply to users who have permission to query the endpoint. You can define query-based and token-based rate limits at different levels: Use the Endpoint field to specify the maximum QPM or TPM that the entire endpoint can handle. This limit applies to all traffic, regardless of the user. Use the User (Default) field to set a default per-user rate limit that applies to all users of the endpoint, unless a more specific, custom rate limit is defined. You can specify custom rate limits for: Individual users or service principals. These take priority over user group custom rate limits. User groups. This limit is a shared rate limit for all members of the group.	TPM rate limits cannot be applied to serving endpoints that serve custom models or agents. By default, there are no rate limits configured for users or the endpoint. A maximum of 20 rate limits and up to 5 group-specific rate limits can be specified on an endpoint. The endpoint rate limit is a global maximum. If this limit is exceeded, all requests to the endpoint are blocked, regardless of any user-specific or group-specific rate limits. If an endpoint, user, or service principal have both a query-based rate limit and token-based rate limit specified, the more restrictive rate limit is enforced. Custom rate limits override the User (Default) rate limit. If a user belongs to both a user-specific limit and a group-specific limit, the user-specific limit is enforced. If a user belongs to multiple user groups with different QPM or TPM rate limits, then the user is rate limited if they exceed all of the QPM rate limits or all of the TPM rate limits of their user groups.
Traffic splitting	In the Served entities section, specify the percentage of traffic you want to be routed to specific models. To configure traffic splitting on your endpoint programmatically, see Serve multiple external models to an endpoint.	To route all traffic to a specific model, set it to 100%. If you want to specify a fallback-only model, add that model to the endpoint and set its percentage of traffic to 0%. To load balance traffic across models and set up fallbacks, you can expect the following behavior: Requests are randomly split across the entities based on the assigned traffic percentages. If the request hits the first entity and fails, it falls back to the next entity in the order that the served entities were listed during endpoint creation or the most recent endpoint update. The traffic split does not influence the order of fallback attempts.
Fallbacks	Select Enable fallbacks in the AI Gateway section to send your request to other served models on the endpoint as a fallback.	If the initial request routed to a certain entity returns a `429` or `5XX` error, the request falls back to the next entity listed on the endpoint. The order in which requests are redirected to fallback served entities is based on the order the models are listed during endpoint creation or the most recent endpoint update. Traffic percentage does not influence the order of fallback attempts sent to served entities. Fallbacks are only supported for external models. You must assign traffic percentages to other models served on the endpoint before you can enable fallbacks to external models. Any external model assigned 0% traffic functions exclusively as a fallback model. You can have a maximum of two fallbacks. Each entity is tried once in sequential order until the request succeeds. If all listed entities have been tried without success the request fails. The first successful or last failed request attempt and response are logged in both the usage tracking and payload logging tables.

The following diagram shows a fallbacks example where,

Three served entities are served on a model serving endpoint.
The request is originally routed to Served entity 3.
If the request returns a 200 response, the request was successful on Served entity 3 and the request and its response are logged to the usage tracking and payload logging tables of the endpoint.
If the request returns a 429 or 5xx error on Served entity 3, the request falls back to the next served entity on the endpoint, Served entity 1.
- If the request returns a 429 or 5xx error on Served entity 1, the request falls back to the next served entity on the endpoint, Served entity 2.
- If the request returns a 429 or 5xx error on Served entity 2, the request fails since this is the maximum number of fall back entities. The failed request and the response error are logged to the usage tracking and payload logging tables.

Fallback diagram example

Configure AI Guardrails in the UI

Preview

This feature is in Public Preview.

The following table shows how to configure supported guardrails.

Guardrail	How to enable
Safety	Select Safety to enable safeguards to prevent your model from interacting with unsafe and harmful content.
Personally identifiable information (PII) detection	Select to Block or Mask PII data such as names, addresses, credit card numbers if such information is detected in endpoint requests and responses. Otherwise, select None for no PII detection to occur.

Configure AI Guardrail features

Usage tracking table schemas

The following sections summarize the usage tracking table schemas for the system.serving.served_entities and system.serving.endpoint_usage system tables.

`system.serving.served_entities` usage tracking table schema

The system.serving.served_entities usage tracking system table has the following schema:

Column name	Description	Type
`served_entity_id`	The unique ID of the served entity.	STRING
`account_id`	The customer account ID for Delta Sharing.	STRING
`workspace_id`	The customer workspace ID of the serving endpoint.	STRING
`created_by`	The ID of the creator. For pay-per-token endpoints, this is `System-User`	STRING
`endpoint_name`	The name of the serving endpoint.	STRING
`endpoint_id`	The unique ID of the serving endpoint.	STRING
`served_entity_name`	The name of the served entity.	STRING
`entity_type`	Type of the entity that is served. Can be `FEATURE_SPEC`, `EXTERNAL_MODEL`, `FOUNDATION_MODEL`, or `CUSTOM_MODEL`	STRING
`entity_name`	The underlying name of the entity. Different from the `served_entity_name` which is a user provided name. For example, `entity_name` is the name of the Unity Catalog model.	STRING
`entity_version`	The version of the served entity.	STRING
`endpoint_config_version`	The version of the endpoint configuration.	INT
`task`	The task type. Can be `llm/v1/chat`, `llm/v1/completions`, or `llm/v1/embeddings`.	STRING
`external_model_config`	Configurations for external models. For example, `{Provider: OpenAI}`	STRUCT
`foundation_model_config`	Configurations for foundation models. For example,`{min_provisioned_throughput: 2200, max_provisioned_throughput: 4400}`	STRUCT
`custom_model_config`	Configurations for custom models. For example,`{ min_concurrency: 0, max_concurrency: 4, compute_type: CPU }`	STRUCT
`feature_spec_config`	Configurations for feature specifications. For example, `{ min_concurrency: 0, max_concurrency: 4, compute_type: CPU }`	STRUCT
`change_time`	Timestamp of change for the served entity.	TIMESTAMP
`endpoint_delete_time`	Timestamp of entity deletion. The endpoint is the container for the served entity. After the endpoint is deleted, the served entity is also deleted.	TIMESTAMP

`system.serving.endpoint_usage` usage tracking table schema

The system.serving.endpoint_usage usage tracking system table has the following schema:

Column name	Description	Type
`account_id`	The customer account ID.	STRING
`workspace_id`	The customer workspace id of the serving endpoint.	STRING
`client_request_id`	The user provided request identifier that can be specified in the model serving request body. For custom model endpoints, this is not supported for requests larger than 4MiB.	STRING
`databricks_request_id`	A Databricks generated request identifier attached to all model serving requests.	STRING
`requester`	The ID of the user or service principal whose permissions are used for the invocation request of the serving endpoint.	STRING
`status_code`	The HTTP status code that was returned from the model.	INTEGER
`request_time`	The timestamp at which the request is received.	TIMESTAMP
`input_token_count`	The token count of the input. This will be 0 for custom model requests.	LONG
`output_token_count`	The token count of the output. This will be 0 for custom model requests.	LONG
`input_character_count`	The character count of the input string or prompt. This will be 0 for custom model requests.	LONG
`output_character_count`	The character count of the output string of the response. This will be 0 for custom model requests.	LONG
`usage_context`	The user provided map containing identifiers of the end user or the customer application that makes the call to the endpoint. See Further define usage with `usage_context`. For custom model endpoints, this is not supported for requests larger than 4MiB.	MAP
`request_streaming`	Whether the request is in stream mode.	BOOLEAN
`served_entity_id`	The unique ID used to join with the `system.serving.served_entities` dimension table to lookup information about the endpoint and served entity.	STRING

Further define usage with `usage_context`

When you query an external model with usage tracking enabled, you can provide the usage_context parameter with type Map[String, String]. The usage context mapping appears in the usage tracking table in the usage_context column. The usage_context map size cannot exceed 10 KiB.

Bash
{
  "messages": [
    {
      "role": "user",
      "content": "What is Databricks?"
    }
  ],
  "max_tokens": 128,
  "usage_context":
    {
      "use_case": "external",
      "project": "project1",
      "priority": "high",
      "end_user_to_charge": "abcde12345",
      "a_b_test_group": "group_a"
    }
}

If you're using the OpenAI Python client, you can specify the usage_context by including it in the extra_body parameter.

Python
from openai import OpenAI

client = OpenAI(
    api_key="dapi-your-databricks-token",
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

response = client.chat.completions.create(
    model="databricks-claude-sonnet-4-5",
    messages=[{"role": "user", "content": "What is Databricks?"}],
    temperature=0,
    extra_body={"usage_context": {"project": "project1"}},
)
answer = response.choices[0].message.content
print("Answer:", answer)

Account admins can aggregate different rows based on the usage context to get insights and can join this information with the information in the payload logging table. For example, you can add end_user_to_charge to the usage_context for tracking cost attribution for end users.

Monitor endpoint usage

To monitor endpoint usage, you can join the system tables and inference tables for your endpoint.

Join system tables

This example applies to external, provisioned throughput, pay-per-token, and custom model endpoints.

To join the endpoint_usage and served_entities system tables, use the following SQL:

SQL
SELECT * FROM system.serving.endpoint_usage as eu
JOIN system.serving.served_entities as se
ON eu.served_entity_id = se.served_entity_id
WHERE created_by = "\<user_email\>";

Update AI Gateway features on endpoints

You can update AI Gateway features on model serving endpoints that had them previously enabled and endpoints that did not. Updates to AI Gateway configurations take about 20-40 seconds to be applied, however rate limiting updates can take up to 60 seconds.

The following shows how to update AI Gateway features on a model serving endpoint using the Serving UI.

In the Gateway section of the endpoint page, you can see which features are enabled. To update these features, click Edit AI Gateway.

Update AI Gateway features

Notebook example

The following notebook shows how to programmatically enable and use Databricks Mosaic AI Gateway features to manage and govern models from providers. See the PUT /api/2.0/serving-endpoints/{name}/ai-gateway for REST API details.

Enable Databricks Mosaic AI Gateway features notebook

Open notebook in new tab

Requirements​

Configure AI Gateway using the UI​

Configure AI Guardrails in the UI​

Usage tracking table schemas​

system.serving.served_entities usage tracking table schema​

system.serving.endpoint_usage usage tracking table schema​

Further define usage with usage_context​

Monitor endpoint usage​

Join system tables​

Update AI Gateway features on endpoints​

Notebook example​