Configure AI Gateway on model serving endpoints
In this article, you learn how to configure Mosaic AI Gateway on a model serving endpoint.
Requirements
A Databricks workspace in a region that supports external models.
Complete steps 1 and 2 of Create an external model serving endpoint.
Configure AI Gateway using the UI
This section shows how to configure AI Gateway during endpoint creation using the Serving UI.
If you prefer to do this programmatically, see the Notebook example.
In the AI Gateway section of the endpoint creation page, you can individually configure the following AI Gateway features:
| Feature | How to enable | Details |
|---|---|---|
| Usage tracking | Select Enable usage tracking to enable tracking and monitoring of data usage metrics. | |
| Payload logging | Select Enable inference tables to automatically log requests and responses from your endpoint into Delta tables managed by Unity Catalog. | |
| Rate limits | You can enforce request rate limits to manage traffic for your endpoint on a per-user and per-endpoint basis. | |
| Traffic routing | To configure traffic routing on your endpoint, see Serve multiple external models to an endpoint. | |
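For readers who script these settings instead of using the UI, the following is a minimal sketch that assembles the usage tracking, payload logging, and rate limit options into a plain Python dictionary. The helper `build_ai_gateway_config` is hypothetical, and the field names (`usage_tracking_config`, `inference_table_config`, `rate_limits`, `renewal_period`, and so on) are assumptions that this article does not confirm. Check them against the REST API reference before sending the payload to an endpoint.

```python
import json
from typing import Any, Dict, List, Optional


def build_ai_gateway_config(
    usage_tracking: bool = True,
    inference_table: Optional[Dict[str, str]] = None,
    per_user_calls: Optional[int] = None,
    per_endpoint_calls: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a hypothetical AI Gateway configuration payload.

    All field names are assumptions for illustration only.
    """
    config: Dict[str, Any] = {"usage_tracking_config": {"enabled": usage_tracking}}

    # Payload logging: where the inference tables should be written.
    if inference_table is not None:
        config["inference_table_config"] = {"enabled": True, **inference_table}

    # Rate limits can be set per user and per endpoint.
    rate_limits: List[Dict[str, Any]] = []
    for key, calls in (("user", per_user_calls), ("endpoint", per_endpoint_calls)):
        if calls is None:
            continue
        if calls <= 0:
            raise ValueError(f"Rate limit for '{key}' must be a positive integer")
        rate_limits.append({"key": key, "calls": calls, "renewal_period": "minute"})
    if rate_limits:
        config["rate_limits"] = rate_limits

    return config


# Placeholder catalog and schema names; replace with your own.
config = build_ai_gateway_config(
    inference_table={"catalog_name": "main", "schema_name": "default"},
    per_user_calls=10,
    per_endpoint_calls=100,
)
print(json.dumps(config, indent=2))
```

Building the payload in one function keeps the validation (for example, rejecting non-positive limits) separate from whatever client you use to submit it.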
Configure AI Guardrails in the UI
The following table shows how to configure supported guardrails.
| Guardrail | How to enable | Details |
|---|---|---|
| Safety | Select Safety to enable safeguards to prevent your model from interacting with unsafe and harmful content. | |
| Personally identifiable information (PII) detection | Select PII detection to detect PII data such as names, addresses, and credit card numbers. | |
| Valid topics | You can type topics directly into this field. If you have multiple entries, be sure to press Enter after each topic. Alternatively, you can upload a | A maximum of 50 valid topics can be specified. Each topic cannot exceed 100 characters. |
| Invalid keywords | You can type topics directly into this field. If you have multiple entries, be sure to press Enter after each topic. Alternatively, you can upload a | A maximum of 50 invalid keywords can be specified. Each keyword cannot exceed 100 characters. |
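The limits on valid topics and invalid keywords (at most 50 entries, each at most 100 characters) can be checked before you enter or upload a list. The following is a minimal sketch of such a check. The function name `validate_guardrail_entries` is hypothetical, and trimming whitespace and dropping blank entries are assumptions made here, not documented behavior of AI Gateway.

```python
from typing import List

MAX_ENTRIES = 50        # maximum number of topics or keywords
MAX_ENTRY_LENGTH = 100  # maximum characters per topic or keyword


def validate_guardrail_entries(entries: List[str], kind: str = "valid topics") -> List[str]:
    """Return the cleaned entries, or raise ValueError if a limit is exceeded."""
    # Assumption: surrounding whitespace is not significant and blank entries are ignored.
    cleaned = [entry.strip() for entry in entries if entry.strip()]

    if len(cleaned) > MAX_ENTRIES:
        raise ValueError(f"Too many {kind}: {len(cleaned)} given, maximum is {MAX_ENTRIES}")

    too_long = [entry for entry in cleaned if len(entry) > MAX_ENTRY_LENGTH]
    if too_long:
        raise ValueError(
            f"{len(too_long)} of the {kind} exceed {MAX_ENTRY_LENGTH} characters"
        )
    return cleaned


topics = validate_guardrail_entries(["billing", " account setup ", ""])
print(topics)
```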
Usage tracking table schemas
The system.serving.served_entities usage tracking system table has the following schema:
| Column name | Description | Type |
|---|---|---|
| | The unique ID of the served entity. | STRING |
| | The customer account ID for Delta Sharing. | STRING |
| | The customer workspace ID of the serving endpoint. | STRING |
| | The ID of the creator. | STRING |
| | The name of the serving endpoint. | STRING |
| | The unique ID of the serving endpoint. | STRING |
| | The name of the served entity. | STRING |
| | Type of the entity that is served. Can be | STRING |
| | The underlying name of the entity. Different from the | STRING |
| | The version of the served entity. | STRING |
| | The version of the endpoint configuration. | INT |
| | The task type. Can be | STRING |
| | Configurations for external models. For example, | STRUCT |
| | Configurations for foundation models. For example, | STRUCT |
| | Configurations for custom models. For example, | STRUCT |
| | Configurations for feature specifications. For example, | STRUCT |
| | Timestamp of change for the served entity. | TIMESTAMP |
| | Timestamp of entity deletion. The endpoint is the container for the served entity. After the endpoint is deleted, the served entity is also deleted. | TIMESTAMP |
The system.serving.endpoint_usage usage tracking system table has the following schema:
| Column name | Description | Type |
|---|---|---|
| | The customer account ID. | STRING |
| | The customer workspace ID of the serving endpoint. | STRING |
| | The user-provided request identifier that can be specified in the model serving request body. | STRING |
| | A Databricks-generated request identifier attached to all model serving requests. | STRING |
| | The ID of the user or service principal whose permissions are used for the invocation request of the serving endpoint. | STRING |
| | The HTTP status code that was returned from the model. | INTEGER |
| | The timestamp at which the request is received. | TIMESTAMP |
| | The token count of the input. | LONG |
| | The token count of the output. | LONG |
| | The character count of the input string or prompt. | LONG |
| | The character count of the output string of the response. | LONG |
| usage_context | The user-provided map containing identifiers of the end user or the customer application that makes the call to the endpoint. See Further define usage with usage_context. | MAP |
| | Whether the request is in stream mode. | BOOLEAN |
| | The unique ID used to join with the | STRING |
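To illustrate how rows from the endpoint usage table might be rolled up, the following is a minimal sketch that sums input and output token counts per value of a usage context key, such as the end_user_to_charge key described in the next section. It works on rows already loaded as Python dictionaries. The column names `input_token_count` and `output_token_count` are assumptions (only `usage_context` is named in this article), and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def tokens_by_context_key(rows: Iterable[Dict[str, Any]], key: str) -> Dict[str, int]:
    """Sum input and output tokens per value of one usage_context key."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        # usage_context may be missing or null when the caller did not supply it.
        context = row.get("usage_context") or {}
        group = context.get(key, "unattributed")
        # Assumed column names for the token counts.
        totals[group] += (row.get("input_token_count") or 0) + (row.get("output_token_count") or 0)
    return dict(totals)


# Made-up rows standing in for query results.
sample_rows = [
    {"usage_context": {"end_user_to_charge": "abcde12345"}, "input_token_count": 12, "output_token_count": 100},
    {"usage_context": {"end_user_to_charge": "abcde12345"}, "input_token_count": 8, "output_token_count": 40},
    {"usage_context": None, "input_token_count": 5, "output_token_count": 20},
]
print(tokens_by_context_key(sample_rows, "end_user_to_charge"))
```

Requests without the chosen key are grouped under "unattributed" so that they remain visible in the totals instead of being dropped.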
Further define usage with usage_context
When you query an external model with usage tracking enabled, you can provide the usage_context parameter with type Map[String, String]. The usage context mapping appears in the usage tracking table in the usage_context column. Account admins can aggregate different rows based on the usage context to get insights and can join this information with the information in the payload logging table. For example, you can add end_user_to_charge to the usage_context for tracking cost attribution for end users.
{
  "messages": [
    {
      "role": "user",
      "content": "What is Databricks?"
    }
  ],
  "max_tokens": 128,
  "usage_context": {
    "use_case": "external",
    "project": "project1",
    "priority": "high",
    "end_user_to_charge": "abcde12345",
    "a_b_test_group": "group_a"
  }
}
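A request body like the one above can be assembled in code. Because usage_context has type Map[String, String], the following minimal sketch rejects keys or values that are not strings before serializing the body. The helper `build_chat_request` is hypothetical, and how you send the resulting JSON to your endpoint is outside the scope of this sketch.

```python
import json
from typing import Dict, Optional


def build_chat_request(
    prompt: str,
    max_tokens: int = 128,
    usage_context: Optional[Dict[str, str]] = None,
) -> str:
    """Return a JSON request body, optionally tagged with a usage_context map."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if usage_context:
        # usage_context is Map[String, String]: every key and value must be a string.
        for key, value in usage_context.items():
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError(f"usage_context entry {key!r}: {value!r} must be string to string")
        body["usage_context"] = dict(usage_context)
    return json.dumps(body)


request_body = build_chat_request(
    "What is Databricks?",
    usage_context={"project": "project1", "end_user_to_charge": "abcde12345"},
)
print(request_body)
```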
AI Gateway-enabled inference table schema
Inference tables enabled using AI Gateway have the following schema:
| Column name | Description | Type |
|---|---|---|
| | The UTC date on which the model serving request was received. | DATE |
| | A Databricks-generated request identifier attached to all model serving requests. | STRING |
| | An optional client-generated request identifier that can be specified in the model serving request body. | STRING |
| | The timestamp at which the request is received. | TIMESTAMP |
| | The HTTP status code that was returned from the model. | INT |
| | The sampling fraction used in the event that the request was down-sampled. This value is between 0 and 1, where 1 represents that 100% of incoming requests were included. | DOUBLE |
| | The time in milliseconds for which the model performed inference. This does not include overhead network latencies and only represents the time it took for the model to generate predictions. | BIGINT |
| | The raw request JSON body that was sent to the model serving endpoint. | STRING |
| | The raw response JSON body that was returned by the model serving endpoint. | STRING |
| | The unique ID of the served entity. | STRING |
| | | ARRAY |
| | The ID of the user or service principal whose permissions are used for the invocation request of the serving endpoint. | STRING |
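Two fields of this schema usually need some processing: the raw request body is stored as a JSON string, and the sampling fraction tells you how many incoming requests each logged row stands for. The following is a minimal sketch of both, operating on rows already loaded as Python dictionaries. The column names `request` and `sampling_fraction` are assumptions, the sample rows are made up, and weighting each row by the inverse of its sampling fraction is a standard estimate, not something this article prescribes.

```python
import json
from typing import Any, Dict, Iterable, List


def estimate_total_requests(rows: Iterable[Dict[str, Any]]) -> float:
    """Estimate incoming request volume by weighting each logged row by 1 / sampling fraction."""
    total = 0.0
    for row in rows:
        fraction = row.get("sampling_fraction")
        if fraction:  # skip rows with a missing or zero fraction
            total += 1.0 / fraction
    return total


def extract_user_prompts(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Parse the raw request JSON and collect the content of user messages."""
    prompts: List[str] = []
    for row in rows:
        try:
            body = json.loads(row.get("request"))
        except (TypeError, ValueError):
            continue  # not valid JSON, or no request body logged
        if not isinstance(body, dict):
            continue
        for message in body.get("messages", []):
            if isinstance(message, dict) and message.get("role") == "user":
                prompts.append(message.get("content", ""))
    return prompts


# Made-up rows standing in for query results.
sample_rows = [
    {"sampling_fraction": 0.5, "request": '{"messages": [{"role": "user", "content": "What is Databricks?"}]}'},
    {"sampling_fraction": 1.0, "request": "not json"},
]
print(estimate_total_requests(sample_rows))
print(extract_user_prompts(sample_rows))
```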
Update AI Gateway features on endpoints
You can update AI Gateway features on model serving endpoints whether or not the features were previously enabled. Updates to AI Gateway configurations take about 20 to 40 seconds to apply; rate limiting updates can take up to 60 seconds.
The following shows how to update AI Gateway features on a model serving endpoint using the Serving UI.
In the Gateway section of the endpoint page, you can see which features are enabled. To update these features, click Edit AI Gateway.
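Because configuration updates are not applied instantly, automated tests that run right after an update may need to wait. The following is a minimal sketch of a generic polling helper with a 60-second default timeout, chosen here to cover the longest delay mentioned above. The helper `wait_until` is hypothetical, and the `check` callable is something you supply, for example a function that sends a test request and inspects the result.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the timeout was reached.
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Trivial usage: a check that already passes returns immediately.
print(wait_until(lambda: True))
```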
Notebook example
The following notebook shows how to programmatically enable and use Databricks Mosaic AI Gateway features to manage and govern models from providers. See the following for REST API details: