Configure Unity AI Gateway endpoints
This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Databricks previews.
This page describes how to configure Unity AI Gateway endpoints.
Requirements
- Unity AI Gateway preview enabled for your account. See Manage Databricks previews.
- A Databricks workspace in a Unity AI Gateway supported region.
- Unity Catalog enabled for your workspace. See Enable a workspace for Unity Catalog.
Create a Unity AI Gateway endpoint
To create a Unity AI Gateway endpoint:
- In the sidebar, click AI Gateway.
- Click Create Unity AI Gateway Endpoint.
- Configure your endpoint name and primary model.
- Click Create.
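If you prefer to script endpoint creation, the following sketch shows the general shape of such a call. The REST path (`/api/2.0/ai-gateway/endpoints`) and the payload fields (`name`, `primary_model`) are illustrative assumptions, not a documented API; check the AI Gateway API reference for the actual contract.

```python
# Hypothetical sketch: create a Unity AI Gateway endpoint over REST.
# The path and payload fields are assumptions for illustration only.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace-url>
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

resp = requests.post(
    f"{host}/api/2.0/ai-gateway/endpoints",  # assumed path
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "my-gateway-endpoint",        # endpoint name (step 3)
        "primary_model": "my-primary-model",  # primary model (step 3)
    },
)
resp.raise_for_status()
print(resp.json())
```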
Configure features on an endpoint
You can update Unity AI Gateway endpoints to enable or disable individual features. Configuration updates can take up to one minute to take effect.
To update Unity AI Gateway features on an existing endpoint:
- On the AI Gateway page, click your endpoint.
- In the Gateway Endpoint Details sidebar, click the edit icon next to the feature you want to update.
- Make your changes and click Save.
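For automation, an update might look like the sketch below. Again, the `PATCH` path and the `inference_tables` payload field are assumptions chosen to mirror the UI, not a documented schema.

```python
# Hypothetical sketch: enable inference tables on an existing endpoint.
# Path and field names are assumptions; consult the API reference.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.patch(
    f"{host}/api/2.0/ai-gateway/endpoints/my-gateway-endpoint",  # assumed path
    headers={"Authorization": f"Bearer {token}"},
    json={"inference_tables": {"enabled": True}},  # assumed field
)
resp.raise_for_status()
# Note: configuration updates can take up to one minute to take effect.
```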

The following table summarizes the available Unity AI Gateway features and how to configure them:
| Feature | How to configure |
|---|---|
| Usage tracking | Enabled by default. |
| Inference tables | Select Enable inference tables to log requests and responses. |
| Rate limits | Select Rate limits to configure queries per minute (QPM) or tokens per minute (TPM). |
| Guardrails | Select Guardrails to configure content policies. |
| Fallbacks | Select Add fallback model to configure fallback models. |
| Traffic splitting | Select Add traffic split to distribute requests across multiple model backends. |
| Custom APIs | Select Custom API when creating an endpoint to connect to an external API. |
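To make the table concrete, a combined endpoint configuration might be expressed as a payload like the sketch below. All field names (`rate_limits`, `qpm`, `tpm`, `guardrails`, `fallbacks`, `traffic_split`) are illustrative assumptions chosen to mirror the UI options, not a documented schema.

```python
# Hypothetical sketch: one configuration payload covering several of
# the features in the table above. Field names are assumptions.
endpoint_config = {
    "name": "my-gateway-endpoint",
    "rate_limits": [
        {"key": "endpoint", "qpm": 100},  # queries per minute
        {"key": "user", "tpm": 50_000},   # tokens per minute, per user
    ],
    "guardrails": {"content_policies": ["pii"]},      # content policies
    "fallbacks": ["model-2", "model-3"],              # tried in order
    "traffic_split": {"model-1": 80, "model-2": 20},  # percent of requests
}
```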
The following diagram shows a fallbacks example where three models are registered as destinations of a Unity AI Gateway endpoint:
- The request is originally routed to Model 1.
- If the request returns a `200` response, the request was successful on Model 1, and the request and its response are logged to the usage tracking and inference tables.
- If the request returns a `429` or `5XX` error on Model 1, the request falls back to the next model on the endpoint, Model 2.
- If the request returns a `429` or `5XX` error on Model 2, the request falls back to the next model on the endpoint, Model 3.
- If the request returns a `429` or `5XX` error on Model 3, the request fails because all fallback models have been tried. The failed request and the response error are logged to the usage tracking and inference tables.
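The fallback behavior above amounts to a simple retry loop over the registered models. The sketch below is not gateway code; it is a client-side illustration of the semantics, using a hypothetical `call_model` helper that returns an HTTP status code and a response body.

```python
# Illustrative sketch of the fallback semantics described above.
# `call_model` is a hypothetical helper: model name -> (status, body).
from typing import Callable

def route_with_fallbacks(
    models: list[str],
    call_model: Callable[[str], tuple[int, str]],
) -> tuple[str, str]:
    """Try each model in order; fall back on 429 or 5XX errors."""
    last_error = None
    for model in models:  # e.g. ["model-1", "model-2", "model-3"]
        status, body = call_model(model)
        if status == 200:
            # Success: the request and response are logged to the
            # usage tracking and inference tables.
            return model, body
        if status == 429 or 500 <= status < 600:
            # Retriable error: fall back to the next model.
            last_error = (model, status)
            continue
        # Other errors are not retried in this sketch.
        raise RuntimeError(f"{model} failed with status {status}")
    # All fallback models tried: the request fails, and the error is
    # logged to the usage tracking and inference tables.
    raise RuntimeError(f"All models exhausted; last error: {last_error}")
```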
