Configure AI Gateway endpoints

Beta

This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Databricks previews.

This page describes how to configure AI Gateway (Beta) endpoints.

Requirements

Create an AI Gateway endpoint

To create an AI Gateway endpoint:

  1. In the sidebar, click AI Gateway.
  2. Click Create AI Gateway Endpoint.
  3. Configure your endpoint name and primary model.
  4. Click Create.

Configure features on an endpoint

You can update AI Gateway endpoints to enable and disable features. Updates to AI Gateway configurations take up to 1 minute to take effect.

To update AI Gateway features on an existing endpoint:

  1. Click on your endpoint from the AI Gateway page.
  2. In the Gateway Endpoint Details sidebar, click the edit icon next to the feature you want to update.
  3. Make your changes and click Save.

AI Gateway UI

The following sections summarize the available AI Gateway features, how to configure each one, and its details:

Usage tracking

Enabled by default.

  • Logs usage data to the system.ai_gateway.usage system table.
  • Account admins must enable the ai_gateway system table schema before using the system tables. See Grant access to system tables.
  • Only account admins have permission to view or query the system.ai_gateway.usage table.
  • The input and output token counts are estimated as (text_length+1)/4 if the token count is not returned by the model.
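The estimation formula above can be sketched as a small helper (the function name is hypothetical, and integer division is an assumption about how the result is rounded):

```python
def estimate_tokens(text: str) -> int:
    """Estimate a token count as (text_length + 1) / 4, the fallback
    used when the model does not return token counts.
    Hypothetical helper; integer division is assumed."""
    return (len(text) + 1) // 4

# A 99-character prompt is estimated at (99 + 1) // 4 = 25 tokens.
print(estimate_tokens("a" * 99))
```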

Inference tables

Select Enable inference tables to log requests and responses.

  • Logs to Unity Catalog Delta tables.
  • You must have CREATE TABLE permission in the specified catalog schema.
  • Payloads larger than 10 MiB are not logged.
  • For streaming requests, the logged response payload aggregates all of the returned chunks into a single response.
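The chunk aggregation and payload-size behavior can be illustrated with a minimal sketch (the chunk field names and helper functions are assumptions for illustration, not the product's internal format):

```python
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # payloads larger than 10 MiB are not logged


def aggregate_chunks(chunks: list[dict]) -> str:
    """Concatenate streamed response chunks into one payload, mirroring
    how the logged response aggregates all returned chunks.
    Illustrative only; the 'content' field name is assumed."""
    return "".join(chunk.get("content", "") for chunk in chunks)


def should_log(payload: str) -> bool:
    """Return True if the payload is small enough to be logged."""
    return len(payload.encode("utf-8")) <= MAX_PAYLOAD_BYTES
```

For example, `aggregate_chunks([{"content": "Hel"}, {"content": "lo"}])` yields the single payload `"Hello"`.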

Rate limits

Select Rate limits to configure queries per minute (QPM) or tokens per minute (TPM).

  • Configure limits at the endpoint, user, or group level.
    • Use the Endpoint field to set global limits. The endpoint rate limit is a global maximum. If exceeded, all requests are blocked.
    • Use the User (Default) field to set per-user limits.
      • Define custom rate limits for individual users, service principals, or groups.
  • A maximum of 20 rate limits and up to 5 group-specific rate limits can be specified.
  • If a user has both QPM and TPM limits, the more restrictive limit is enforced.
  • Rate limits only apply to users who have permission to query the endpoint.
  • By default, there are no rate limits configured for users or the endpoint.
  • Custom rate limits override the User (Default) rate limit.
    • If a user belongs to both a user-specific limit and a group-specific limit, the user-specific limit is enforced.
    • If a user belongs to multiple user groups with different rate limits, they are rate limited only if they exceed the QPM rate limits of all of their groups, or the TPM rate limits of all of their groups.

Fallbacks

Select Add fallback model to configure fallback models.

  • Requests fall back to other models when the primary model returns 429 or 5XX errors.
  • Each fallback model is tried once in sequential order until the request succeeds.
  • The first successful or last failed request attempt and response are logged in both usage tracking and inference tables.
  • All fallback attempts are recorded in the routing_information field of the usage tracking table.

The following diagram shows a fallbacks example where three models are registered as destinations of an AI Gateway endpoint:

  1. The request is originally routed to Model 1.
  2. If the request returns a 200 response, the request was successful on Model 1 and the request and its response are logged to the usage tracking and inference tables.
  3. If the request returns a 429 or 5XX error on Model 1, the request falls back to the next model on the endpoint, Model 2.
  4. If the request returns a 429 or 5XX error on Model 2, the request falls back to the next model on the endpoint, Model 3.
  5. If the request returns a 429 or 5XX error on Model 3, the request fails since all fallback models have been tried. The failed request and the response error are logged to the usage tracking and inference tables.
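The sequence in the diagram can be sketched as a simple routing loop (the model callables, status codes, and attempt records below are hypothetical stand-ins, loosely modeled on the `routing_information` field described above):

```python
def route_with_fallbacks(models, request):
    """Try each model once in order; fall back on 429 or 5XX errors.

    `models` is a list of (name, callable) pairs, where each callable
    returns (status_code, body). Returns (status, body, attempts),
    where `attempts` records every try, similar in spirit to the
    routing_information field. Hypothetical shapes for illustration.
    """
    attempts = []
    status, body = None, None
    for name, call in models:
        status, body = call(request)
        attempts.append({"model": name, "status": status})
        if not (status == 429 or 500 <= status <= 599):
            return status, body, attempts  # first success: stop here
    return status, body, attempts          # all models failed


# Example: Model 1 is rate limited, Model 2 errors, Model 3 succeeds.
models = [
    ("model-1", lambda req: (429, "rate limited")),
    ("model-2", lambda req: (503, "unavailable")),
    ("model-3", lambda req: (200, "ok")),
]
status, body, attempts = route_with_fallbacks(models, {"prompt": "hi"})
```

In this example the request succeeds on the third model, and all three attempts are recorded.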

Diagram: Fallbacks example

Next steps