Unity AI Gateway for serving endpoints
A new Unity AI Gateway experience is available in Beta. The new experience serves as the enterprise control plane for governing LLM endpoints and coding agents, and it adds enhanced features. See Unity AI Gateway for LLM endpoints.
This page describes Unity AI Gateway for serving endpoints, which governs and monitors access to supported generative AI models and their associated model-serving endpoints.
What is Unity AI Gateway for serving endpoints?
Unity AI Gateway streamlines the use and management of generative AI models and agents within an organization. It is a centralized service that brings governance, monitoring, and production readiness to model serving endpoints, and it lets you run, secure, and govern AI traffic to democratize and accelerate AI adoption across your organization.
All data is logged into Delta tables in Unity Catalog.
To start visualizing insights from your Unity AI Gateway data, download the example Unity AI Gateway dashboard from GitHub. This dashboard draws on data from the usage tracking system tables and the payload logging inference tables.
After you download the JSON file, import the dashboard into your workspace. For instructions on importing dashboards, see Import a dashboard file.
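Before building a dashboard, you can also query the logged data directly. The following is a minimal sketch, assuming it runs in a notebook where a `spark` session is available; the table name `main.default.my_endpoint_payload` and the column names are hypothetical placeholders, so substitute the inference table created for your endpoint:

```python
# Minimal sketch: preview the most recent rows of a payload logging
# inference table. The table name and column names below are assumed
# placeholders; substitute the inference table attached to your endpoint.
payload_table = "main.default.my_endpoint_payload"

recent = (
    spark.table(payload_table)
    .select("request_time", "status_code", "request", "response")
    .orderBy("request_time", ascending=False)
    .limit(10)
)
recent.show(truncate=50)
```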
Supported features
The new Unity AI Gateway features a rich UI, improved observability, and expanded API coverage for LLMs, including external models and pay-per-token Foundation Model APIs. We recommend using the new Unity AI Gateway to unlock these capabilities.
The following table defines the available Unity AI Gateway features and which model serving endpoint types support them.
| Feature | Definition | External models | Foundation Model APIs pay-per-token | Foundation Model APIs provisioned throughput | Agent endpoints | Custom models |
|---|---|---|---|---|---|---|
| Available in Unity AI Gateway | Use enhanced Unity AI Gateway features. See Unity AI Gateway for LLM endpoints. | Supported | Supported | Not supported | Not supported | Not supported |
| Permission and rate limiting | Control who has access and how much access. | Supported | Supported | Supported | Not supported | Supported |
| Payload logging | Monitor and audit data being sent to model APIs using inference tables. | Supported | Supported | Supported | Supported | Supported |
| Usage tracking | Monitor operational usage on endpoints and associated costs using system tables. | Supported | Supported | Supported | Not supported | Supported |
| AI Guardrails | Prevent unwanted and unsafe data in requests and responses. See AI Guardrails. | Supported | Supported | Supported | Not supported | Not supported |
| Fallbacks | Minimize production outages during and after deployment. | Supported | Not supported | Not supported | Not supported | Not supported |
| Traffic splitting | Load balance traffic across models. | Supported | Not supported | Supported | Not supported | Supported |
Unity AI Gateway incurs charges on a per-feature basis: you pay only for the features you enable. Paid features include payload logging and usage tracking. Query permissions, rate limiting, fallbacks, and traffic splitting are free of charge. Newly introduced features might also incur charges.
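For the traffic splitting feature in the table above, the split is typically declared as part of the endpoint's serving configuration. The sketch below is illustrative only; the `served_entities` and `traffic_config` field names are assumptions modeled on common model serving APIs, not a documented Unity AI Gateway contract:

```python
import json

# Illustrative sketch: split traffic 80/20 between two served models.
# All names and field shapes here are assumptions, not a documented contract.
endpoint_config = {
    "served_entities": [
        {"name": "model-a", "entity_name": "catalog.schema.model_a", "entity_version": "1"},
        {"name": "model-b", "entity_name": "catalog.schema.model_b", "entity_version": "2"},
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "model-a", "traffic_percentage": 80},
            {"served_model_name": "model-b", "traffic_percentage": 20},
        ]
    },
}

# This payload would be sent to the endpoint's config update API.
print(json.dumps(endpoint_config, indent=2))
```

The percentages across routes sum to 100, and incoming requests are then load balanced across the two served models according to that split.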
AI Guardrails
This feature is in Public Preview.
AI Guardrails allow users to configure and enforce data compliance at the model serving endpoint level and to reduce harmful content in requests sent to the underlying model. Requests and responses that violate the configured guardrails are blocked, and a default message is returned to the user. See how to configure guardrails on a model serving endpoint.
The AI Guardrails moderation service depends on Foundation Model APIs pay-per-token models, so it is available only in regions that support Foundation Model APIs pay-per-token.
The following table summarizes the configurable guardrails. See Limitations.
| Guardrail | Definition |
|---|---|
| Safety filtering | Safety filtering prevents your model from interacting with unsafe and harmful content, such as violent crime, self-harm, and hate speech. |
| Personally identifiable information (PII) detection | Detects sensitive user information, such as credit card numbers, in requests and responses. |
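If you prefer to script the configuration instead of using the Serving UI, a guardrails setup might look like the sketch below. The REST path `/api/2.0/serving-endpoints/{name}/ai-gateway`, the payload fields (`guardrails`, `safety`, `pii`), and the environment variable names are assumptions for illustration, not a documented contract:

```python
import os
import requests

# Illustrative sketch only. The API path and payload shape below are
# assumptions about how a gateway guardrails config could look; consult
# the Unity AI Gateway API reference for the actual contract.
host = os.environ["WORKSPACE_HOST"]    # e.g. "https://my-workspace.example.com"
token = os.environ["WORKSPACE_TOKEN"]  # an access token for the workspace
endpoint_name = "my-llm-endpoint"      # hypothetical endpoint name

payload = {
    "guardrails": {
        "input": {                          # applied to incoming requests
            "safety": True,                 # enable safety filtering
            "pii": {"behavior": "BLOCK"},   # block requests containing PII
        },
        "output": {                         # applied to model responses
            "safety": True,
        },
    }
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```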
Use Unity AI Gateway
You can configure Unity AI Gateway features on your model serving endpoints using the Serving UI. See Configure Unity AI Gateway on model serving endpoints.
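The same assumed API shape could cover the other gateway features. The following sketch enables usage tracking, payload logging to an inference table, and a rate limit in one call; as with the guardrails example above, the path and field names (`usage_tracking_config`, `inference_table_config`, `rate_limits`) are assumptions, not a documented contract:

```python
import os
import requests

# Illustrative sketch only: field names and the API path are assumptions.
host = os.environ["WORKSPACE_HOST"]
token = os.environ["WORKSPACE_TOKEN"]
endpoint_name = "my-llm-endpoint"  # hypothetical endpoint name

payload = {
    "usage_tracking_config": {"enabled": True},
    "inference_table_config": {
        "enabled": True,
        "catalog_name": "main",    # assumed Unity Catalog destination
        "schema_name": "default",
    },
    "rate_limits": [
        # e.g. at most 100 calls per minute across all users of the endpoint
        {"calls": 100, "renewal_period": "minute", "key": "endpoint"}
    ],
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```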
Limitations
The following are limitations for Unity AI Gateway-enabled endpoints:
- When AI guardrails are used, the request batch size, that is, the embeddings batch size, completions batch size, or the `n` parameter of chat requests, cannot exceed 16.
- If you use function calling and specify AI guardrails, those guardrails are not applied to the function's requests and intermediate responses. However, guardrails are applied to the final output response.
- Text-to-image workloads are not supported.
- Only usage tracking is supported for batch inference workloads on pay-per-token endpoints that have Unity AI Gateway features enabled. In the `endpoint_usage` system table, only the rows corresponding to the batch inference request are visible (see the query sketch after this list).
- AI guardrails and fallbacks are not supported on custom model serving endpoints.
- For custom model serving endpoints, only workloads that are not route-optimized support rate limiting and usage tracking.
- Inference tables for route-optimized model serving endpoints are in Public Preview.
- See Unity AI Gateway-enabled inference table limitations for more details.
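To verify what usage tracking records, such as the batch inference rows mentioned above, you can query the usage system table directly. A minimal sketch, assuming a `spark` session is available in a notebook; both the table name `system.serving.endpoint_usage` and the column names are assumptions:

```python
# Minimal sketch: summarize recent request volume per served entity from
# the usage tracking system table. Table and column names are assumptions.
usage = spark.table("system.serving.endpoint_usage")

(
    usage
    .groupBy("served_entity_id", "status_code")  # assumed columns
    .count()
    .orderBy("count", ascending=False)
    .show()
)
```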