
Enable Databricks Mosaic AI Gateway features

This notebook shows how to enable and use Databricks Mosaic AI Gateway features to manage and govern models from providers, such as OpenAI and Anthropic.

In this notebook, you use the Model Serving and AI Gateway APIs to accomplish the following tasks:

  • Create and configure an endpoint for OpenAI GPT-4o-Mini.
  • Enable AI Gateway features including usage tracking, inference tables, guardrails, and rate limits.
  • Set up invalid keywords and personally identifiable information (PII) detection for model requests and responses.
  • Implement rate limits for model serving endpoints.
  • Configure multiple models for A/B testing.
  • Enable fallbacks for failed requests.

If you prefer a low-code experience, you can create an external models endpoint and configure AI Gateway features using the Serving UI (AWS | Azure | GCP).


Create a model serving endpoint for OpenAI GPT-4o-Mini

The following creates a model serving endpoint for GPT-4o Mini without AI Gateway enabled. First, you define a helper function for creating and updating the endpoint:

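The helper cell isn't reproduced here, but a minimal sketch of such a helper might look like the following. It calls the Model Serving REST API directly and assumes DATABRICKS_HOST and DATABRICKS_TOKEN environment variables; the function name create_or_update_endpoint is illustrative.

```python
import os
import requests

# Workspace URL (for example, https://<workspace>.cloud.databricks.com) and a personal access token.
API_ROOT = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}


def create_or_update_endpoint(name: str, config: dict) -> dict:
    """Create a serving endpoint, or update its config if it already exists."""
    get_resp = requests.get(f"{API_ROOT}/api/2.0/serving-endpoints/{name}", headers=HEADERS)
    if get_resp.status_code == 404:
        # POST /api/2.0/serving-endpoints creates a new endpoint.
        resp = requests.post(
            f"{API_ROOT}/api/2.0/serving-endpoints",
            headers=HEADERS,
            json={"name": name, "config": config},
        )
    else:
        # PUT /api/2.0/serving-endpoints/{name}/config updates an existing endpoint.
        resp = requests.put(
            f"{API_ROOT}/api/2.0/serving-endpoints/{name}/config",
            headers=HEADERS,
            json=config,
        )
    resp.raise_for_status()
    return resp.json()
```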

Next, write a simple configuration to set up the endpoint. See POST /api/2.0/serving-endpoints for API details.

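A configuration along these lines defines a single served entity backed by the OpenAI gpt-4o-mini external model. The endpoint name matches the one used throughout this notebook; the secret scope and key that hold the OpenAI API key are placeholders.

```python
endpoint_name = "dr-gateway-demo"

config = {
    "served_entities": [
        {
            "name": "gpt-4o-mini",
            "external_model": {
                "name": "gpt-4o-mini",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {
                    # Reference an OpenAI API key stored in a Databricks secret
                    # (scope and key names are placeholders).
                    "openai_api_key": "{{secrets/<scope>/<openai_api_key>}}"
                },
            },
        }
    ]
}

create_or_update_endpoint(endpoint_name, config)
```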

One of the immediate benefits of serving OpenAI models (or models from other providers) through Databricks is that you can query the model using any of the following methods:

  • Databricks Python SDK
  • OpenAI Python client
  • REST API calls
  • MLflow Deployments SDK
  • Databricks SQL ai_query function

See the Query foundation models and external models article (AWS | Azure | GCP).

For example, you can use ai_query to query the model with Databricks SQL.

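For instance, a query like the following (run from a notebook cell with spark.sql, or directly in a SQL editor) sends a prompt to the endpoint. The prompt text is illustrative.

```python
# ai_query(endpoint_name, request) routes the request through the serving endpoint.
display(
    spark.sql(
        """
        SELECT ai_query(
          'dr-gateway-demo',
          'Summarize what a model serving endpoint is in one sentence.'
        ) AS response
        """
    )
)
```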

Add an AI Gateway configuration

After you set up a model serving endpoint, you can query the OpenAI model using any of the querying methods available in Databricks.

You can further enrich the model serving endpoint by enabling the Databricks Mosaic AI Gateway, which offers a variety of features for monitoring and managing your endpoint. These features include inference tables, guardrails, and rate limits, among other things.

To start, the following is a simple configuration that enables inference tables for monitoring endpoint usage. Understanding how the endpoint is being used, and how often, helps you determine which usage limits and guardrails are beneficial for your use case.

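A minimal AI Gateway configuration of this kind might look like the following, applied with the AI Gateway REST endpoint PUT /api/2.0/serving-endpoints/{name}/ai-gateway. The Unity Catalog catalog and schema names are placeholders.

```python
ai_gateway_config = {
    "usage_tracking_config": {"enabled": True},
    "inference_table_config": {
        "enabled": True,
        # Unity Catalog location where the inference table is written (placeholder names).
        "catalog_name": "main",
        "schema_name": "ai_gateway_demo",
    },
}

resp = requests.put(
    f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers=HEADERS,
    json=ai_gateway_config,
)
resp.raise_for_status()
```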

Query the inference table

The following displays the inference table that was created when you enabled it in the AI Gateway configuration. Note: For example purposes, a number of queries were run on this endpoint in the AI Playground after running the update above to add inference tables, but before querying the table.

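Assuming the catalog and schema configured above, the payload table can be displayed directly. The exact table name depends on your configuration; by default it is derived from the endpoint name and suffixed with _payload.

```python
# Placeholder fully qualified name for the inference table created by AI Gateway.
payload_table = "main.ai_gateway_demo.dr_gateway_demo_payload"
display(spark.table(payload_table))
```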

You can extract details such as the request messages, response messages, and token counts using SQL:

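A sketch of such a query follows, assuming the request and response columns store the raw JSON payloads as strings; adjust the JSON paths to match your payload schema.

```python
display(
    spark.sql(
        f"""
        SELECT
          request_time,
          get_json_object(request, '$.messages[0].content')          AS request_message,
          get_json_object(response, '$.choices[0].message.content')  AS response_message,
          get_json_object(response, '$.usage.total_tokens')          AS total_tokens
        FROM {payload_table}
        ORDER BY request_time DESC
        """
    )
)
```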

Set up AI Guardrails

Set invalid keywords

You can investigate the inference table to see whether the endpoint is being used for inappropriate topics. From the inference table, it looks like a user is talking about SuperSecretProject! For this example, you can assume that topic is not in the scope of use for this chat endpoint.

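For example, a filter like the following surfaces requests that mention the project, using the same payload table as above.

```python
display(
    spark.sql(
        f"""
        SELECT request_time, request
        FROM {payload_table}
        WHERE request ILIKE '%SuperSecretProject%'
        """
    )
)
```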

The following adds SuperSecretProject to the list of invalid keywords to make sure usage stays in scope.

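This extends the AI Gateway configuration from earlier with an input guardrail. The invalid_keywords list rejects any request that contains one of the listed strings.

```python
ai_gateway_config["guardrails"] = {
    "input": {
        # Requests containing these keywords are rejected before reaching the model.
        "invalid_keywords": ["SuperSecretProject"],
    },
}

resp = requests.put(
    f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers=HEADERS,
    json=ai_gateway_config,
)
resp.raise_for_status()
```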

Now, queries referencing SuperSecretProject are not run; instead, they return an error message, "Error: Invalid keywords detected in the prompt. Please revise your input."

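For example, a request like the following, sent with the OpenAI client pointed at the Databricks endpoint, now fails with the error shown below. The prompt text is illustrative.

```python
from openai import OpenAI

# Use the OpenAI client against the Databricks serving endpoint.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

client.chat.completions.create(
    model=endpoint_name,
    messages=[{"role": "user", "content": "Give me a status update on SuperSecretProject."}],
)
```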

Error: Invalid keywords detected in the prompt. Please revise your input.

Set up PII detection

Now, the endpoint blocks messages referencing SuperSecretProject. You can also make sure the endpoint doesn't accept requests, or return responses, that contain any PII.

The following updates the guardrails configuration for pii:

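Building on the guardrails configuration above, a sketch of the PII settings follows. It blocks PII in both requests (input) and responses (output).

```python
# Block PII in requests and responses.
ai_gateway_config["guardrails"]["input"]["pii"] = {"behavior": "BLOCK"}
ai_gateway_config["guardrails"]["output"] = {"pii": {"behavior": "BLOCK"}}

resp = requests.put(
    f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers=HEADERS,
    json=ai_gateway_config,
)
resp.raise_for_status()
```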

The following tries to prompt the model to work with PII, and the request returns the message, "Error: PII (Personally Identifiable Information) detected. Please try again."

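A request along these lines triggers the guardrail; the name, email address, and phone number are fictional.

```python
client.chat.completions.create(
    model=endpoint_name,
    messages=[
        {
            "role": "user",
            "content": "Draft an email to Jane Doe at jane.doe@example.com, phone 555-010-4567.",
        }
    ],
)
```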

Error: PII (Personally Identifiable Information) detected. Please try again.

Add rate limits

Say you are investigating the inference table further and you see some steep spikes in usage, suggesting a higher-than-expected volume of queries. Extremely high usage could be costly if not monitored and limited.

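For example, an aggregation like the following counts requests per minute in the inference table, which makes spikes easy to spot.

```python
display(
    spark.sql(
        f"""
        SELECT date_trunc('minute', request_time) AS minute, count(*) AS requests
        FROM {payload_table}
        GROUP BY 1
        ORDER BY 1
        """
    )
)
```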

You can set a rate limit to prevent excessive queries. In this case, you can set the limit on the endpoint, but it is also possible to set per-user limits.

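The following sketch adds an endpoint-level limit to the AI Gateway configuration. The value of 10 calls per minute is chosen for demonstration; a per-user limit would use "key": "user" instead.

```python
# Limit the endpoint to 10 calls per minute (value chosen for demonstration).
ai_gateway_config["rate_limits"] = [
    {"calls": 10, "key": "endpoint", "renewal_period": "minute"}
]

resp = requests.put(
    f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers=HEADERS,
    json=ai_gateway_config,
)
resp.raise_for_status()
```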

The following shows an example of what the output error looks like when the rate limit is exceeded.

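A loop along these lines, using the OpenAI client configured earlier, sends requests until the endpoint-level limit is hit.

```python
import time

# Send requests in a tight loop until the endpoint-level rate limit is exceeded.
start_time = time.time()
for i in range(1, 12):
    client.chat.completions.create(
        model=endpoint_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"This is request {i}"},
        ],
        max_tokens=10,
    )
    print(f"Request {i} sent")
print(f"Total time: {time.time() - start_time:.2f} seconds")
```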

Request 1 sent
Request 2 sent
Request 3 sent
Request 4 sent
Request 5 sent
Request 6 sent
Request 7 sent
Request 8 sent
Request 9 sent
Request 10 sent

RateLimitError: Error code: 429 - {'error_code': 'REQUEST_LIMIT_EXCEEDED', 'message': 'REQUEST_LIMIT_EXCEEDED: User defined rate limit(s) exceeded for endpoint: dr-gateway-demo.'}

Add another model

At some point, you might want to A/B test models from different providers. You can add another OpenAI model to the configuration, like in the following example:

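A sketch of such a configuration follows, adding gpt-4o as a second served entity and splitting traffic evenly between the two models. The model choice, secret references, and 50/50 split are placeholders you can adjust.

```python
config = {
    "served_entities": [
        {
            "name": "gpt-4o-mini",
            "external_model": {
                "name": "gpt-4o-mini",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {"openai_api_key": "{{secrets/<scope>/<openai_api_key>}}"},
            },
        },
        {
            "name": "gpt-4o",
            "external_model": {
                "name": "gpt-4o",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {"openai_api_key": "{{secrets/<scope>/<openai_api_key>}}"},
            },
        },
    ],
    # Split traffic evenly between the two served entities.
    "traffic_config": {
        "routes": [
            {"served_model_name": "gpt-4o-mini", "traffic_percentage": 50},
            {"served_model_name": "gpt-4o", "traffic_percentage": 50},
        ]
    },
}

create_or_update_endpoint(endpoint_name, config)
```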

Now, traffic will be split between these two models (you can configure the proportion of traffic going to each model). This enables you to use the inference tables to evaluate the quality of each model and make an informed decision about switching from one model to another.

Enable fallback models for requests

For requests on External Models, you can configure a fallback.

Enabling fallbacks ensures that if a request to one entity fails with a 429 or 5XX error, it automatically fails over to the next entity in the listed order, cycling back to the top if necessary. A maximum of two fallbacks is allowed. Any external model assigned 0% traffic functions exclusively as a fallback model. The first successful or last failed request attempt is recorded in both the usage tracking system table and the inference table.

In the following example:

  • The traffic_config field specifies that 50% of traffic goes to external_model_1 and the other 50% goes to external_model_2.
  • In the ai_gateway section, the fallback_config field specifies that fallbacks are enabled.
  • If a request sent to external_model_1 fails, the request is redirected to the next model listed in the traffic configuration, in this case, external_model_2.
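The following is a minimal sketch of such a configuration, reusing the REST helper defined earlier. Both served entities are OpenAI-backed here for simplicity; the model names and secret references are placeholders.

```python
config = {
    "served_entities": [
        {
            "name": "external_model_1",
            "external_model": {
                "name": "gpt-4o-mini",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {"openai_api_key": "{{secrets/<scope>/<openai_api_key>}}"},
            },
        },
        {
            "name": "external_model_2",
            "external_model": {
                "name": "gpt-4o",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {"openai_api_key": "{{secrets/<scope>/<openai_api_key>}}"},
            },
        },
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "external_model_1", "traffic_percentage": 50},
            {"served_model_name": "external_model_2", "traffic_percentage": 50},
        ]
    },
}
create_or_update_endpoint(endpoint_name, config)

# Enable fallbacks in the AI Gateway configuration.
ai_gateway_config["fallback_config"] = {"enabled": True}
resp = requests.put(
    f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}/ai-gateway",
    headers=HEADERS,
    json=ai_gateway_config,
)
resp.raise_for_status()
```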