Use model services

Beta

This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Databricks previews.

In this article, you learn the options for writing query requests for foundation models and sending them to a model service in Unity AI Gateway.

Unity AI Gateway exposes model services through a unified, OpenAI-compatible API, so you can experiment with and customize Databricks-hosted foundation models across providers. Identify a model service by its fully qualified name as the model slug—for example, system.ai.claude-opus-4-6—and send requests to your workspace's Unity AI Gateway base URL, https://<workspace-url>/ai-gateway/mlflow/v1.

note

The examples in this article query model services. For backward compatibility, Databricks interprets a Databricks-hosted model name without a fully qualified name, such as databricks-claude-opus-4-6, as the system-provided model service system.ai.claude-opus-4-6. This behavior lets existing workloads continue to run without code changes.

Query options

Unity AI Gateway provides the following options for sending query requests to model services that serve foundation models:

Method	Details
OpenAI client	Query a model service using the OpenAI client. Specify the model service's fully qualified name (for example, `system.ai.claude-opus-4-6`) as the `model` input. Supported for chat, embeddings, and completions models made available by Foundation Model APIs or external models.
REST API	Call and query the model service using the REST API. Send a `POST` request to the chat completions endpoint under your workspace's Unity AI Gateway base URL, `https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions`. See Unity AI Gateway.
Databricks Python SDK	The Databricks Python SDK is a layer on top of the REST API. It handles low-level details, such as authentication, making it easier to interact with the models.
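
As a rough illustration of the REST API option, the following sketch builds a chat completions request using only the Python standard library. The workspace host, the token value, and the `build_chat_request` helper name are placeholders introduced for this example, and the `Bearer` authorization header is an assumption based on common Databricks REST usage rather than something this article specifies.

```python
import json
import urllib.request


def build_chat_request(workspace_url, token, model, messages):
    """Build (but do not send) a POST request to the chat completions endpoint."""
    url = f"https://{workspace_url}/ai-gateway/mlflow/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_chat_request(
    "example.cloud.databricks.com",  # placeholder workspace host
    "<token>",  # placeholder API token
    "system.ai.claude-opus-4-6",
    [{"role": "user", "content": "What is a model service?"}],
)

# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any network call is made.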

note

During Beta, you cannot query a model service with the ai_query SQL function. Query model services with the OpenAI client or the REST API.

Requirements

EXECUTE on the model service, and USE CATALOG and USE SCHEMA on its catalog and schema. System-provided model services in system.ai grant EXECUTE to all account users by default. You don't need access to the models the service references—Databricks checks that the model service owner has EXECUTE on them.
A model service to query. To create a custom model service, see Create custom model services.
A Databricks workspace in a Unity AI Gateway supported region.
A Databricks API token, which is required to send a scoring request through the OpenAI client or REST API.

important

As a security best practice, Databricks recommends that you use machine-to-machine OAuth tokens for authentication in production scenarios.

For testing and development, Databricks recommends using a personal access token that belongs to a service principal instead of a workspace user. To create tokens for service principals, see Manage tokens for a service principal.
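
A minimal sketch of how a machine-to-machine OAuth token request might be built follows. The `/oidc/v1/token` path, the `all-apis` scope, and the client-credentials form are assumptions drawn from general Databricks OAuth usage, not from this article, so verify them against the OAuth documentation for your workspace. The helper name, host, and credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_m2m_token_request(workspace_url, client_id, client_secret):
    """Build (but do not send) a client-credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    form = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}  # assumed scope
    ).encode("ascii")
    return urllib.request.Request(
        f"https://{workspace_url}/oidc/v1/token",  # assumed token endpoint
        data=form,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


token_request = build_m2m_token_request(
    "example.cloud.databricks.com",  # placeholder workspace host
    "my-client-id",  # placeholder service principal client ID
    "my-client-secret",  # placeholder secret; load from a secret store in practice
)
```

If the endpoint behaves as assumed, the JSON response would carry an access token to use in place of a personal access token.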

Install packages

After you select a querying method, install the appropriate package on your cluster.

OpenAI client
REST API
Databricks Python SDK

To use the OpenAI client, install the databricks-openai package on your cluster. This package provides an OpenAI client with authorization automatically configured to query generative AI models. Run the following in your notebook or your local terminal:

pip install -U databricks-openai

The following is required only when you install the package in a Databricks notebook:

Python
dbutils.library.restartPython()
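
Once the package is installed, a first query might look like the following sketch. It uses the standard `openai` client with an explicit base URL and token, and assumes the `openai` package is available after installing databricks-openai. The helper names (`gateway_base_url`, `make_gateway_client`, `ask`) and the host and token placeholders are hypothetical.

```python
def gateway_base_url(workspace_url):
    """Return the Unity AI Gateway base URL for a workspace host."""
    return f"https://{workspace_url}/ai-gateway/mlflow/v1"


def make_gateway_client(workspace_url, token):
    """Return an OpenAI client pointed at the Unity AI Gateway base URL."""
    # Imported here so the rest of the module loads even without the package.
    from openai import OpenAI

    return OpenAI(api_key=token, base_url=gateway_base_url(workspace_url))


def ask(client, question, model="system.ai.claude-opus-4-6"):
    """Send one user message and return the text of the first choice."""
    response = client.chat.completions.create(
        model=model,  # fully qualified model service name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Example usage (requires network access and a valid token):
#   client = make_gateway_client("example.cloud.databricks.com", "<token>")
#   print(ask(client, "What is a model service?"))
```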

Structured outputs

Structured outputs are OpenAI-compatible and available only during model serving as part of Unity AI Gateway. For details, see Structured outputs on Databricks.
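
Because structured outputs are OpenAI-compatible, a request might carry a `response_format` field in the OpenAI `json_schema` shape. The sketch below shows one way to attach it to a request body. The `with_json_schema` helper, the `location` schema name, and the schema itself are illustrative, and the exact schema features that are supported are covered in the linked article rather than here.

```python
import json


def with_json_schema(payload, name, schema):
    """Return a copy of payload that requests output matching schema."""
    result = dict(payload)
    result["response_format"] = {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }
    return result


base = {
    "model": "system.ai.claude-opus-4-6",
    "messages": [
        {"role": "user", "content": "Extract the city and country from: 'I live in Lyon, France.'"}
    ],
}

# Illustrative schema: an object with two required string fields.
location_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
    "required": ["city", "country"],
    "additionalProperties": False,
}

payload = with_json_schema(base, "location", location_schema)
print(json.dumps(payload, indent=2))
```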

Prompt caching

Prompt caching is supported for Databricks-hosted Claude models as part of Unity AI Gateway.

You can specify the cache_control parameter in your query requests to cache the following:

Text message content in the messages.content array.
Thinking message content in the messages.content array.
Image content blocks in the messages.content array.
Tool use, results, and definitions in the tools array.

TextContent
ReasonContent
ImageContent
ToolCallContent

JSON
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's the date today?",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}

JSON
{
  "messages": [
    {
      "role": "assistant",
      "content": [
        {
          "type": "reasoning",
          "summary": [
            {
              "type": "summary_text",
              "text": "Thinking...",
              "signature": "[optional]"
            },
            {
              "type": "summary_encrypted_text",
              "data": "[encrypted text]"
            }
          ]
        }
      ]
    }
  ]
}

Image message content must use base64-encoded data as its source. URLs are not supported.

JSON
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,[content]"
          },
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
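
The image block above can be produced from raw bytes with a small helper. This is a sketch only: the `image_content_block` name and its parameters are hypothetical, and the block shape simply mirrors the JSON example.

```python
import base64


def image_content_block(image_bytes, mime_type="image/jpeg", cache=False):
    """Return an image_url content block that embeds the image as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    block = {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }
    if cache:
        # Mark the block for prompt caching, as in the JSON example.
        block["cache_control"] = {"type": "ephemeral"}
    return block


# In practice, read the bytes from a file: open("photo.jpg", "rb").read()
block = image_content_block(b"abc", cache=True)
print(block["image_url"]["url"])  # data:image/jpeg;base64,YWJj
```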

JSON
{
  "messages": [
    {
      "role": "assistant",
      "content": "Ok, let’s get the weather in New York.",
      "tool_calls": [
        {
          "type": "function",
          "id": "123",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"New York, NY\"}"
          },
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
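
Request bodies like the ones above can also be assembled programmatically. The following sketch adds `cache_control` to the last text block of a message and leaves the caller's message untouched. The `mark_text_cacheable` helper is hypothetical, and wrapping plain-string content in a content array is an assumption based on the OpenAI-compatible message format.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def mark_text_cacheable(message):
    """Return a copy of message whose last text block carries cache_control."""
    marked = copy.deepcopy(message)
    content = marked["content"]
    if isinstance(content, str):
        # A plain string has nowhere to attach cache_control, so wrap it
        # in a one-element content array first.
        content = [{"type": "text", "text": content}]
        marked["content"] = content
    for block in reversed(content):
        if block.get("type") == "text":
            block["cache_control"] = dict(EPHEMERAL)
            break
    return marked


long_context = {"role": "user", "content": "<long reference document>"}
cached = mark_text_cacheable(long_context)
```

Returning a copy keeps the original message reusable for requests where caching is not wanted.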

note

The Databricks REST API is OpenAI-compatible and differs from the Anthropic API. These differences also impact response objects like the following:

Output is returned in the choices field.
Streaming chunk format. All chunks adhere to the same format where choices contains the response delta and usage is returned in every chunk.
Stop reason is returned in the finish_reason field.
- Anthropic uses: end_turn, stop_sequence, max_tokens, and tool_use
- Respectively, Databricks uses: stop, stop, length, and tool_calls
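
The stop reason mapping and the streaming format described in this note can be sketched as follows. The mapping comes directly from the list above. The `delta` and `content` key names follow the OpenAI chunk format and are assumed here, and the sample chunks are invented for illustration rather than captured from a real response.

```python
# Anthropic stop reasons and the finish_reason values Databricks returns.
ANTHROPIC_TO_DATABRICKS_FINISH_REASON = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def collect_stream(chunks):
    """Join streamed text deltas; return (text, finish_reason, last usage)."""
    parts, finish_reason, usage = [], None, None
    for chunk in chunks:
        # usage is returned in every chunk, so keep the most recent one.
        usage = chunk.get("usage", usage)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason, usage


# Invented sample chunks in the assumed OpenAI-compatible shape.
sample_chunks = [
    {"choices": [{"delta": {"content": "Hel"}, "finish_reason": None}],
     "usage": {"prompt_tokens": 5, "completion_tokens": 1}},
    {"choices": [{"delta": {"content": "lo"}, "finish_reason": "stop"}],
     "usage": {"prompt_tokens": 5, "completion_tokens": 2}},
]

text, finish_reason, usage = collect_stream(sample_chunks)
```

Code ported from the Anthropic API can use the mapping to translate the stop reasons it expects into the values returned in `finish_reason`.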

Chat with supported LLMs using AI Playground

You can interact with supported large language models using the AI Playground. The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs from your Databricks workspace.

