Use foundation models

This article describes the options for writing query requests for foundation models and how to send them to your model serving endpoint. You can query foundation models hosted by Databricks and foundation models hosted outside of Databricks.

For query requests to traditional ML or Python models, see Query serving endpoints for custom models.

Mosaic AI Model Serving supports Foundation Model APIs and external models for accessing foundation models. Model Serving uses a unified OpenAI-compatible API and SDK for querying them, so you can experiment with and customize foundation models for production across supported clouds and providers.

Query options

Mosaic AI Model Serving provides the following options for sending query requests to endpoints that serve foundation models:

Method	Details
OpenAI client	Query a model hosted by a Mosaic AI Model Serving endpoint using the OpenAI client. Specify the model serving endpoint name as the `model` input. Supported for chat, embeddings, and completions models made available by external models.
Serving UI	Select Query endpoint from the Serving endpoint page. Insert JSON format model input data and click Send Request. If the model has an input example logged, use Show Example to load it.
REST API	Call and query the model using the REST API. See POST /serving-endpoints/{name}/invocations for details. For scoring requests to endpoints serving multiple models, see Query individual models behind an endpoint.
MLflow Deployments SDK	Use MLflow Deployments SDK's predict() function to query the model.
Databricks Python SDK	The Databricks Python SDK is a layer on top of the REST API. It handles low-level details, such as authentication, which makes it easier to interact with the models.
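
As a minimal sketch of the REST API option, the following builds (but does not send) a chat request for POST /serving-endpoints/{name}/invocations using only the Python standard library. The workspace URL and token are placeholders, the helper name is hypothetical, and the `messages` and `max_tokens` fields assume a chat model that accepts the OpenAI-compatible request shape.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, messages, max_tokens=256):
    """Build a POST request for /serving-endpoints/{name}/invocations.

    All arguments are supplied by the caller; nothing is sent from this function.
    """
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invocation_request(
    host="https://example.cloud.databricks.com",  # hypothetical workspace URL
    endpoint_name="databricks-gpt-oss-20b",
    token="<your-api-token>",  # placeholder, not a real token
    messages=[{"role": "user", "content": "What is a foundation model?"}],
)
# To send it, pass `request` to urllib.request.urlopen() and decode the JSON body.
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any call reaches the endpoint.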

Requirements

A model serving endpoint.
A Databricks workspace in a supported region.
- Foundation Model APIs regions
- External models regions
To send a scoring request through the OpenAI client, REST API, or MLflow Deployments SDK, you must have a Databricks API token.
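
One way to keep the API token out of source code is to read it from the environment. The sketch below assumes the workspace URL and token are stored in the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables; the function names are hypothetical and chosen only for illustration.

```python
import os


def load_databricks_credentials(env=None):
    """Return (host, token) read from environment variables, or raise if any is missing."""
    env = os.environ if env is None else env
    names = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")  # assumed variable names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["DATABRICKS_HOST"].rstrip("/"), env["DATABRICKS_TOKEN"]


def auth_headers(token):
    """Build bearer-token headers for a REST scoring request."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Failing early with a clear message when a variable is missing is usually easier to debug than an authentication error returned by the endpoint.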

Install packages

After you have selected a querying method, install the appropriate package on your cluster.

OpenAI client

To use the OpenAI client, install the databricks-openai package on your cluster. This package provides an OpenAI client with authorization automatically configured to query generative AI models. Run the following in your notebook or your local terminal:

pip install -U databricks-openai

The following is only required when installing the package in a Databricks notebook:

Python
dbutils.library.restartPython()

MLflow Deployments SDK

!pip install mlflow

The following is only required when installing the package in a Databricks notebook:

Python
dbutils.library.restartPython()

Foundation model types

The following table summarizes the supported foundation models based on task type.

Task type	Description	Supported models	Recommended use cases
General purpose	Models designed to understand and engage in natural, multi-turn conversations. They are fine-tuned on large datasets of human dialogue, which enables them to generate contextually relevant responses, track conversational history, and provide coherent, human-like interactions across various topics.	The following are supported Databricks-hosted foundation models: `databricks-gpt-5` `databricks-gpt-5-mini` `databricks-gpt-5-nano` `databricks-gpt-5-3-codex` `databricks-gpt-5-2-codex` `databricks-gpt-5-1-codex-max` `databricks-gpt-5-1-codex-mini` `databricks-gemini-3-1-pro` `databricks-gemini-3-pro` `databricks-gemini-3-flash` `databricks-gemini-2-5-pro` `databricks-gemini-2-5-flash` `databricks-claude-sonnet-4-6` `databricks-claude-sonnet-4-5` `databricks-claude-opus-4-1` `databricks-gpt-oss-20b` `databricks-gpt-oss-120b` `databricks-gemma-3-12b` `databricks-claude-sonnet-4` `databricks-llama-4-maverick` `databricks-claude-3-7-sonnet` `databricks-meta-llama-3-3-70b-instruct` The following are supported external models: OpenAI GPT and o series models Anthropic Claude models Google Gemini models	Recommended for scenarios where natural, multi-turn dialogue and contextual understanding are needed: Virtual assistants Customer support bots Interactive tutoring systems.
Embeddings	Embedding models are machine learning systems that transform complex data (such as text, images, or audio) into compact numerical vectors called embeddings. These vectors capture the essential features and relationships within the data, allowing for efficient comparison, clustering, and semantic search.	The following are supported Databricks-hosted foundation models: `databricks-qwen3-embedding-0-6b` `databricks-gte-large-en` The following are supported external models: OpenAI text embedding models Cohere text embedding models Google text embedding models	Recommended for applications where semantic understanding, similarity comparison, and efficient retrieval or clustering of complex data are essential: Semantic search Retrieval augmented generation (RAG) Topic clustering Sentiment analysis and text analytics
Vision	Models designed to process, interpret, and analyze visual data, such as images and videos, so that machines can "see" and understand the visual world.	The following are supported Databricks-hosted foundation models: `databricks-gpt-5-2` `databricks-gpt-5-1` `databricks-gpt-5` `databricks-gpt-5-mini` `databricks-gpt-5-nano` `databricks-gemini-3-1-pro` `databricks-gemini-3-pro` `databricks-gemini-3-flash` `databricks-gemini-2-5-pro` `databricks-gemini-2-5-flash` `databricks-gemma-3-12b` `databricks-claude-sonnet-4-6` `databricks-claude-sonnet-4-5` `databricks-claude-haiku-4-5` `databricks-claude-sonnet-4` `databricks-claude-opus-4-6` `databricks-claude-opus-4-5` `databricks-claude-opus-4-1` `databricks-claude-3-7-sonnet` `databricks-llama-4-maverick` The following are supported external models: OpenAI GPT and o series models with vision capabilities Anthropic Claude models with vision capabilities Google Gemini models with vision capabilities Other external foundation models with vision capabilities that are OpenAI API compatible are also supported.	Recommended wherever automated, accurate, and scalable analysis of visual information is needed: Object detection and recognition Image classification Image segmentation Document understanding
Reasoning	Advanced AI systems designed to simulate human-like logical thinking. Reasoning models integrate techniques such as symbolic logic, probabilistic reasoning, and neural networks to analyze context, break down tasks, and explain their decision-making.	The following are supported Databricks-hosted foundation models: `databricks-gpt-5-2` `databricks-gpt-5-1` `databricks-gpt-5` `databricks-gpt-5-mini` `databricks-gpt-5-nano` `databricks-gemini-3-1-pro` `databricks-gemini-3-pro` `databricks-gemini-3-flash` `databricks-gemini-2-5-pro` `databricks-gemini-2-5-flash` `databricks-claude-sonnet-4-6` `databricks-claude-sonnet-4-5` `databricks-gpt-oss-20b` `databricks-gpt-oss-120b` `databricks-claude-sonnet-4` `databricks-claude-opus-4-6` `databricks-claude-opus-4-5` `databricks-claude-opus-4-1` `databricks-claude-3-7-sonnet` The following are supported external models: OpenAI models with reasoning capabilities Anthropic Claude models with reasoning capabilities Google Gemini models with reasoning capabilities	Recommended wherever multi-step logical analysis and task decomposition are needed: Code generation Content creation and summarization Agent orchestration
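
To illustrate the semantic search use case listed for embedding models, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made-up stand-ins (real embedding models return much longer vectors), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors):
    """Return document keys sorted from most to least similar to the query."""
    return sorted(
        doc_vectors,
        key=lambda key: cosine_similarity(query_vector, doc_vectors[key]),
        reverse=True,
    )


# Illustrative embeddings only; in practice these come from an embeddings endpoint.
docs = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.1],
    "returns": [0.2, 0.7, 0.3],
}
query = [0.0, 1.0, 0.2]
ranking = rank_by_similarity(query, docs)
```

With these sample vectors, the ranking places "shipping" first, then "returns", then "billing".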

Function calling

Databricks Function Calling is OpenAI-compatible and is only available during model serving as part of Foundation Model APIs and serving endpoints that serve external models. For details, see Function calling on Databricks.
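
Because function calling follows the OpenAI request shape, a tool definition can be sketched as a plain dictionary. The example below is illustrative only: `get_weather` is a hypothetical function, the helper names are invented for this sketch, and `tool_choice` is assumed to accept `"auto"` as it does in the OpenAI API.

```python
import json


def make_tool(name, description, properties, required):
    """Build one OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def parse_tool_call(tool_call):
    """Return (function name, decoded arguments) from one tool_calls entry."""
    function = tool_call["function"]
    return function["name"], json.loads(function["arguments"])


weather_tool = make_tool(
    name="get_weather",  # hypothetical function
    description="Get the current weather for a location.",
    properties={"location": {"type": "string", "description": "City and state"}},
    required=["location"],
)
request_body = {
    "messages": [{"role": "user", "content": "What's the weather in New York?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # assumption: same value as in the OpenAI API
}

# Hand-written stand-in for a tool_calls entry returned by the model.
sample_call = {
    "type": "function",
    "id": "123",
    "function": {"name": "get_weather", "arguments": "{\"location\":\"New York, NY\"}"},
}
name, arguments = parse_tool_call(sample_call)
```

Note that `arguments` arrives as a JSON-encoded string, so it must be decoded before the function is called.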

Structured outputs

Structured outputs is OpenAI-compatible and is only available during model serving as part of Foundation Model APIs. For details, see Structured outputs on Databricks.
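
As a rough sketch of what an OpenAI-compatible structured output request can look like, the following builds a `response_format` with a JSON schema and validates a reply against the expected keys. The schema name, fields, and sample reply are hypothetical, and the exact set of supported schema features is an assumption to confirm against the Databricks documentation.

```python
import json

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "support_ticket",  # hypothetical schema name
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "priority": {"type": "integer"},
            },
            "required": ["category", "priority"],
        },
        "strict": True,
    },
}

request_body = {
    "messages": [{"role": "user", "content": "Classify: 'My invoice is wrong.'"}],
    "response_format": response_format,
}


def parse_structured_reply(content, required_keys):
    """Decode the model's JSON reply and check that the required keys are present."""
    data = json.loads(content)
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Reply is missing keys: {missing}")
    return data


# Stand-in for the message content of a real response.
sample_content = '{"category": "billing", "priority": 2}'
ticket = parse_structured_reply(sample_content, ["category", "priority"])
```

Validating the decoded reply on the client side is a cheap safeguard even when the endpoint enforces the schema.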

Prompt caching

Prompt caching is supported for Databricks-hosted Claude models as part of Foundation Model APIs.

You can specify the cache_control parameter in your query requests to cache the following:

Text message content in the messages.content array.
Thinking message content in the messages.content array.
Image content blocks in the messages.content array.
Tool use, results, and definitions in the tools array.

See Foundation model REST API reference.

TextContent

JSON
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's the date today?",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}

ReasonContent

JSON
{
  "messages": [
    {
      "role": "assistant",
      "content": [
        {
          "type": "reasoning",
          "summary": [
            {
              "type": "summary_text",
              "text": "Thinking...",
              "signature": "[optional]"
            },
            {
              "type": "summary_encrypted_text",
              "data": "[encrypted text]"
            }
          ]
        }
      ]
    }
  ]
}

ImageContent

Image message content must use the encoded data as its source. URLs are not supported.

JSON
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,[content]"
          },
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}

ToolCallContent

JSON
{
  "messages": [
    {
      "role": "assistant",
      "content": "Ok, let’s get the weather in New York.",
      "tool_calls": [
        {
          "type": "function",
          "id": "123",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"New York, NY\"}"
          },
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
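
When requests are assembled in Python, the `cache_control` marker can be attached with a small helper. This is a sketch under stated assumptions: the helper name is hypothetical, and it assumes that a plain-string `content` can be rewritten as a single text block before the marker is added.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def with_cache_control(message, block_index=-1):
    """Return a copy of `message` whose chosen content block carries cache_control.

    A plain-string `content` is first converted to a single text block.
    The input message is left unmodified.
    """
    marked = copy.deepcopy(message)
    content = marked["content"]
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
        marked["content"] = content
    content[block_index]["cache_control"] = dict(EPHEMERAL)
    return marked


long_context_message = {"role": "user", "content": "Here is a long reference document."}
cached = with_cache_control(long_context_message)
```

Returning a copy keeps the original message reusable in requests where caching is not wanted.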

Note

The Databricks REST API is OpenAI-compatible and differs from the Anthropic API. These differences also impact response objects like the following:

Output is returned in the choices field.
Streaming chunk format. All chunks adhere to the same format where choices contains the response delta and usage is returned in every chunk.
Stop reason is returned in the finish_reason field.
- Anthropic uses: end_turn, stop_sequence, max_tokens, and tool_use
- Respectively, Databricks uses: stop, stop, length, and tool_calls
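
The stop-reason correspondence and the streaming format described above can be sketched in a few lines of Python. The mapping restates the values listed in this note; the sample chunks are hand-written for illustration, and the helper name is hypothetical.

```python
# Anthropic stop reasons and the finish_reason values Databricks returns instead.
ANTHROPIC_TO_DATABRICKS_STOP = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def collect_stream(chunks):
    """Concatenate content deltas and keep the last finish_reason and usage seen."""
    text_parts, finish_reason, usage = [], None, None
    for chunk in chunks:
        usage = chunk.get("usage", usage)
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text_parts), finish_reason, usage


# Hand-written sample chunks; field values are illustrative only.
sample_chunks = [
    {
        "choices": [{"delta": {"content": "Hel"}, "finish_reason": None}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6},
    },
    {
        "choices": [{"delta": {"content": "lo"}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
    },
]
text, finish_reason, usage = collect_stream(sample_chunks)
```

Because usage is returned in every chunk, the sketch simply keeps the most recent value rather than summing across chunks.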

Chat with supported LLMs using AI Playground

You can interact with supported large language models using the AI Playground. The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs from your Databricks workspace.
