Query foundation models by type

Beta

This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Databricks previews.

In this article, you learn how to write query requests for Databricks-hosted foundation models served by model services in Unity AI Gateway, organized by model type: chat, vision, audio and video, and reasoning.

Requirements

See Requirements.
Install the appropriate package on your cluster based on the querying client option you choose.

Note

The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Discover foundation models for a list of available foundation models and their model service and endpoint names.

Chat

Chat models are foundation models optimized for chat and general-purpose tasks.

The examples in this section show how to query a model service using the different client options.

For a batch inference example, see Enrich data using AI Functions.

OpenAI Chat Completions
OpenAI Responses
REST API
Databricks Python SDK
LangChain

To use the OpenAI client, specify the model service name as the model input.

Python
from databricks_openai import DatabricksOpenAI

client = DatabricksOpenAI()

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_tokens=256
)

To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and openai installed on your compute.

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_tokens=256
)
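The two inputs to the client above, the workspace URL and the token, can be validated before the client is created. The following is a minimal sketch, not part of any Databricks library: the helper names `gateway_base_url` and `require_token` are hypothetical, and the path suffix is copied from the example above.

```python
import os


def gateway_base_url(workspace_host: str) -> str:
    """Return the base URL used in the example above (hypothetical helper).

    Accepts a host with or without a scheme or trailing slash.
    """
    host = workspace_host.strip().rstrip("/")
    if not host:
        raise ValueError("workspace host is empty")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    # Path suffix copied from the example above.
    return host + "/ai-gateway/mlflow/v1"


def require_token(env_var: str = "DATABRICKS_TOKEN") -> str:
    """Fail early with a clear message when the token is not set."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

Both results can then be passed as `base_url` and `api_key` when constructing the `OpenAI` client.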

As an example, the following is the expected request format for a chat model when using the REST API.

JSON
{
  "messages": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the REST API:

JSON
{
  "model": "databricks-claude-sonnet-4-5",
  "choices": [
    {
      "message": {},
      "index": 0,
      "finish_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 74,
    "total_tokens": 81
  },
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}
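As an illustration of reading this format, the following sketch pulls the fields a caller usually needs out of the response. The helper name `summarize_chat_response` is hypothetical, and the sample is the response shown above, so the message content is empty.

```python
import json

# Sample copied from the expected response format above.
raw = """
{
  "model": "databricks-claude-sonnet-4-5",
  "choices": [{"message": {}, "index": 0, "finish_reason": null}],
  "usage": {"prompt_tokens": 7, "completion_tokens": 74, "total_tokens": 81},
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}
"""


def summarize_chat_response(response: dict) -> dict:
    """Extract the reply text, finish reason, and token total (hypothetical helper)."""
    choice = response["choices"][0]
    message = choice.get("message") or {}
    usage = response.get("usage", {})
    return {
        "text": message.get("content"),  # None when the message is empty
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_chat_response(json.loads(raw))
print(summary)
```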

Important

The Responses API is only compatible with OpenAI models.

To use the OpenAI Responses API, specify the model service name as the model input.

Python
from databricks_openai import DatabricksOpenAI

client = DatabricksOpenAI()

response = client.responses.create(
    model="system.ai.gpt-5",
    input=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_output_tokens=256
)

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('DATABRICKS_TOKEN'),
    base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)

response = client.responses.create(
    model="system.ai.gpt-5",
    input=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_output_tokens=256
)

As an example, the following is the expected request format when using the OpenAI Responses API. The URL path for this API is /serving-endpoints/responses.

JSON
{
  "model": "databricks-gpt-5",
  "input": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_output_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the Responses API:

JSON
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1698824353,
  "model": "databricks-gpt-5",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": []
    }
  ],
  "usage": {
    "input_tokens": 7,
    "output_tokens": 74,
    "total_tokens": 81
  }
}
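In this format the reply text sits inside the `content` array of each `message` item in `output`. The sketch below shows one way to collect it. The helper name is hypothetical, and it assumes text parts are tagged `"output_text"` as in the OpenAI Responses API; the sample above shows an empty `content` array, so confirm the part shape against a real response.

```python
def response_output_text(response: dict) -> str:
    """Join the text of all assistant message parts (hypothetical helper)."""
    chunks = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip any non-message output items
        for part in item.get("content", []):
            # Assumption: text parts use type "output_text".
            if part.get("type") == "output_text":
                chunks.append(part.get("text", ""))
    return "".join(chunks)
```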

Bash
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.claude-sonnet-4-5",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ]
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
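The same request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: `build_chat_request` is a hypothetical helper, and it sends the token as a Bearer header rather than the `-u token:` form used in the curl command. The function only builds the request object; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_chat_request(workspace_url, token, model, messages, **params):
    """Build (but do not send) a POST request for the chat completions path."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/ai-gateway/mlflow/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: Bearer authentication with a Databricks token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send the request:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```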

This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from a Databricks notebook.

Python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()
response = w.serving_endpoints.query(
    name="system.ai.claude-sonnet-4-5",
    messages=[
        ChatMessage(
            role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
        ),
        ChatMessage(
            role=ChatMessageRole.USER, content="What is a mixture of experts model?"
        ),
    ],
    max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")

To query a model service using LangChain, you can use the ChatDatabricks ChatModel class and specify the model.

Bash
%pip install databricks-langchain

Python
from langchain_core.messages import HumanMessage, SystemMessage
from databricks_langchain import ChatDatabricks

messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(content="What is a mixture of experts model?"),
]

llm = ChatDatabricks(model="system.ai.claude-sonnet-4-5")
llm.invoke(messages)

Vision

Query Databricks-hosted vision models through model services in Unity AI Gateway to understand and analyze images with a unified API.

OpenAI client

To use the OpenAI client, specify the model service name as the model input.

Python

from openai import OpenAI
import base64
import requests

# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client = OpenAI(
    api_key=API_TOKEN,
    base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)

# Download and encode image
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
resp = requests.get(image_url)
resp.raise_for_status()
image_data = base64.b64encode(resp.content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)

The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.

Python

from openai import OpenAI
import base64
import requests

# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client = OpenAI(
    api_key=API_TOKEN,
    base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)

# Download and encode multiple images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp1 = requests.get(image1_url)
resp1.raise_for_status()
image1_data = base64.b64encode(resp1.content).decode("utf-8")

image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp2 = requests.get(image2_url)
resp2.raise_for_status()
image2_data = base64.b64encode(resp2.content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are in these images? Is there any difference between them?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
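The download-and-encode steps repeated in the examples above can be factored into small helpers. The following sketch uses hypothetical names (`image_part`, `vision_message`) and builds the same content structure as the examples from raw image bytes.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content item with a data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt: str, images: list, mime_type: str = "image/jpeg") -> dict:
    """Build one user message holding a text prompt followed by every image."""
    content = [{"type": "text", "text": prompt}]
    content.extend(image_part(img, mime_type) for img in images)
    return {"role": "user", "content": content}
```

The returned message can be passed as the single element of `messages` in `client.chat.completions.create`.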

Input image requirements

Model	Supported formats	Multiple images per request	Image size limitations	Image resizing recommendations	Image quality considerations
`databricks-gpt-5-5-pro`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-5`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-4`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-4-mini`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-4-nano`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-2`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-1`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-mini`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gpt-5-nano`	`JPEG` `PNG` `WebP` `GIF` (Non-animated `GIF`)	Up to 500 individual image inputs per request	File size limit: Up to 10 MB total payload size per request	N/A	No watermarks or logos Clear enough for a human to understand
`databricks-gemini-3-5-flash`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-3-1-pro`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-3-pro`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-3-flash`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-3-1-flash-lite`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-2-5-pro`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemini-2-5-flash`	`JPEG` `PNG` `WebP`	Up to 50 images for API requests. All provided images are processed in a request.	File size limit: 7 MB each image	N/A	N/A
`databricks-gemma-3-12b`	`JPEG` `PNG` `WebP` `GIF`	Up to 5 images for API requests. All provided images are processed in a request.	File size limit: 10 MB total across all images per API request	N/A	N/A
`databricks-llama-4-maverick`	`JPEG` `PNG` `WebP` `GIF`	Up to 5 images for API requests. All provided images are processed in a request.	File size limit: 10 MB total across all images per API request	N/A	N/A
`databricks-claude-sonnet-4-6` `databricks-claude-sonnet-4-5` `databricks-claude-haiku-4-5` `databricks-claude-opus-4-8` `databricks-claude-opus-4-7` `databricks-claude-opus-4-6` `databricks-claude-opus-4-5` `databricks-claude-opus-4-1` `databricks-claude-sonnet-4`	`JPEG` `PNG` `GIF` `WebP`	Up to 20 images for Claude.ai. Up to 100 images for API requests. All provided images are processed in a request, which is useful for comparing or contrasting them.	Images larger than 8000x8000 px are rejected. If more than 20 images are submitted in one API request, the maximum allowed size per image is 2000x2000 px.	For optimal performance, resize images before uploading if they are too large. If an image's long edge exceeds 1568 pixels or its size exceeds ~1,600 tokens, it is automatically scaled down while preserving aspect ratio. Very small images (under 200 pixels on any edge) may degrade performance. To reduce latency, keep images within 1.15 megapixels and at most 1568 pixels in both dimensions.	Clarity: Avoid blurry or pixelated images. Text in images: Ensure text is legible and not too small. Avoid cropping out key visual context just to enlarge the text.
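The Claude limits in the last row can be checked before a request is sent. The sketch below is illustrative only: `check_claude_image_batch` is a hypothetical helper, it takes pixel dimensions you have already measured, and it assumes an image is too large when either edge exceeds the limit.

```python
def check_claude_image_batch(dimensions):
    """Return a list of problems for a batch of (width, height) pixel sizes.

    Limits are taken from the Claude row of the table above.
    """
    problems = []
    if len(dimensions) > 100:
        problems.append(f"{len(dimensions)} images exceeds the 100-image API limit")
    # The per-image limit tightens once a request carries more than 20 images.
    limit = 2000 if len(dimensions) > 20 else 8000
    for index, (width, height) in enumerate(dimensions):
        # Assumption: exceeding the limit on either edge counts as too large.
        if width > limit or height > limit:
            problems.append(
                f"image {index}: {width}x{height} px exceeds {limit}x{limit} px"
            )
    return problems
```

An empty list means the batch passes these checks; the other guidance in the table, such as resizing for latency, is not covered.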

Image to token conversion

Each image in a request to a foundation model adds to your token usage. See the pricing calculator to estimate image pricing based on the token usage and model you are using.

Limitations of image understanding

The following are image understanding limitations for the supported Databricks-hosted foundation models:

The following Claude models are supported:

databricks-claude-opus-4-8
databricks-claude-opus-4-7
databricks-claude-opus-4-6
databricks-claude-opus-4-5
databricks-claude-opus-4-1
databricks-claude-sonnet-4-6
databricks-claude-sonnet-4-5
databricks-claude-sonnet-4

The following are the limits for Claude models on Databricks:

Avoid using Claude for tasks requiring perfect precision or sensitive analysis without human oversight.
People identification: Cannot identify or name people in images.
Accuracy: May misinterpret low-quality, rotated, or very small images (under 200 px).
Spatial reasoning: Struggles with precise layouts, such as reading analog clocks or chess positions.
Counting: Provides approximate counts, but may be inaccurate for many small objects.
AI-generated images: Cannot reliably detect synthetic or fake images.
Inappropriate content: Blocks explicit or policy-violating images.
Healthcare: Not suited for complex medical scans (for example, CTs and MRIs). It's not a diagnostic tool.

Audio and video

Send audio and video inputs to Gemini foundation models served by Unity AI Gateway on Databricks using the Chat Completions API or the Google Gemini API.

You can provide audio and video inputs using two methods:

URL: Pass a publicly accessible URL to the media file. For video, YouTube URLs are also supported.
Base64 inline data: Encode the file as a base64 string and pass it as a data URI (for example, data:video/mp4;base64,<encoded_data>).

Chat Completions API

The Chat Completions API accepts video and audio input. Use the video_url and audio_url content types in the messages array to pass media inputs. Each content item includes a url field that accepts either a web URL or a base64 data URI.

The following examples show video and audio input using the Chat Completions API.

Python
REST API

Python
import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what happens in these videos."},
            {
                "type": "video_url",
                "video_url": {"url": "https://example.com/sample-video.mp4"}
            },
            {
                "type": "video_url",
                "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)

Python
import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio and summarize the key points."},
            {
                "type": "audio_url",
                "audio_url": {"url": "https://example.com/sample-audio.mp3"}
            },
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)

Bash
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.gemini-3-1-pro",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Summarize what happens in these videos."},
      {
        "type": "video_url",
        "video_url": {"url": "https://example.com/sample-video.mp4"}
      },
      {
        "type": "video_url",
        "video_url": {"url": "data:video/mp4;base64,<base64_encoded_data>"}
      }
    ]
  }],
  "max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions

Bash
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "system.ai.gemini-3-1-pro",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Transcribe this audio and summarize the key points."},
      {
        "type": "audio_url",
        "audio_url": {"url": "https://example.com/sample-audio.mp3"}
      },
      {
        "type": "audio_url",
        "audio_url": {"url": "data:audio/mp3;base64,<base64_encoded_data>"}
      }
    ]
  }],
  "max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
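Because each content item accepts either a web URL or a base64 data URI in the same `url` field, one helper can cover both methods. The following is a sketch with a hypothetical name, `media_part`; it treats `bytes` input as inline data and anything else as a URL.

```python
import base64


def media_part(kind: str, source, mime_type: str = "") -> dict:
    """Build a video_url or audio_url content item from a URL or raw bytes."""
    if kind not in ("video", "audio"):
        raise ValueError("kind must be 'video' or 'audio'")
    if isinstance(source, bytes):
        if not mime_type:
            raise ValueError("mime_type is required for inline data")
        encoded = base64.standard_b64encode(source).decode("utf-8")
        url = f"data:{mime_type};base64,{encoded}"
    else:
        url = source  # assumed to be a reachable web URL
    key = f"{kind}_url"
    return {"type": key, key: {"url": url}}
```

For example, `media_part("audio", "https://example.com/sample-audio.mp3")` produces the first audio item used in the requests above.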

Google Gemini API

Use the Google Gemini API to pass media as inlineData (base64-encoded) or fileData (URL reference) within the parts array.

The following examples show video and audio input using the Google Gemini API.

Python
REST API

Python
from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://<workspace-url>/ai-gateway/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="system.ai.gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Summarize what happens in these videos."),
                types.Part(
                    file_data=types.FileData(
                        mime_type="video/mp4",
                        file_uri="https://example.com/sample-video.mp4",
                    )
                ),
                types.Part(
                    inline_data=types.Blob(
                        mime_type="video/mp4",
                        data=video_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)

Python
from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://<workspace-url>/ai-gateway/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="system.ai.gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Transcribe this audio and summarize the key points."),
                types.Part(
                    file_data=types.FileData(
                        mime_type="audio/mp3",
                        file_uri="https://example.com/sample-audio.mp3",
                    )
                ),
                types.Part(
                    inline_data=types.Blob(
                        mime_type="audio/mp3",
                        data=audio_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)

Bash
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "contents": [{
    "role": "user",
    "parts": [
      {"text": "Summarize what happens in these videos."},
      {
        "fileData": {
          "mimeType": "video/mp4",
          "fileUri": "https://example.com/sample-video.mp4"
        }
      },
      {
        "inlineData": {
          "mimeType": "video/mp4",
          "data": "<base64_encoded_data>"
        }
      }
    ]
  }]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent

Bash
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "contents": [{
    "role": "user",
    "parts": [
      {"text": "Transcribe this audio and summarize the key points."},
      {
        "fileData": {
          "mimeType": "audio/mp3",
          "fileUri": "https://example.com/sample-audio.mp3"
        }
      },
      {
        "inlineData": {
          "mimeType": "audio/mp3",
          "data": "<base64_encoded_data>"
        }
      }
    ]
  }]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
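The JSON bodies in the curl commands above can also be generated rather than written by hand. This sketch uses hypothetical helper names and mirrors the `fileData` and `inlineData` shapes shown in those requests.

```python
import base64


def gemini_media_part(source, mime_type: str) -> dict:
    """Return an inlineData part for bytes, or a fileData part for a URL."""
    if isinstance(source, bytes):
        data = base64.standard_b64encode(source).decode("utf-8")
        return {"inlineData": {"mimeType": mime_type, "data": data}}
    return {"fileData": {"mimeType": mime_type, "fileUri": source}}


def gemini_request_body(prompt: str, media_parts: list) -> dict:
    """Assemble a request body: one user turn with text followed by media."""
    return {
        "contents": [
            {"role": "user", "parts": [{"text": prompt}, *media_parts]}
        ]
    }
```

Serialize the result with `json.dumps` to use it as the request payload.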

Limitations

Multiple audio or video inputs can be included in a single request, but large files increase latency and token usage.
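Inline data adds its own overhead, because base64 encodes every 3 bytes as 4 characters. The following sketch (the function name is illustrative) computes the encoded size of a file before you decide between a URL and inline data.

```python
import base64


def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of padded standard base64 for raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


# Cross-check the formula against the real encoder for a few sizes.
for n in (0, 1, 2, 3, 10, 1000):
    assert base64_encoded_size(n) == len(base64.standard_b64encode(b"x" * n))

# A 30 MB (30,000,000-byte) file grows by roughly one third when inlined.
print(base64_encoded_size(30_000_000))
```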

Reasoning

Reasoning models are foundation models optimized for reasoning tasks. The Databricks Foundation Model API provides a unified API to interact with all foundation models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.

Types of reasoning models

There are two types of models, reasoning-only and hybrid. The following table describes how different models use different approaches to control reasoning:

Models	Reasoning model type	Details	Parameters
GPT-5 models like `databricks-gpt-5-5-pro`, `databricks-gpt-5-5`, `databricks-gpt-5-4`, `databricks-gpt-5-4-mini`, `databricks-gpt-5-4-nano`, `databricks-gpt-5-2`, `databricks-gpt-5-1`, `databricks-gpt-5`, `databricks-gpt-5-mini` and `databricks-gpt-5-nano`.	Reasoning only	These models always use internal reasoning in their responses.	Use the following parameter in your request: `reasoning_effort`: This parameter is only accepted by a limited set of models. Higher reasoning effort may result in more thoughtful and accurate responses but may increase latency and token usage. For GPT-5.5 and GPT-5.5 Pro, the `reasoning_effort` parameter is set to `medium` by default, but can be overridden in requests. For GPT-5.1 and GPT-5.2, the `reasoning_effort` parameter is set to `none` by default, but can be overridden in requests. For GPT-5, GPT-5 mini, and GPT-5 nano, the `reasoning_effort` parameter is set to `minimal` by default, but can be overridden in requests.
Claude models like `databricks-claude-sonnet-4-6`, `databricks-claude-sonnet-4-5`, `databricks-claude-sonnet-4`, `databricks-claude-opus-4-8`, `databricks-claude-opus-4-7`, `databricks-claude-opus-4-6`, `databricks-claude-opus-4-5`, and `databricks-claude-opus-4-1`.	Hybrid reasoning	These models support both fast, instant replies and deeper reasoning when needed.	Include the following parameters to use hybrid reasoning: `thinking` `budget_tokens`: controls how many tokens the model can use for internal thought. Higher budgets can improve quality for complex tasks, but usage above 32K may vary. `budget_tokens` must be less than `max_tokens`.
Gemini 3 models like `databricks-gemini-3-5-flash`, `databricks-gemini-3-1-pro`, `databricks-gemini-3-1-flash-lite`, `databricks-gemini-3-pro`, and `databricks-gemini-3-flash`	Hybrid reasoning	These models support both fast, instant replies and deeper reasoning when needed.	Include the following parameters to use hybrid reasoning: `reasoning_effort`: This parameter is accepted by Gemini 3 models and higher. For Gemini 3 models, this parameter accepts values of `"low"` (default), `"medium"`, or `"high"`.
Gemini 2.5 models like `databricks-gemini-2-5-pro` and `databricks-gemini-2-5-flash`.	Hybrid reasoning	These models support both fast, instant replies and deeper reasoning when needed.	Include the following parameters to use hybrid reasoning: `thinking` `budget_tokens`: controls how many tokens the model can use for internal thought. Higher budgets can improve quality for complex tasks, but usage above 32K may vary. `budget_tokens` must be less than `max_tokens`.
GPT OSS models like `databricks-gpt-oss-120b` and `databricks-gpt-oss-20b`.	Reasoning only	These models always use internal reasoning in their responses.	Use the following parameter in your request: `reasoning_effort`: This parameter is only accepted by a limited set of models. Higher reasoning effort may result in more thoughtful and accurate responses but may increase latency and token usage. For GPT OSS models, this parameter accepts values of `"low"`, `"medium"` (default), or `"high"`.
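The parameter rules in the table can be enforced on the client before a request is sent. The sketch below is illustrative, with hypothetical helper names and family keys. It covers only what the table states explicitly: the `budget_tokens` constraint, and the `reasoning_effort` values listed for GPT OSS and Gemini 3 models.

```python
def thinking_body(max_tokens: int, budget_tokens: int) -> dict:
    """Build the thinking parameter, enforcing budget_tokens < max_tokens."""
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")
    return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}


# Values the table lists for these two families (keys are hypothetical labels).
EFFORT_VALUES = {
    "gpt-oss": ("low", "medium", "high"),
    "gemini-3": ("low", "medium", "high"),
}


def reasoning_effort_body(family: str, effort: str) -> dict:
    """Build the reasoning_effort parameter after validating the value."""
    allowed = EFFORT_VALUES.get(family)
    if allowed is None:
        raise ValueError(f"no documented values for family {family!r}")
    if effort not in allowed:
        raise ValueError(f"{effort!r} is not one of {allowed}")
    return {"reasoning_effort": effort}
```

With the OpenAI client, the dictionary from `thinking_body` can be passed as `extra_body`, and the `reasoning_effort` value can be added to the request body.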

Query examples

All reasoning models are accessed through the chat completions endpoint.

Claude model example
GPT-5.1
GPT OSS model example
Gemini model example

Python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
    base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL'),
)

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

msg = response.choices[0].message
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]

print("Reasoning:", reasoning)
print("Answer:", answer)

The reasoning_effort parameter for GPT-5.1 is set to none by default, but can be overridden in requests. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.

Bash
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-5-1",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "none"
  }'

The reasoning_effort parameter accepts "low", "medium" (default), or "high" values. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.

Bash
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "high"
  }'

This example uses system.ai.gemini-3-1-pro. The reasoning_effort parameter is set to "low" by default, but can be overridden in requests as seen in the following example.

Bash
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gemini-3-1-pro",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 2000,
    "stream": true,
    "reasoning_effort": "high"
  }'

The API response includes both reasoning and text content blocks:

Python
ChatCompletionMessage(
    role="assistant",
    content=[
        {
            "type": "reasoning",
            "summary": [
                {
                    "type": "summary_text",
                    "text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
                    "signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
                }
            ]
        },
        {
            "type": "text",
            "text": (
                "# Why the Sky Is Blue\n\n"
                "The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
            )
        }
    ],
    refusal=None,
    annotations=None,
    audio=None,
    function_call=None,
    tool_calls=None
)
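
Because the message content is a list of typed blocks, indexing by position (`content[0]`, `content[1]`) assumes the reasoning block always comes first. A minimal sketch that selects blocks by their `type` field instead is shown below. `split_reasoning_and_text` is a hypothetical helper, and the sample data is a shortened stand-in for the response shape shown above, not real model output.

```python
def split_reasoning_and_text(content):
    """Return (reasoning, answer) strings from a message's content (hypothetical helper)."""
    # Some responses carry plain string content with no reasoning blocks.
    if isinstance(content, str):
        return "", content

    reasoning_parts, text_parts = [], []
    for block in content:
        if block.get("type") == "reasoning":
            # Each reasoning block holds a list of summary entries.
            for item in block.get("summary", []):
                reasoning_parts.append(item.get("text", ""))
        elif block.get("type") == "text":
            text_parts.append(block.get("text", ""))
    return "\n".join(reasoning_parts), "\n".join(text_parts)


# Shortened sample in the same shape as the response above.
sample_content = [
    {
        "type": "reasoning",
        "summary": [
            {"type": "summary_text", "text": "The question asks why the sky appears blue."}
        ],
    },
    {"type": "text", "text": "Rayleigh scattering."},
]

reasoning, answer = split_reasoning_and_text(sample_content)
print("Reasoning:", reasoning)
print("Answer:", answer)
```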

Manage reasoning across multiple turns

This section is specific to the databricks-claude-sonnet-4-5 model.

In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.

If you don't want to pass reasoning tokens back to the model (for example, you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:

Python
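# text_content is the text block of the previous assistant reply
# (the `answer` value from the Claude example above); the reasoning block is left out.
text_content = answer

response = client.chat.completions.create(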
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
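
The request above passes only the text of the earlier assistant reply. The following sketch shows one way to build such a history from a previous response's content list. `text_only` is a hypothetical helper, and `previous_content` is made-up sample data in the block shape shown earlier in this article.

```python
def text_only(content):
    """Join the text blocks of an assistant message, skipping reasoning blocks."""
    if isinstance(content, str):
        return content
    return "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )


# Sample content from a previous assistant turn (illustrative only).
previous_content = [
    {
        "type": "reasoning",
        "summary": [{"type": "summary_text", "text": "Thinking about scattering..."}],
    },
    {"type": "text", "text": "The sky is blue because of Rayleigh scattering."},
]

text_content = text_only(previous_content)

# History for the next request: the reasoning block is omitted.
messages = [
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": text_content},
    {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
]
print(messages[1]["content"])
```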

However, if you need the model to reason over its previous reasoning process (for instance, if you're building experiences that surface its intermediate reasoning), you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:

Python
assistant_message = response.choices[0].message

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
        assistant_message,
        {"role": "user", "content": "Can you simplify the previous answer?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

How does a reasoning model work?

Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like databricks-claude-sonnet-4-5, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.

Supported models

See Discover foundation models for the available foundation models and the interaction types each supports, including chat, vision, audio and video, and reasoning.
