Query foundation models by type
This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Databricks previews.
In this article, you learn how to write query requests for Databricks-hosted foundation models served by model services in Unity AI Gateway, organized by model type: chat, vision, audio and video, and reasoning.
Requirements
- See Requirements.
- Install the appropriate package to your cluster based on the querying client option you choose.
The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Discover foundation models for a list of available foundation models and their model service and endpoint names.
Chat
Foundation models that are optimized for chat and general purpose tasks.
The examples in this section show how to query a model service using the different client options.
For a batch inference example, see Enrich data using AI Functions.
- OpenAI Chat Completions
- OpenAI Responses
- REST API
- Databricks Python SDK
- LangChain
To use the OpenAI client, specify the model service name as the model input.
from databricks_openai import DatabricksOpenAI
client = DatabricksOpenAI()
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_tokens=256
)
To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and openai installed on your compute.
import os
import openai
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get('DATABRICKS_TOKEN'),
base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_tokens=256
)
As an example, the following is the expected request format for a chat model when using the REST API.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
The Responses API is only compatible with OpenAI models.
To use the OpenAI Responses API, specify the model service name as the model input.
from databricks_openai import DatabricksOpenAI
client = DatabricksOpenAI()
response = client.responses.create(
model="system.ai.gpt-5",
input=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_output_tokens=256
)
To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and openai installed on your compute.
import os
import openai
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get('DATABRICKS_TOKEN'),
base_url="https://<workspace-url>/ai-gateway/mlflow/v1"
)
response = client.responses.create(
model="system.ai.gpt-5",
input=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_output_tokens=256
)
As an example, the following is the expected request format when using the OpenAI Responses API. The URL path for this API is /serving-endpoints/responses.
{
"model": "databricks-gpt-5",
"input": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_output_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the Responses API:
{
"id": "resp_abc123",
"object": "response",
"created_at": 1698824353,
"model": "databricks-gpt-5",
"output": [
{
"type": "message",
"role": "assistant",
"content": []
}
],
"usage": {
"input_tokens": 7,
"output_tokens": 74,
"total_tokens": 81
}
}
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.claude-sonnet-4-5",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": " What is a mixture of experts model?"
}
]
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
As an example, the following is the expected request format for a chat model when using the REST API.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from a Databricks notebook.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole
w = WorkspaceClient()
response = w.serving_endpoints.query(
name="system.ai.claude-sonnet-4-5",
messages=[
ChatMessage(
role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
),
ChatMessage(
role=ChatMessageRole.USER, content="What is a mixture of experts model?"
),
],
max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")
As an example, the following is the expected request format for a chat model when using the REST API.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
To query a model service using LangChain, you can use the ChatDatabricks ChatModel class and specify the model.
%pip install databricks-langchain
from langchain_core.messages import HumanMessage, SystemMessage
from databricks_langchain import ChatDatabricks
messages = [
SystemMessage(content="You're a helpful assistant"),
HumanMessage(content="What is a mixture of experts model?"),
]
llm = ChatDatabricks(model="system.ai.claude-sonnet-4-5")
llm.invoke(messages)
As an example, the following is the expected request format for a chat model when using the REST API.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
Vision
Query Databricks-hosted vision models through model services in Unity AI Gateway to understand and analyze images with a unified API.
- OpenAI client
To use the OpenAI client, specify the model service name as the model input.
from openai import OpenAI
import base64
import requests
# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
client = OpenAI(
api_key=API_TOKEN,
base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)
# Download and encode image
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
resp = requests.get(image_url)
resp.raise_for_status()
image_data = base64.b64encode(resp.content).decode("utf-8")
# OpenAI request
completion = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "what's in this image?"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
},
],
}
],
)
print(completion.choices[0].message.content)
The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.
from openai import OpenAI
import base64
import requests
# Get the workspace API URL and token from the notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
client = OpenAI(
api_key=API_TOKEN,
base_url=f"{API_ROOT}/ai-gateway/mlflow/v1",
)
# Download and encode multiple images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp1 = requests.get(image1_url)
resp1.raise_for_status()
image1_data = base64.b64encode(resp1.content).decode("utf-8")
image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
resp2 = requests.get(image2_url)
resp2.raise_for_status()
image2_data = base64.b64encode(resp2.content).decode("utf-8")
# OpenAI request
completion = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What are in these images? Is there any difference between them?"},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
},
],
}
],
)
print(completion.choices[0].message.content)
Input image requirements
Model | Supported formats | Multiple images per request | Image size limitations | Image resizing recommendations | Image quality considerations |
|---|---|---|---|---|---|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 500 individual image inputs per request | File size limit: Up to 10 MB total payload size per request | N/A |
|
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 50 images for API requests. All provided images are processed in a request. | File size limit: 7 MB each image | N/A | N/A |
|
| Up to 5 images for API requests
| File size limit: 10 MB total across all images per API request | N/A | N/A |
|
| Up to 5 images for API requests
| File size limit: 10 MB total across all images per API request | N/A | N/A |
|
|
|
| For optimal performance, resize images before uploading if they are too large.
|
|
Image to token conversion
Each image in a request to a foundation model adds to your token usage. See the pricing calculator to estimate image pricing based on the token usage and model you are using.
Limitations of image understanding
The following are image understanding limitations for the supported Databricks-hosted foundation models:
Model | Limitations |
|---|---|
The following Claude models are supported:
| The following are the limits for Claude models on Databricks:
|
Audio and video
Send audio and video inputs to Gemini foundation models served by Unity AI Gateway on Databricks. You can provide media as a URL or as base64-encoded inline data using the Chat Completions API or the Google Gemini API.
You can provide audio and video inputs using two methods:
- URL: Pass a publicly accessible URL to the media file. For video, YouTube URLs are also supported.
- Base64 inline data: Encode the file as a base64 string and pass it as a data URI (for example,
data:video/mp4;base64,<encoded_data>).
Chat Completions API
The chat completions API allows you to pass video and audio input. Use the video_url and audio_url content types in the messages array to pass media inputs. Each content item includes a url field that accepts either a web URL or a base64 data URI.
The following examples show video and audio input using the Chat Completions API.
- Python
- REST API
import os
import base64
from openai import OpenAI
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')
client = OpenAI(
api_key=DATABRICKS_TOKEN,
base_url=DATABRICKS_BASE_URL
)
# Encode a local video file as base64
with open("video.mp4", "rb") as f:
video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="system.ai.gemini-3-1-pro",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Summarize what happens in these videos."},
{
"type": "video_url",
"video_url": {"url": "https://example.com/sample-video.mp4"}
},
{
"type": "video_url",
"video_url": {"url": f"data:video/mp4;base64,{video_b64}"}
},
]
}],
max_tokens=1024
)
print(response.choices[0].message.content)
import os
import base64
from openai import OpenAI
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')
client = OpenAI(
api_key=DATABRICKS_TOKEN,
base_url=DATABRICKS_BASE_URL
)
# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="system.ai.gemini-3-1-pro",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio and summarize the key points."},
{
"type": "audio_url",
"audio_url": {"url": "https://example.com/sample-audio.mp3"}
},
{
"type": "audio_url",
"audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}"}
},
]
}],
max_tokens=1024
)
print(response.choices[0].message.content)
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Summarize what happens in these videos."},
{
"type": "video_url",
"video_url": {"url": "https://example.com/sample-video.mp4"}
},
{
"type": "video_url",
"video_url": {"url": "data:video/mp4;base64,<base64_encoded_data>"}
}
]
}],
"max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio and summarize the key points."},
{
"type": "audio_url",
"audio_url": {"url": "https://example.com/sample-audio.mp3"}
},
{
"type": "audio_url",
"audio_url": {"url": "data:audio/mp3;base64,<base64_encoded_data>"}
}
]
}],
"max_tokens": 1024
}' \
https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions
Google Gemini API
Use the Google Gemini API to pass media as inlineData (base64-encoded) or fileData (URL reference) within the parts array.
The following examples show video and audio input using the Google Gemini API.
- Python
- REST API
from google import genai
from google.genai import types
import base64
import os
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
client = genai.Client(
api_key="databricks",
http_options=types.HttpOptions(
base_url="https://<workspace-url>/ai-gateway/gemini",
headers={
"Authorization": f"Bearer {DATABRICKS_TOKEN}",
},
),
)
# Encode a local video file as base64
with open("video.mp4", "rb") as f:
video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.models.generate_content(
model="system.ai.gemini-3-1-pro",
contents=[
types.Content(
role="user",
parts=[
types.Part(text="Summarize what happens in these videos."),
types.Part(
file_data=types.FileData(
mime_type="video/mp4",
file_uri="https://example.com/sample-video.mp4",
)
),
types.Part(
inline_data=types.Blob(
mime_type="video/mp4",
data=video_b64,
)
),
],
),
],
config=types.GenerateContentConfig(
max_output_tokens=1024,
),
)
print(response.text)
from google import genai
from google.genai import types
import base64
import os
DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
client = genai.Client(
api_key="databricks",
http_options=types.HttpOptions(
base_url="https://<workspace-url>/ai-gateway/gemini",
headers={
"Authorization": f"Bearer {DATABRICKS_TOKEN}",
},
),
)
# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.models.generate_content(
model="system.ai.gemini-3-1-pro",
contents=[
types.Content(
role="user",
parts=[
types.Part(text="Transcribe this audio and summarize the key points."),
types.Part(
file_data=types.FileData(
mime_type="audio/mp3",
file_uri="https://example.com/sample-audio.mp3",
)
),
types.Part(
inline_data=types.Blob(
mime_type="audio/mp3",
data=audio_b64,
)
),
],
),
],
config=types.GenerateContentConfig(
max_output_tokens=1024,
),
)
print(response.text)
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [
{"text": "Summarize what happens in these videos."},
{
"fileData": {
"mimeType": "video/mp4",
"fileUri": "https://example.com/sample-video.mp4"
}
},
{
"inlineData": {
"mimeType": "video/mp4",
"data": "<base64_encoded_data>"
}
}
]
}]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [
{"text": "Transcribe this audio and summarize the key points."},
{
"fileData": {
"mimeType": "audio/mp3",
"fileUri": "https://example.com/sample-audio.mp3"
}
},
{
"inlineData": {
"mimeType": "audio/mp3",
"data": "<base64_encoded_data>"
}
}
]
}]
}' \
https://<workspace-url>/ai-gateway/gemini/v1beta/models/system.ai.gemini-3-1-pro:generateContent
Limitations
- Multiple audio or video inputs can be included in a single request, but large files increase latency and token usage.
Reasoning
Foundation models optimized for reasoning tasks. Databricks Foundation Model API provides a unified API to interact with all Foundation Models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.
Types of reasoning models
There are two types of models, reasoning-only and hybrid. The following table describes how different models use different approaches to control reasoning:
Models | Reasoning model type | Details | Parameters |
|---|---|---|---|
GPT-5 models like | Reasoning only | These models always use internal reasoning in their responses. | Use the following parameter in your request:
|
Claude models like | Hybrid reasoning | These models support both fast, instant replies and deeper reasoning when needed. | Include the following parameters to use hybrid reasoning:
|
Gemini 3 models like | Hybrid reasoning | These models support both fast, instant replies and deeper reasoning when needed. | Include the following parameters to use hybrid reasoning:
|
Gemini 2.5 models like | Hybrid reasoning | These models support both fast, instant replies and deeper reasoning when needed. | Include the following parameters to use hybrid reasoning:
|
GPT OSS models like | Reasoning only | These models always use internal reasoning in their responses. | Use the following parameter in your request:
|
Query examples
All reasoning models are accessed through the chat completions endpoint.
- Claude model example
- GPT-5.1
- GPT OSS model example
- Gemini model example
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL')
)
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[{"role": "user", "content": "Why is the sky blue?"}],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
msg = response.choices[0].message
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]
print("Reasoning:", reasoning)
print("Answer:", answer)
The reasoning_effort parameter for GPT-5.1 is set to none by default, but can be overridden in requests. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gpt-5-1",
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 4096,
"reasoning_effort": "none"
}'
The reasoning_effort parameter accepts "low", "medium" (default), or "high" values. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 4096,
"reasoning_effort": "high"
}'
This example uses system.ai.gemini-3-1-pro. The reasoning_effort parameter is set to "low" by default, but can be overridden in requests as seen in the following example.
curl -X POST "https://<workspace-url>/ai-gateway/mlflow/v1/chat/completions" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.gemini-3-1-pro",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"max_tokens": 2000,
"stream": true,
"reasoning_effort": "high"
}'
The API response includes both thinking and text content blocks:
ChatCompletionMessage(
role="assistant",
content=[
{
"type": "reasoning",
"summary": [
{
"type": "summary_text",
"text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
"signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
}
]
},
{
"type": "text",
"text": (
"# Why the Sky Is Blue\n\n"
"The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
)
}
],
refusal=None,
annotations=None,
audio=None,
function_call=None,
tool_calls=None
)
Manage reasoning across multiple turns
This section is specific to the databricks-claude-sonnet-4-5 model.
In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.
If you don't want to pass reasoning tokens back to the model (for example, you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{"role": "user", "content": "Why is the sky blue?"},
{"role": "assistant", "content": text_content},
{"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
However, if you do need the model to reason over its previous reasoning process - for instance, if you're building experiences that surface its intermediate reasoning - you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:
assistant_message = response.choices[0].message
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{"role": "user", "content": "Why is the sky blue?"},
{"role": "assistant", "content": text_content},
{"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
assistant_message,
{"role": "user", "content": "Can you simplify the previous answer?"}
],
max_tokens=20480,
extra_body={
"thinking": {
"type": "enabled",
"budget_tokens": 10240
}
}
)
answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
How does a reasoning model work?
Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like databricks-claude-sonnet-4-5, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.
Supported models
See Discover foundation models for the available foundation models and the interaction types each supports, including chat, vision, audio and video, and reasoning.