Query a chat model
A new Unity AI Gateway experience is available in Beta. The new Unity AI Gateway is the enterprise control plane for governing LLM endpoints and coding agents with enhanced features. See Unity AI Gateway.
In this article, you learn how to write query requests for foundation models that are optimized for chat and general purpose tasks and served by Unity AI Gateway.
Genie Code (Agent mode) can do this for you. Try this example prompt:
Query the databricks-claude-sonnet-4-5 chat model using the OpenAI client. Send a system prompt and a user question, and print the response.
The examples in this article apply to querying foundation models that are made available using either:
- Foundation Models APIs which are referred to as Databricks-hosted foundation models.
- External models which are referred to as foundation models hosted outside of Databricks.
Requirements
- See Requirements.
- Install the appropriate package to your cluster based on the querying client option you choose.
Query examples
The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Databricks-hosted foundation models available in Foundation Model APIs for a list of available foundation models and their model service and endpoint names.
The examples in this section show how to query a Foundation Model API pay-per-token model service using the different client options.
For a batch inference example, see Enrich data using AI Functions.
- OpenAI Chat Completions
- OpenAI Responses
- SQL
- REST API
- MLflow Deployments SDK
- Databricks Python SDK
- LangChain
To use the OpenAI client, specify the model service name as the model input.
from databricks_openai import DatabricksOpenAI
client = DatabricksOpenAI()
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_tokens=256
)
To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and openai installed on your compute.
import os
import openai
from openai import OpenAI
client = OpenAI(
api_key="dapi-your-databricks-token",
base_url="https://example.staging.cloud.databricks.com/ai-gateway/mlflow/v1"
)
response = client.chat.completions.create(
model="system.ai.claude-sonnet-4-5",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_tokens=256
)
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
This section covers the OpenAI Responses API, a native passthrough that supports the full set of OpenAI Responses parameters for OpenAI models. To use the Responses request format with Anthropic Claude, Google Gemini, or Databricks-hosted open models, see Query a model with the Open Responses API.
To use the OpenAI Responses API, specify the model service name as the model input.
from databricks_openai import DatabricksOpenAI
client = DatabricksOpenAI()
response = client.responses.create(
model="system.ai.gpt-5",
input=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_output_tokens=256
)
To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and openai installed on your compute.
import os
import openai
from openai import OpenAI
client = OpenAI(
api_key="dapi-your-databricks-token",
base_url="https://example.staging.cloud.databricks.com/ai-gateway/mlflow/v1"
)
response = client.responses.create(
model="system.ai.gpt-5",
input=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is a mixture of experts model?",
}
],
max_output_tokens=256
)
As an example, the following is the expected request format when using the OpenAI Responses API. The URL path for this API is /serving-endpoints/responses.
{
"model": "databricks-gpt-5",
"input": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_output_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the Responses API:
{
"id": "resp_abc123",
"object": "response",
"created_at": 1698824353,
"model": "databricks-gpt-5",
"output": [
{
"type": "message",
"role": "assistant",
"content": []
}
],
"usage": {
"input_tokens": 7,
"output_tokens": 74,
"total_tokens": 81
}
}
The following example uses the built-in SQL function, ai_query. This function is in Public Preview and the definition might change.
SELECT ai_query(
"system.ai.claude-sonnet-4-5",
"Can you explain AI in ten words?"
)
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
The following example uses REST API parameters for querying serving endpoints that serve foundation models. These parameters are in Public Preview and the definition might change. See POST /serving-endpoints/{name}/invocations.
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "system.ai.claude-sonnet-4-5",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": " What is a mixture of experts model?"
}
]
}' \
https://<workspace_host>.databricks.com/ai-gateway/mlflow/v1/chat/completions \
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
The following example uses the predict() API from the MLflow Deployments SDK.
import mlflow.deployments
# Only required when running this example outside of a Databricks Notebook
export DATABRICKS_HOST="https://<workspace_host>.databricks.com"
export DATABRICKS_TOKEN="dapi-your-databricks-token"
client = mlflow.deployments.get_deploy_client("databricks")
chat_response = client.predict(
endpoint="system.ai.claude-sonnet-4-5",
inputs={
"messages": [
{
"role": "user",
"content": "Hello!"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content": "What is a mixture of experts model??"
}
],
"temperature": 0.1,
"max_tokens": 20
}
)
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from a Databricks notebook.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole
w = WorkspaceClient()
response = w.serving_endpoints.query(
name="system.ai.claude-sonnet-4-5",
messages=[
ChatMessage(
role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
),
ChatMessage(
role=ChatMessageRole.USER, content="What is a mixture of experts model?"
),
],
max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
To query a model service using LangChain, you can use the ChatDatabricks ChatModel class and specify the model.
%pip install databricks-langchain
from langchain_core.messages import HumanMessage, SystemMessage
from databricks_langchain import ChatDatabricks
messages = [
SystemMessage(content="You're a helpful assistant"),
HumanMessage(content="What is a mixture of experts model?"),
]
llm = ChatDatabricks(model="system.ai.claude-sonnet-4-5")
llm.invoke(messages)
As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.
{
"messages": [
{
"role": "user",
"content": "What is a mixture of experts model?"
}
],
"max_tokens": 100,
"temperature": 0.1
}
The following is an expected response format for a request made using the REST API:
{
"model": "databricks-claude-sonnet-4-5",
"choices": [
{
"message": {},
"index": 0,
"finish_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 74,
"total_tokens": 81
},
"object": "chat.completion",
"id": null,
"created": 1698824353
}
Supported models
See Foundation model types for supported chat models.