How to query Foundation Model APIs with the Python SDK

Important

The Foundation Model APIs Python SDK is an Experimental feature and the API definition may change.

This article provides guidance on how to query Databricks Foundation Model APIs with the Python SDK. It includes installation instructions and query and response formats.

The Python SDK is a layer on top of the REST API. It handles low-level details, such as authentication and mapping model IDs to endpoint URLs, making it easier to interact with the models. The SDK is designed to be used from inside Databricks notebooks.

Requirements

See Requirements in the Foundation Model APIs documentation.

Install the Foundation Model APIs Python SDK

You can install the SDK on a cluster attached to a Databricks notebook or in your local environment. After the SDK is installed, you can use it to query models, as the following examples show.

Install the SDK in a Databricks notebook

You can install the SDK on the cluster attached to your Databricks notebook. Run the following commands in a notebook cell:

!pip install databricks-genai-inference

dbutils.library.restartPython()

Install the SDK on your local environment

If you are working outside of a Databricks notebook, you can install the SDK in your local environment. Run the following command in your terminal:

pip install databricks-genai-inference

Databricks native authentication is required to use this SDK.

To authenticate, first generate a personal access token for your application, then set the two environment variables DATABRICKS_HOST and DATABRICKS_TOKEN.

In the following commands, DATABRICKS_HOST is the Databricks host URL for your workspace. This URL typically starts with https:// and includes the workspace instance name. DATABRICKS_TOKEN is the personal access token value that you generated.

export DATABRICKS_HOST=<YOUR HOST NAME>
export DATABRICKS_TOKEN=<YOUR TOKEN>
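If you prefer to configure authentication from Python rather than your shell, a minimal sketch that sets the same variables in-process before calling the SDK (the placeholder values are illustrative):

import os

# Set Databricks native authentication variables for the current process.
# Replace the placeholders with your workspace URL and token value.
os.environ["DATABRICKS_HOST"] = "https://<workspace-instance-name>"
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"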

Query a chat completion model

To query the llama-2-70b-chat chat completion model, use ChatCompletion.create() to execute a model query. The ChatCompletion.create() function accepts the same arguments as the Chat task request API.

from databricks_genai_inference import ChatCompletion

response = ChatCompletion.create(model="llama-2-70b-chat",
                                 messages=[{"role": "system", "content": "You are a helpful assistant."},
                                           {"role": "user","content": "Knock knock."}],
                                 max_tokens=128)
print(f"response.message:{response.message}")

By default, create() returns a single response object (ChatCompletionObject) after the complete response has been generated. For a large model like llama-2-70b-chat, this can easily take more than 5 seconds.

For a more responsive experience, you can stream text fragments as they are generated by passing stream=True to create(). When streaming is enabled, create() returns a generator that yields a sequence of response fragments (ChatCompletionChunkObject).
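For example, the following sketch streams a chat completion and prints each fragment's message property as it arrives:

from databricks_genai_inference import ChatCompletion

# With stream=True, create() returns a generator of ChatCompletionChunkObject
# fragments; each fragment's message property holds the next piece of text.
for chunk in ChatCompletion.create(model="llama-2-70b-chat",
                                   messages=[{"role": "user", "content": "Knock knock."}],
                                   max_tokens=128,
                                   stream=True):
    print(chunk.message, end="")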

Both kinds of response objects have the same top-level properties:

Property | Type | Description
json | dict | Raw JSON response (see the Chat task API for details)
id | string | Unique request ID
model | string | Model name
message | string | Chat completion
usage | dict | Token usage metadata (cumulative for streaming)
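For example, after a non-streaming call you can read these properties directly off the response object from the earlier example:

# Inspect metadata on the ChatCompletionObject returned above.
print(f"request id: {response.id}")
print(f"model: {response.model}")
print(f"token usage: {response.usage}")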

Chat session

ChatSession is a high-level class for managing multi-round chat conversations. It provides the following functions:

Function | Return | Description
reply(string) | | Takes a new user message
last | string | Last message from the assistant
history | list of dict | Messages in the chat history, including roles
count | int | Number of chat rounds conducted so far

To initialize ChatSession, you use the same set of arguments as ChatCompletion, and those arguments are used throughout the chat session.


from databricks_genai_inference import ChatSession

chat = ChatSession(model="llama-2-70b-chat", system_message="You are a helpful assistant.", max_tokens=128)
chat.reply("Knock, knock!")
chat.last # return "Hello! Who's there?"
chat.reply("Guess who!")
chat.last # return "Okay, I'll play along! Is it a person, a place, or a thing?"

chat.history
# return: [
#     {'role': 'system', 'content': 'You are a helpful assistant.'},
#     {'role': 'user', 'content': 'Knock, knock.'},
#     {'role': 'assistant', 'content': "Hello! Who's there?"},
#     {'role': 'user', 'content': 'Guess who!'},
#     {'role': 'assistant', 'content': "Okay, I'll play along! Is it a person, a place, or a thing?"}
# ]

Query an embedding model

To query the bge-large-en embedding model, use Embedding.create() to execute a model query. The Embedding.create() function accepts the same arguments as the Embedding task request API.

The following example generates embeddings optimized for indexing.

from databricks_genai_inference import Embedding

response = Embedding.create(
    model="bge-large-en",
    input="3D ActionSLAM: wearable person tracking in multi-floor environments")
print(f'embeddings: {response.embeddings}')

You can pass multiple inputs into create() by setting input to a list of strings.

To optimize embeddings for query retrieval in a RAG application, add the parameter instruction="Represent this sentence for searching relevant passages:".
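The following sketch combines both options: it embeds a list of query strings (the example queries are illustrative) and passes the retrieval instruction:

from databricks_genai_inference import Embedding

# Embed multiple inputs in one call by passing a list of strings; the
# instruction parameter optimizes the embeddings for query retrieval.
response = Embedding.create(
    model="bge-large-en",
    input=["wearable person tracking", "multi-floor indoor localization"],
    instruction="Represent this sentence for searching relevant passages:")
print(f"embeddings: {response.embeddings}")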

The Embedding.create() function returns an EmbeddingObject with the following properties:

Property | Type | Description
json | dict | Raw JSON response (see the Embedding task API for details)
id | string | Unique request ID
model | string | Model name
embeddings | list of array | List of embeddings
usage | dict | Token usage metadata

Query a text completion model

To query the mpt-7b-instruct text completions model, use Completion.create() to execute a query. The Completion.create() function accepts the same arguments as the Completion task request API.

from databricks_genai_inference import Completion

response = Completion.create(
    model="mpt-7b-instruct",
    prompt="Write 3 reasons why you should train an AI model on domain specific data sets.",
    max_tokens=128)
print(f"response.text:{response.text:}")

Completion is similar to ChatCompletion: by default it waits for the complete response and returns it, but you can pass stream=True to retrieve response fragments as they are generated.
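For example, the following sketch streams a text completion. Because the text property is a list with one entry per completion, it prints the first (and here only) entry of each fragment:

from databricks_genai_inference import Completion

# With stream=True, create() returns a generator of response fragments;
# text is a list of completions, so print the first (and here only) one.
for chunk in Completion.create(
        model="mpt-7b-instruct",
        prompt="Write 3 reasons why you should train an AI model on domain specific data sets.",
        max_tokens=128,
        stream=True):
    print(chunk.text[0], end="")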

In both cases, the response objects have the following top-level properties:

Property | Type | Description
json | dict | Raw JSON response (see the Completion task API for details)
id | string | Unique request ID
model | string | Model name
text | list of string | List of text completions
usage | dict | Token usage metadata (cumulative for streaming)