Get started querying LLMs on Databricks
This article describes how to get started using Foundation Model APIs to serve and query LLMs on Databricks.
The easiest way to get started with serving and querying LLMs on Databricks is to use Foundation Model APIs on a pay-per-token basis. The APIs provide access to popular foundation models through pay-per-token endpoints that are automatically available in the Serving UI of your Databricks workspace. See Supported models for pay-per-token.
You can also test out and chat with pay-per-token models using the AI Playground. See Chat with LLMs and prototype GenAI apps using AI Playground.
For production workloads, particularly those that use a fine-tuned model or require performance guarantees, Databricks recommends using Foundation Model APIs on a provisioned throughput endpoint.
Requirements
A Databricks workspace in a supported region for Foundation Model APIs pay-per-token.
A Databricks personal access token to query and access Mosaic AI Model Serving endpoints using the OpenAI client.
Important
As a security best practice, Databricks recommends that you use machine-to-machine OAuth tokens for authentication in production scenarios.
For testing and development, Databricks recommends using personal access tokens that belong to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
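As a rough illustration of the machine-to-machine flow, the following sketch builds an OAuth client-credentials token request for a service principal without sending it. The /oidc/v1/token path, the all-apis scope, and the helper name build_token_request are assumptions made for this sketch; confirm them against the Databricks OAuth documentation before relying on them.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(workspace_host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request.

    Hypothetical helper: the endpoint path and scope below are assumptions.
    """
    credentials = f"{client_id}:{client_secret}".encode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=f"https://{workspace_host}/oidc/v1/token",  # assumed token endpoint
        data=body,
        headers={
            # HTTP Basic authentication with the service principal's credentials
            "Authorization": "Basic " + base64.b64encode(credentials).decode(),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


request = build_token_request("example.cloud.databricks.com", "my-client-id", "my-secret")
print(request.full_url)
```

If the request is sent with urllib.request.urlopen, the JSON response would be expected to carry the token in an access_token field, which could then be passed as the api_key of the OpenAI client in place of a personal access token.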
Get started using Foundation Model APIs
The following example is meant to be run in a Databricks notebook. The code example queries the Meta Llama 3.1 405B Instruct model that’s served on the pay-per-token endpoint databricks-meta-llama-3-1-405b-instruct.
In this example, you use the OpenAI client to query the model by populating the model field with the name of the model serving endpoint that hosts the model you want to query. Use your personal access token to populate DATABRICKS_TOKEN, and use your Databricks workspace instance in base_url to connect the OpenAI client to Databricks.
from openai import OpenAI
import os

DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

client = OpenAI(
    api_key=DATABRICKS_TOKEN,  # your personal access token
    base_url='https://<workspace_id>.databricks.com/serving-endpoints',  # your Databricks workspace instance
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant",
        },
        {
            "role": "user",
            "content": "What is a mixture of experts model?",
        }
    ],
    model="databricks-meta-llama-3-1-405b-instruct",
    max_tokens=256
)

print(chat_completion.choices[0].message.content)
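The client setup above can be wrapped in small helpers so that a missing token or a malformed workspace URL fails early with a clear message. The following is a minimal sketch, not part of the Databricks API: the helper names serving_base_url, load_token, and make_client are hypothetical, and the host used in the usage line is a placeholder.

```python
import os


def serving_base_url(workspace_host: str) -> str:
    """Return the serving-endpoints URL for a workspace host.

    Accepts the host with or without a scheme or a trailing slash.
    """
    host = workspace_host.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host + "/serving-endpoints"


def load_token(env=os.environ) -> str:
    """Read DATABRICKS_TOKEN and raise a clear error if it is unset or empty."""
    token = env.get("DATABRICKS_TOKEN")
    if not token:
        raise RuntimeError("Set the DATABRICKS_TOKEN environment variable first")
    return token


def make_client(workspace_host: str, env=os.environ):
    """Create an OpenAI client that points at Databricks model serving."""
    # Imported lazily so the helpers above can be used without the openai package.
    from openai import OpenAI

    return OpenAI(api_key=load_token(env), base_url=serving_base_url(workspace_host))


print(serving_base_url("example.cloud.databricks.com/"))
```

With these helpers, make_client("<your-workspace-host>") would return a client that can be used exactly like client in the example above.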
Note
If you encounter the following message ImportError: cannot import name 'OpenAI' from 'openai', upgrade your openai version using !pip install -U openai. After you install the package, run dbutils.library.restartPython().
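To see in advance whether the upgrade is needed, you can inspect the installed package version. The following sketch assumes that the OpenAI client class is available from openai version 1.0 onward; the helper names needs_upgrade and check_openai are hypothetical.

```python
from importlib import metadata


def needs_upgrade(installed_version: str) -> bool:
    """Return True if the version predates 1.0 (assumed first release with the OpenAI class)."""
    major = int(installed_version.split(".")[0])
    return major < 1


def check_openai() -> str:
    """Describe the state of the installed openai package."""
    try:
        version = metadata.version("openai")
    except metadata.PackageNotFoundError:
        return "openai is not installed"
    if needs_upgrade(version):
        return f"openai {version} is too old; upgrade it"
    return f"openai {version} supports 'from openai import OpenAI'"


print(check_openai())
```

With a compatible version installed, the query example above runs as written.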
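The print statement outputs only the text in the content field. The full response object has the following structure: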
{
  "id": "xxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": "xxxxxxxxx",
  "model": "databricks-meta-llama-3-1-405b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance. Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 23,
    "total_tokens": 146
  }
}
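The fields in the response above can be read programmatically. The following sketch parses a trimmed copy of the sample response with the standard json module and pulls out the message text, the token usage, and whether the reply was cut off. It assumes that a finish_reason of "length" indicates the max_tokens limit was reached, as in the OpenAI chat completions format; the function name summarize is hypothetical.

```python
import json

# Trimmed copy of the sample response above; the content text is shortened.
sample = """
{
  "model": "databricks-meta-llama-3-1-405b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "A Mixture of Experts (MoE) model is ..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 123, "completion_tokens": 23, "total_tokens": 146}
}
"""


def summarize(response: dict) -> dict:
    """Extract the reply text, a truncation flag, and the token total."""
    first = response["choices"][0]
    usage = response["usage"]
    return {
        "content": first["message"]["content"],
        # Assumption: "length" means the reply stopped at the max_tokens limit.
        "truncated": first["finish_reason"] == "length",
        "total_tokens": usage["prompt_tokens"] + usage["completion_tokens"],
    }


summary = summarize(json.loads(sample))
print(summary["total_tokens"])  # 146
print(summary["truncated"])     # False
```

When you use the OpenAI client, the response is an object rather than a dictionary; calling chat_completion.model_dump() should produce a dictionary that a function like summarize can consume.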
Next steps
Use the AI Playground to try out different models in a familiar chat interface.
Access models hosted outside of Databricks using external models.
Learn how to deploy fine-tuned models using provisioned throughput endpoints.
Explore methods to monitor model quality and endpoint health.