Get started querying LLMs on Databricks

This article describes how to get started using Foundation Model APIs to serve and query LLMs on Databricks.

The easiest way to get started with serving and querying LLM models on Databricks is using Foundation Model APIs on a pay-per-token basis. The APIs provide access to popular foundation models from pay-per-token endpoints that are automatically available in the Serving UI of your Databricks workspace. See Supported models for pay-per-token.

You can also test out and chat with pay-per-token models using the AI Playground. See Chat with supported LLMs using AI Playground.

For production workloads, particularly if you have a fine-tuned model or a workload that requires performance guarantees, Databricks recommends you upgrade to using Foundation Model APIs on a provisioned throughput endpoint.

Requirements

Databricks workspace in a supported region for Foundation Model APIs pay-per-token.
Databricks personal access token to query and access Databricks model serving endpoints using the OpenAI client.

Important

As a security best practice for production scenarios, Databricks recommends that you use machine-to-machine OAuth tokens for authentication during production.

For testing and development, Databricks recommends using a personal access token belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

Get started using Foundation Model APIs

The following example queries the databricks-dbrx-instruct model that’s served on the pay-per-token endpoint,databricks-dbrx-instruct. Learn more about the DBRX Instruct model.

In this example, you use the OpenAI client to query the model by populating the model field with the name of the model serving endpoint that hosts the model you want to query. Use your personal access token to populate the DATABRICKS_TOKEN and your Databricks workspace instance to connect the OpenAI client to Databricks.

from openai import OpenAI
import os

DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

client = OpenAI(
  api_key=DATABRICKS_TOKEN, # your personal access token
  base_url='https://<workspace_id>.databricks.com/serving-endpoints', # your Databricks workspace instance
)

chat_completion = client.chat.completions.create(
  messages=[
    {
      "role": "system",
      "content": "You are an AI assistant",
    },
    {
      "role": "user",
      "content": "What is a mixture of experts model?",
    }
  ],
  model="databricks-dbrx-instruct",
  max_tokens=256
)

print(chat_completion.choices[0].message.content)

Expected output:

{
  "id": "xxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": "xxxxxxxxx",
  "model": "databricks-dbrx-instruct",
  "choices": [
    {
      "index": 0,
      "message":
        {
          "role": "assistant",
          "content": "A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance. Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input."
        },
      "finish_reason": "stop"
    }
  ],
  "usage":
    {
      "prompt_tokens": 123,
      "completion_tokens": 23,
      "total_tokens": 146
    }
}

Next steps

Use the AI playground to try out different models in a familiar chat interface.
Query foundation models.
Access models hosted outside of Databricks using external models.
Learn how to deploy fine-tuned models using provisioned throughput endpoints.
Explore methods to monitor model quality and endpoint health.