Get started querying LLMs on Databricks

This article describes how to get started using Foundation Model APIs to serve and query LLMs on Databricks.

The easiest way to get started with serving and querying LLMs on Databricks is to use Foundation Model APIs on a pay-per-token basis. These APIs provide access to popular foundation models from pay-per-token endpoints that are automatically available in the Serving UI of your Databricks workspace. See Supported models for pay-per-token.

You can also test out and chat with pay-per-token models using the AI Playground. See Chat with LLMs and prototype GenAI apps using AI Playground.

For production workloads, particularly those with a fine-tuned model or that require performance guarantees, Databricks recommends using Foundation Model APIs on a provisioned throughput endpoint.

Requirements

Important

As a security best practice, Databricks recommends that you use machine-to-machine OAuth tokens for authentication in production scenarios.

For testing and development, Databricks recommends using a personal access token that belongs to a service principal instead of a workspace user. To create tokens for service principals, see Manage tokens for a service principal.
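The code examples in this article read the token from an environment variable. A minimal sketch of setting it in a shell session (the token value shown is hypothetical; substitute the token created for your service principal):

```shell
# Hypothetical token value; replace with the token created for your service principal.
export DATABRICKS_TOKEN="dapi0123456789abcdef"
```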

Get started using Foundation Model APIs

The following example is meant to be run in a Databricks notebook. The code example queries the Meta Llama 3.1 405B Instruct model that’s served on the pay-per-token endpoint databricks-meta-llama-3-1-405b-instruct.

In this example, you use the OpenAI client to query the model by populating the model field with the name of the model serving endpoint that hosts the model you want to query. Use your personal access token to populate DATABRICKS_TOKEN, and use your Databricks workspace instance URL to connect the OpenAI client to Databricks.

from openai import OpenAI
import os

DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

client = OpenAI(
  api_key=DATABRICKS_TOKEN, # your personal access token
  base_url='https://<workspace_id>.databricks.com/serving-endpoints', # your Databricks workspace instance
)

chat_completion = client.chat.completions.create(
  messages=[
    {
      "role": "system",
      "content": "You are an AI assistant",
    },
    {
      "role": "user",
      "content": "What is a mixture of experts model?",
    }
  ],
  model="databricks-meta-llama-3-1-405b-instruct",
  max_tokens=256
)

print(chat_completion.choices[0].message.content)

Note

If you encounter the message ImportError: cannot import name 'OpenAI' from 'openai', upgrade the openai package by running !pip install -U openai. After the package installs, run dbutils.library.restartPython().

Expected output:

{
  "id": "xxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": "xxxxxxxxx",
  "model": "databricks-meta-llama-3-1-405b-instruct",
  "choices": [
    {
      "index": 0,
      "message":
        {
          "role": "assistant",
          "content": "A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance. Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input."
        },
      "finish_reason": "stop"
    }
  ],
  "usage":
    {
      "prompt_tokens": 123,
      "completion_tokens": 23,
      "total_tokens": 146
    }
}
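Token counts in the usage field determine pay-per-token billing, so it can be useful to log them. The sketch below reads the answer and usage from a dictionary shaped like the sample output above; note that the OpenAI client returns an object, so in the notebook example you would access chat_completion.usage.total_tokens rather than index a dictionary:

```python
# A dictionary shaped like the sample response shown above.
response = {
    "model": "databricks-meta-llama-3-1-405b-instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "A Mixture of Experts (MoE) model is a machine learning technique ...",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 123, "completion_tokens": 23, "total_tokens": 146},
}

# The generated answer lives under choices[0].message.content.
answer = response["choices"][0]["message"]["content"]

# total_tokens is the sum of prompt and completion tokens.
usage = response["usage"]
print(f"Answer: {answer}")
print(f"Tokens billed: {usage['total_tokens']} "
      f"({usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion)")
```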

Next steps