Optimized large language model (LLM) serving

Preview

This feature is in Public Preview.

Important

The code examples in this guide use deprecated APIs. Databricks recommends using the provisioned throughput experience for optimized inference of LLMs. See Migrate optimized LLM serving endpoints to provisioned throughput.

This article demonstrates how to enable optimizations for large language models (LLMs) on Databricks Model Serving.

Optimized LLM serving provides 3-5 times better throughput and latency compared to traditional serving approaches. The following table summarizes the supported LLM families and their variants.

Databricks recommends installing foundation models using Databricks Marketplace. You can search for a model family and, from the model page, select Get access and provide login credentials to install the model to Unity Catalog.

Model family    Install from Marketplace
Llama 2         Llama 2 Models
MPT             -
Mistral         Mistral models

Requirements

  • Optimized LLM serving is supported as part of the Public Preview of GPU deployments.

  • Your model must be logged using MLflow 2.4 or above, or Databricks Runtime 13.2 ML or above.

  • Databricks recommends using models in Unity Catalog for faster upload and download of large models.

  • When deploying models, it’s essential to match your model’s parameter size with the appropriate compute size. See the table below for recommendations, and the sketch after this list for one way to express the mapping in code. For models with 50 billion or more parameters, reach out to your Databricks account team to access the necessary GPUs.

    Model parameter size    Recommended compute size    workload_type
    7 billion               1xA10                       GPU_MEDIUM
    13 billion              4xA10                       MULTIGPU_MEDIUM
    30-34 billion           4xA10                       MULTIGPU_MEDIUM
    70 billion              8xA10 or 8xA100             GPU_MEDIUM_8 or GPU_LARGE_8
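
The recommendations in the table above can be restated as a small helper for reference. This is an illustrative sketch only; recommended_workload_type is a hypothetical function, not a Databricks API, and it simply encodes the table rows:

# Illustrative helper (hypothetical, not a Databricks API): map a parameter
# count in billions to the workload_type values recommended in the table above.
def recommended_workload_type(num_params_billions: float) -> str:
    if num_params_billions <= 7:
        return "GPU_MEDIUM"        # 1xA10
    if num_params_billions <= 34:
        return "MULTIGPU_MEDIUM"   # 4xA10
    if num_params_billions <= 70:
        return "GPU_MEDIUM_8"      # 8xA10; use GPU_LARGE_8 for 8xA100
    raise ValueError("Contact your Databricks account team for larger models.")

print(recommended_workload_type(13))  # MULTIGPU_MEDIUM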

Log your large language model

First, log your model with the MLflow transformers flavor and specify the task field in the MLflow metadata with metadata = {"task": "llm/v1/completions"}. This specifies the API signature used for the model serving endpoint.

Optimized LLM serving is compatible with the route types supported by Databricks AI Gateway; currently, llm/v1/completions is supported. If there is a model family or task type you want to serve that is not supported, reach out to your Databricks account team.

import mlflow
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")

with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        artifact_path="model",
        transformers_model=components,
        input_example=["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"],
        # The task metadata specifies the API signature for the serving endpoint
        metadata={"task": "llm/v1/completions"},
        registered_model_name='mpt'
    )

After your model is logged, you can register it in Unity Catalog. Set the MLflow registry URI to point to Unity Catalog, and pass the three-level name of the model as registered_model_name when logging, replacing CATALOG.SCHEMA.MODEL_NAME:

mlflow.set_registry_uri("databricks-uc")

registered_model_name = "CATALOG.SCHEMA.MODEL_NAME"
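
For example, here is a minimal sketch that logs the same MPT components from the previous step and registers them directly in Unity Catalog; the components dictionary is assumed from above and CATALOG.SCHEMA.MODEL_NAME is a placeholder:

import mlflow

# Point the MLflow registry at Unity Catalog before logging
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    mlflow.transformers.log_model(
        artifact_path="model",
        transformers_model=components,  # same model/tokenizer dictionary as above
        input_example=["What is Apache Spark?"],
        metadata={"task": "llm/v1/completions"},
        # Replace with your own three-level <catalog>.<schema>.<model> name
        registered_model_name="CATALOG.SCHEMA.MODEL_NAME",
    )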

Create your model serving endpoint

Next, create your model serving endpoint. If your model is supported by Optimized LLM serving, Databricks automatically creates an optimized model serving endpoint when you try to serve it.

import requests
import json

# Set the name of the MLflow endpoint
endpoint_name = "llama2-13b-chat"

# Name of the registered MLflow model
model_name = "ml.llm-catalog.llama-13b"

# Version of the registered MLflow model to serve
model_version = 3

# Specify the type of compute (CPU, GPU_SMALL, GPU_MEDIUM, etc.)
workload_type = "GPU_MEDIUM"

# Specify the scale-out size of compute (Small, Medium, Large, etc.)
workload_size = "Small"

# Specify Scale to Zero (only supported for CPU endpoints)
scale_to_zero = False

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

# Send the POST request to create the serving endpoint

data = {
    "name": endpoint_name,
    "config": {
        "served_models": [
            {
                "model_name": model_name,
                "model_version": model_version,
                "workload_size": workload_size,
                "scale_to_zero_enabled": scale_to_zero,
                "workload_type": workload_type,
            }
        ]
    },
}

headers = {"Context-Type": "text/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))
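
Endpoint creation is asynchronous, and the endpoint can take a while to become ready. Optionally, you can poll its state with the same REST API before sending traffic. The following sketch assumes the GET response includes a state object with a ready field, and reuses API_ROOT, endpoint_name, and headers from above:

import time

# Optional: poll the endpoint until it reports READY before querying.
# Assumes the response payload is shaped like {"state": {"ready": "READY", ...}, ...}.
while True:
    status = requests.get(
        url=f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}",
        headers=headers,
    ).json()
    if status.get("state", {}).get("ready") == "READY":
        print("Endpoint is ready")
        break
    print("Endpoint not ready yet; checking again in 60 seconds...")
    time.sleep(60)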

Input and output schema format

An optimized LLM serving endpoint has input and output schemas that Databricks controls. Four different input formats are supported.

  • dataframe_split is a JSON-serialized Pandas DataFrame in the split orientation.

    {
      "dataframe_split":{
        "columns":["prompt"],
        "index":[0],
        "data":[["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"]]
      },
      "params": {
        "temperature": 0.5,
        "max_tokens": 100,
        "stop": ["word1","word2"],
        "candidate_count": 1
      }
    }
    
  • dataframe_records is a JSON-serialized Pandas DataFrame in the records orientation.

    {
      "dataframe_records": [{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"}],
      "params": {
        "temperature": 0.5,
        "max_tokens": 100,
        "stop": ["word1","word2"],
        "candidate_count": 1
      }
    }
    
  • instances

    {
      "instances": [
       {
         "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
       }
      ],
      "params": {
      "temperature": 0.5,
      "max_tokens": 100,
      "stop": ["word1","word2"],
      "candidate_count": 1
      }
    }
    
  • inputs

    {
      "inputs": {
        "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
      },
      "params": {
        "temperature": 0.5,
        "max_tokens": 100,
        "stop": ["word1","word2"],
        "candidate_count": 1
      }
    }
    

Query your endpoint

Depending on the model size and complexity, it can take 30 minutes or more for the endpoint to become ready. After your endpoint is ready, you can query it by making an API request.


data = {
    "inputs": {
        "prompt": [
            "Hello, I'm a language model,"
        ]
    },
    "params": {
        "max_tokens": 100,
        "temperature": 0.0
    }
}

headers = {"Context-Type": "text/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations", json=data, headers=headers
)

print(json.dumps(response.json()))
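
As an alternative to constructing the REST call yourself, the same request can be sent with the MLflow Deployments client. This is a sketch assuming MLflow 2.9 or above with the Databricks deployments plugin available in the notebook environment:

from mlflow.deployments import get_deploy_client

# Query the serving endpoint through the MLflow Deployments client
client = get_deploy_client("databricks")

response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "inputs": {"prompt": ["Hello, I'm a language model,"]},
        "params": {"max_tokens": 100, "temperature": 0.0},
    },
)
print(response)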

Limitations

  • Given the increased installation requirements for models served on GPU, container image creation for GPU serving takes longer than image creation for CPU serving.

    • Model size also impacts image creation. For example, models that have 30 billion parameters or more can take at least an hour to build.

    • Databricks reuses the same container the next time the same version of the model is deployed, so subsequent deployments will take less time.

  • Autoscaling for GPU serving takes longer than for CPU serving due to the increased setup time for models served on GPU compute. Databricks recommends over-provisioning to avoid request timeouts.

Notebook examples

The following notebooks show how to create an optimized serving endpoint:

Optimized LLM serving for Llama2 model notebook

Optimized LLM serving for MPT model notebook
