Serve custom LLMs with Custom Model Serving

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

This article shows you how to deploy custom large language models (LLMs) on Model Serving using a vLLM engine. Use this workflow to serve fine-tuned models, PEFT variants, multimodal models, and other foundation models that are not available in Foundation Model APIs (FMAPI).

When to use custom LLM serving

Databricks recommends custom LLM serving when you have one of the following use cases:

Fully fine-tuned models with custom weights that you trained on Databricks.
Models from Hugging Face that are not available in FMAPI.
Custom PEFT recipes that FMAPI does not support.
Specialized models outside the FMAPI catalog, such as MedGemma.
Multimodal (vision-language) models such as Qwen/Qwen2.5-VL-3B-Instruct.
Any model that fits on a 1xH100 (80 GB of GPU memory).

Requirements

Custom LLM serving Beta is opt-in. Contact your Databricks account team to enable it on your workspace. The starter notebook at the end of this page contains all runnable code for the following steps.
Access to serverless GPU compute. An A10 GPU is the recommended development environment for smaller models; use an H100 for larger models.
MLflow 3.12 or later. The starter notebook pins mlflow==3.12.0. If you build your own environment, match this version.

Step 1: Set up your environment

Create a notebook on serverless GPU compute with an A10 GPU. Install vLLM and its dependencies. The starter notebook pins a tested vLLM version.

You can also specify dependencies through a serverless environment instead of using %pip install.

Important

Set your working directory to local disk (for example, using tempfile.mkdtemp()). The /Workspace filesystem does not support large files like model weights.
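As a minimal sketch of the note above, the following creates a temporary directory on local disk and makes it the working directory. The helper name make_local_workdir and the custom_llm_ prefix are illustrative and do not come from the starter notebook.

```python
import os
import tempfile


def make_local_workdir(prefix: str = "custom_llm_") -> str:
    """Create a fresh directory on local disk and switch into it."""
    # mkdtemp creates the directory under the system temp location,
    # which is on local disk rather than the /Workspace filesystem.
    work_dir = tempfile.mkdtemp(prefix=prefix)
    os.chdir(work_dir)
    return work_dir


work_dir = make_local_workdir()
print(f"Working directory: {work_dir}")
```

Download model weights and write MLflow artifacts under this directory in the later steps.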

Step 2: Download your model

Download model weights from Hugging Face with snapshot_download. The starter notebook uses Qwen/Qwen3-4B as an example, but you can substitute any model that fits your selected GPU's memory budget, including the following:

Multimodal models such as Qwen/Qwen2.5-VL-3B-Instruct for vision-language use cases.
Larger models that fit on a 1xH100, such as openai/gpt-oss-120b.

Select a GPU based on your model's memory and performance needs.

GPU	GPU memory	`workload_type`
T4	16 GB	`GPU_SMALL`
A10	24 GB	`GPU_MEDIUM`
H100	80 GB	`GPU_XLARGE`
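To make the table easier to apply, here is a rough sizing sketch. It estimates the weight footprint as parameter count times bytes per parameter and picks the smallest GPU from the table that fits. The 1.2 headroom factor for KV cache and activations is an assumption, not a Databricks recommendation, and the helper names are hypothetical. Treat the result as a starting point and confirm it with the local test in Step 3.

```python
# (workload_type, GPU, memory in GB), taken from the table above.
GPU_OPTIONS = [
    ("GPU_SMALL", "T4", 16),
    ("GPU_MEDIUM", "A10", 24),
    ("GPU_XLARGE", "H100", 80),
]


def estimate_weight_gb(num_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only estimate: 1e9 params * bytes per param is roughly that many GB."""
    return num_params_billions * bytes_per_param


def pick_workload_type(
    num_params_billions: float,
    bytes_per_param: float = 2.0,  # 2 bytes for float16/bfloat16 weights
    headroom: float = 1.2,         # assumed margin for KV cache and activations
) -> str:
    needed_gb = estimate_weight_gb(num_params_billions, bytes_per_param) * headroom
    for workload_type, _gpu, memory_gb in GPU_OPTIONS:
        if needed_gb <= memory_gb:
            return workload_type
    raise ValueError(f"About {needed_gb:.0f} GB needed; no single GPU in the table fits.")


print(pick_workload_type(8))   # 8B parameters in float16
print(pick_workload_type(30))  # 30B parameters in float16
```

Long context lengths increase KV-cache memory well beyond this estimate, so a larger GPU than the sketch suggests may still be the better choice.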

Step 3: Test the model locally with vLLM

Before you deploy, test the model directly in your serverless GPU notebook by launching a local vLLM server. Local testing lets you verify the model, experiment with vLLM parameters, and troubleshoot issues before you create a serving endpoint.

Key things to know:

Serverless GPU compute allows only ports 3000–3999 for local testing. Select a port in that range; the starter notebook uses 3080.
The vLLM server exposes an OpenAI-compatible API at /invocations.
You can test both regular and streaming requests.
Tune parameters such as --dtype, --max-model-len, and --gpu-memory-utilization for your model.
Add --enforce-eager for faster startup, at the cost of some inference performance.
For larger models, use an H100 serverless GPU variant for local testing.
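The points above can be sketched as a small command builder. It assembles the vLLM server arguments and rejects ports outside the 3000–3999 range. The function name build_vllm_command and its defaults are illustrative; the model path qwen3 and served name qwen mirror the entrypoint shown in Step 4, and the starter notebook might structure this differently.

```python
import shlex

# Serverless GPU compute allows only these ports for local testing.
LOCAL_PORT_RANGE = range(3000, 4000)


def build_vllm_command(
    model_path: str,
    served_name: str,
    port: int,
    dtype: str = "float16",
    max_model_len: int = 16384,
    gpu_memory_utilization: float = 0.85,
    enforce_eager: bool = False,
) -> list:
    """Return the argv list for a local vLLM OpenAI-compatible server."""
    if port not in LOCAL_PORT_RANGE:
        raise ValueError(f"Port {port} is outside 3000-3999.")
    args = [
        "python", "-u", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--served-model-name", served_name,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--dtype", dtype,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if enforce_eager:
        # Faster startup at the cost of some inference performance.
        args.append("--enforce-eager")
    return args


cmd = build_vllm_command("qwen3", "qwen", port=3080, enforce_eager=True)
print(shlex.join(cmd))
```

You could pass the resulting list to subprocess.Popen to start the server, and call terminate() on the returned process when you are done testing.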

When you are satisfied with the configuration, stop the local server before you continue.

Step 4: Log the model with a custom entrypoint

This step connects your local setup to Model Serving and has the following configuration requirements:

The task must be "llm/v1/chat".
The entrypoint must launch on port 8080, the port that Model Serving expects.
The entrypoint command must mirror what you tested in Step 3, with port 8080 instead of your local port.
The entrypoint launches from the MLflow model artifacts folder, so model paths are relative to that folder.

Python
metadata = {
    "task": "llm/v1/chat",
    "entrypoint": (
        "python -u -m vllm.entrypoints.openai.api_server "
        "--model qwen3 --served-model-name qwen "
        "--host 0.0.0.0 --port 8080 "
        "--dtype float16 --max-model-len 16384 "
        "--gpu-memory-utilization 0.85"
    ),
}
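Before logging, it can help to check the metadata against the requirements listed in this step. The following is a hypothetical pre-flight check, not part of MLflow or Model Serving. It assumes the entrypoint passes the port as two separate tokens (--port 8080) and does not handle the --port=8080 form.

```python
import shlex

REQUIRED_TASK = "llm/v1/chat"
SERVING_PORT = "8080"  # the port that Model Serving expects


def check_serving_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if metadata.get("task") != REQUIRED_TASK:
        problems.append(f"task must be {REQUIRED_TASK!r}")
    tokens = shlex.split(metadata.get("entrypoint", ""))
    if "--port" not in tokens:
        problems.append("entrypoint does not set --port")
    else:
        idx = tokens.index("--port")
        port = tokens[idx + 1] if idx + 1 < len(tokens) else None
        if port != SERVING_PORT:
            problems.append(f"entrypoint must use port {SERVING_PORT}, found {port}")
    return problems


example_metadata = {
    "task": "llm/v1/chat",
    "entrypoint": (
        "python -u -m vllm.entrypoints.openai.api_server "
        "--model qwen3 --served-model-name qwen "
        "--host 0.0.0.0 --port 8080"
    ),
}
print(check_serving_metadata(example_metadata))
```

A common mistake this catches is copying the Step 3 command without changing the local test port.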

Step 5: Register the model to Unity Catalog

Register the model to Unity Catalog using mlflow.register_model. Custom LLM serving depends on express deployments, so pass env_pack="databricks_model_serving" to enable them.

For example, add the following to your notebook:

Python

model_version = mlflow.register_model(model_info.model_uri, UC_MODEL_NAME, env_pack="databricks_model_serving")

Step 6: Create a serving endpoint

Create the endpoint from the UI or programmatically with the Databricks SDK. The key decisions are compute type, workload size, and scale-to-zero behavior.

Pick a workload_type based on your model and cloud:

`workload_type`	GPU	Notes
`GPU_SMALL`	1x T4 (16 GB)	Smallest option.
`GPU_MEDIUM`	1x A10 (24 GB)	Default for general inference. Matches the notebook development environment.
`GPU_XLARGE`	1x H100 (80 GB), `us-west-2`	Recommended for large LLM workloads. Requires enrollment; see Limitations.

workload_size (Small, Medium, or Large) controls the number of provisioned replicas behind the endpoint. Use Small for development and low-traffic workloads.

The following example shows a typical configuration:

Python
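from databricks.sdk.service.serving import ServedEntityInput, ServingModelWorkloadType

ServedEntityInput(
    # Unity Catalog three-level name of the registered model.
    entity_name="<catalog>.<schema>.<model_name>",
    entity_version="<version>",
    workload_type=ServingModelWorkloadType.GPU_MEDIUM,
    workload_size="Small",
    scale_to_zero_enabled=True,
)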
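If you create the endpoint through the REST API instead of the SDK, the same settings map to a JSON request body. The sketch below assumes the create-endpoint body uses name and config.served_entities; verify the shape against the Serving endpoints API reference before relying on it. The endpoint and model names are placeholders, and build_endpoint_request is a hypothetical helper.

```python
import json

# workload_type values from the table in this step.
VALID_WORKLOAD_TYPES = {"GPU_SMALL", "GPU_MEDIUM", "GPU_XLARGE"}


def build_endpoint_request(
    endpoint_name: str,
    entity_name: str,
    entity_version: str,
    workload_type: str = "GPU_MEDIUM",
    workload_size: str = "Small",
    scale_to_zero_enabled: bool = True,
) -> dict:
    """Assemble the request body for creating a serving endpoint."""
    if workload_type not in VALID_WORKLOAD_TYPES:
        raise ValueError(f"Unknown workload_type: {workload_type}")
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,
                    "entity_version": entity_version,
                    "workload_type": workload_type,
                    "workload_size": workload_size,
                    "scale_to_zero_enabled": scale_to_zero_enabled,
                }
            ]
        },
    }


request_body = build_endpoint_request(
    "my-llm-endpoint",                 # placeholder endpoint name
    "my_catalog.my_schema.my_model",   # placeholder Unity Catalog model name
    "1",
)
print(json.dumps(request_body, indent=2))
```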

Scale-to-zero and capacity planning

Custom LLM serving in Beta provisions a fixed number of replicas behind your endpoint. Autoscaling across nonzero replica counts is not yet supported; the only scaling behavior is scale-to-zero. Size workload_type and workload_size for your peak traffic, because requests beyond the capacity of the provisioned replicas are queued.

Set scale_to_zero_enabled=True to let the endpoint scale down to zero replicas when idle, then cold-start the first replica on the next request.

Warning

LLM endpoints have long cold-start times. Loading model weights and starting vLLM typically takes one to several minutes, depending on model size and GPU. Use scale_to_zero_enabled=True for development or low-priority workloads where occasional multi-minute first-request latency is acceptable. For latency-sensitive production traffic, set scale_to_zero_enabled=False so the endpoint is always ready.
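For clients that call a scale-to-zero endpoint, one way to tolerate a cold start is to retry with exponential backoff until a deadline. The following is a generic sketch, not a Databricks API. The exact error a client sees while a replica is starting is not specified here, so the retry_on parameter defaults to all exceptions as an assumption; narrow it to the errors your client actually raises.

```python
import time


def call_with_cold_start_retry(
    send_request,
    max_wait_seconds: float = 600.0,
    initial_delay: float = 5.0,
    max_delay: float = 60.0,
    retry_on: tuple = (Exception,),  # assumption: narrow this in real code
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call send_request() until it succeeds or max_wait_seconds would be exceeded."""
    deadline = clock() + max_wait_seconds
    delay = initial_delay
    while True:
        try:
            return send_request()
        except retry_on as exc:
            # Give up if waiting another `delay` seconds would pass the deadline.
            if clock() + delay > deadline:
                raise TimeoutError("Endpoint did not respond before the deadline.") from exc
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Example with a stand-in request that fails twice, then succeeds.
attempts = {"count": 0}


def fake_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("replica still starting")
    return "ok"


result = call_with_cold_start_retry(fake_request, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In practice, send_request would wrap one of the query calls from Step 7.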

Step 7: Query your endpoint

After the endpoint is ready, it is automatically available in AI Playground, which you can open from the endpoint's page. You can also query it programmatically using the Databricks SDK, the OpenAI SDK, or curl.

Databricks SDK

Python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()  # uses notebook or environment credentials
w.serving_endpoints.query(
    name="<endpoint-name>",
    messages=[ChatMessage(role=ChatMessageRole.USER, content="Hello")],
)

OpenAI SDK

Python
from openai import OpenAI

# DATABRICKS_HOST is the workspace URL, including https://.
client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=f"{DATABRICKS_HOST}/serving-endpoints",
)
client.chat.completions.create(
    model="<endpoint-name>",
    messages=[{"role": "user", "content": "Hello"}],
)

curl

Shell
curl -X POST \
  -u "token:$DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' \
  https://<workspace-url>/serving-endpoints/<endpoint-name>/invocations

Monitor your endpoint

Custom LLM serving uses the same observability infrastructure as standard custom model serving endpoints, but with a few vLLM-specific extras described in the following sections.

Live logs

stdout and stderr from your vLLM process are available in real time in the Logs tab of the endpoint page in the Serving UI, and through the logs API.

Persisted logs and metrics

When telemetry is enabled, both logs and metrics are persisted to Unity Catalog Delta tables for long-term retention, SQL querying, and compliance. See Persist custom model serving data to Unity Catalog for full setup instructions, requirements, and table schemas.

For custom LLM serving specifically:

Logs: stdout and stderr from the vLLM process are captured automatically. No application-side logging code is required.
Metrics: Databricks automatically scrapes the vLLM server's Prometheus /metrics endpoint and persists the metrics alongside logs. Per-request latency, throughput, token counts, queue depth, and KV-cache utilization are all available by default.
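To give a sense of what the scraped data looks like at the source, the sketch below parses a small sample of Prometheus text exposition output into a dictionary. The sample values are invented for illustration. The parser is deliberately simplified and assumes lines carry no trailing timestamp. The persisted Delta table schema is documented separately and might not match this raw format.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each series (metric name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def metric_name(series: str) -> str:
    """Strip the {label="..."} part from a series key."""
    return series.split("{", 1)[0]


# Illustrative sample only; values are made up.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen"} 3.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 2.0
"""

samples = parse_prometheus_text(SAMPLE)
for series, value in samples.items():
    print(metric_name(series), value)
```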

Query telemetry data

During Beta, there is no platform UI for visualizing logs or metrics. Query the persisted data directly in Unity Catalog using SQL or a notebook. See the metric and log schemas documented in Persist custom model serving data to Unity Catalog.

The following notebook shows how to parse and visualize the persisted vLLM metrics:

Custom LLM serving metrics notebook


Example notebook

Develop and test the model in a serverless GPU notebook, then log and deploy the same configuration as a serving endpoint. The following notebook contains the complete runnable flow from this guide.

Custom LLM serving starter notebook


Limitations

The following limitations apply during Beta.

GPU_XLARGE (1xH100) endpoints are available only in us-west-2 and require additional enrollment with your Databricks account team. Enrollment and region availability will expand during Beta.
No autoscaling between replicas. Scale-to-zero is supported.
Only the LLM chat task (llm/v1/chat) is supported. This includes multimodal chat requests.
No route optimization.
No platform UI for visualizing logs or metrics. Query telemetry directly in Unity Catalog.

Reach out to your Databricks account team with feedback or questions.
