Serve custom LLMs with Custom Model Serving
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
This article shows you how to deploy custom large language models (LLMs) on Model Serving using a vLLM engine. Use this workflow to serve fine-tuned models, PEFT variants, multimodal models, and other foundation models that are not available in Foundation Model APIs (FMAPI).
When to use custom LLM serving
Databricks recommends custom LLM serving when you have one of the following use cases:
- Fully fine-tuned models with custom weights that you trained on Databricks.
- Models from Hugging Face that are not available in FMAPI.
- Custom PEFT recipes that FMAPI does not support.
- Specialized models outside the FMAPI catalog, such as MedGemma.
- Multimodal (vision-language) models such as Qwen/Qwen2.5-VL-3B-Instruct.
- Any model that fits on a 1xH100 (80 GB of GPU memory).
Requirements
- Custom LLM serving Beta is opt-in. Contact your Databricks account team to enable it on your workspace. The starter notebook at the end of this page contains all runnable code for the following steps.
- Access to serverless GPU compute. An A10 GPU is the recommended development environment for smaller models, and an H100 for larger models.
- MLflow 3.12 or later. The starter notebook pins mlflow==3.12.0. If you build your own environment, match this version.
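If you build your own environment, a quick check like the following can confirm the MLflow requirement before you continue. This is a minimal sketch: the helper names (parse_version, meets_minimum, installed_mlflow_version) are hypothetical and not part of the starter notebook. It compares versions numerically, because a plain string comparison would rank "3.9" above "3.12".

```python
from importlib import metadata
from typing import Optional, Tuple

MIN_MLFLOW = (3, 12)  # minimum version stated in the requirements above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.12.0' into (3, 12, 0), ignoring a non-numeric suffix such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: Tuple[int, ...] = MIN_MLFLOW) -> bool:
    """Compare numerically so that 3.12 is treated as newer than 3.9."""
    return parse_version(installed) >= minimum


def installed_mlflow_version() -> Optional[str]:
    """Return the installed MLflow version, or None if MLflow is not installed."""
    try:
        return metadata.version("mlflow")
    except metadata.PackageNotFoundError:
        return None


version = installed_mlflow_version()
if version is None:
    print("MLflow is not installed in this environment.")
else:
    print(f"MLflow {version} meets the minimum: {meets_minimum(version)}")
```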
Step 1: Set up your environment
Create a notebook on serverless GPU compute with an A10 GPU. Install vLLM and its dependencies. The starter notebook pins a tested vLLM version.
You can also specify dependencies through a serverless environment instead of using %pip install.
Set your working directory to local disk (for example, using tempfile.mkdtemp()). The /Workspace filesystem does not support large files like model weights.
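The working-directory setup can be sketched as follows. The directory prefix and the model_dir name are illustrative assumptions; "qwen3" is chosen only because it matches the relative --model path used in the entrypoint example later in this article.

```python
import os
import tempfile

# Create a scratch directory on local disk. The prefix is an arbitrary choice.
work_dir = tempfile.mkdtemp(prefix="custom-llm-")
os.chdir(work_dir)

# Subdirectory that will hold the model weights (assumed name, see lead-in).
model_dir = os.path.join(work_dir, "qwen3")
os.makedirs(model_dir, exist_ok=True)

print(f"Working directory: {os.getcwd()}")
```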
Step 2: Download your model
Download model weights from Hugging Face with snapshot_download. The starter notebook uses Qwen/Qwen3-4B as an example, but you can substitute any model that fits your selected GPU's memory budget, including the following:
- Multimodal models such as Qwen/Qwen2.5-VL-3B-Instruct for vision-language use cases.
- Larger models that fit on a 1xH100, such as openai/gpt-oss-120b.
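A download step along these lines is one way to fetch the weights. It is a sketch, not the starter notebook's code: download_weights and directory_size_gb are hypothetical helpers, and the target directory name is an assumption. snapshot_download and its repo_id and local_dir parameters come from the huggingface_hub package, which is imported lazily so the helpers can be defined even where it is not installed.

```python
import os


def download_weights(repo_id: str, target_dir: str) -> str:
    """Download a Hugging Face model snapshot into target_dir and return the local path."""
    # Lazy import: requires the huggingface_hub package and network access.
    from huggingface_hub import snapshot_download

    os.makedirs(target_dir, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=target_dir)


def directory_size_gb(path: str) -> float:
    """Sum the size of all files under path, in GB (10^9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9


# Example usage (not executed here because it downloads several GB):
# local_path = download_weights("Qwen/Qwen3-4B", "qwen3")
# print(f"{directory_size_gb(local_path):.1f} GB on disk")
```

Checking the size on disk gives a first approximation of how much GPU memory the weights alone will need.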
Select a GPU based on your model's memory and performance needs.
| GPU | GPU memory |
|---|---|
| T4 | 16 GB |
| A10 | 24 GB |
| H100 | 80 GB |
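As a rough aid for reading the table, the following sketch estimates weight size and picks the smallest GPU that fits. The rule of thumb (parameters times bytes per parameter, plus headroom for the KV cache and activations) and the 30% headroom default are assumptions for illustration, not Databricks guidance. The 0.85 utilization default mirrors the --gpu-memory-utilization value in the entrypoint example later in this article. Always confirm the fit by testing locally.

```python
from typing import Optional

# GPU memory in GB, copied from the table above.
GPU_MEMORY_GB = {"T4": 16, "A10": 24, "H100": 80}


def estimate_weight_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size: parameters x bytes per parameter (2 bytes for float16)."""
    return num_params_billion * bytes_per_param


def smallest_gpu_for(weight_gb: float,
                     gpu_memory_utilization: float = 0.85,
                     headroom_fraction: float = 0.3) -> Optional[str]:
    """Return the smallest GPU whose usable memory fits weights plus headroom, or None."""
    needed = weight_gb * (1 + headroom_fraction)
    for name, memory_gb in sorted(GPU_MEMORY_GB.items(), key=lambda item: item[1]):
        if memory_gb * gpu_memory_utilization >= needed:
            return name
    return None


for params_b in (4, 7, 70):
    weights = estimate_weight_gb(params_b)
    print(f"{params_b}B params, ~{weights:.0f} GB in float16 -> {smallest_gpu_for(weights)}")
```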
Step 3: Test the model locally with vLLM
Before you deploy, test the model directly in your serverless GPU notebook by launching a local vLLM server. Local testing lets you verify the model, experiment with vLLM parameters, and troubleshoot issues before you create a serving endpoint.
Key things to know:
- Serverless GPU compute allows only ports 3000–3999 for local testing. Select a port in that range; the starter notebook uses 3080.
- The vLLM server exposes an OpenAI-compatible API at /invocations.
- You can test both regular and streaming requests.
- Tune parameters such as --dtype, --max-model-len, and --gpu-memory-utilization for your model.
- Add --enforce-eager for faster startup, at the cost of some inference performance.
- For larger models, use an H100 serverless GPU variant for local testing.
When you are satisfied with the configuration, stop the local server before you continue.
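The local test described in this step might look like the following sketch. build_vllm_command and chat are hypothetical helpers. The model path "qwen3" and served name "qwen" are taken from the entrypoint example in Step 4, and the /invocations route is the one named above. The code only builds and prints the command; starting the server is left as commented usage because it needs vLLM, a GPU, and the downloaded weights.

```python
import json
import sys
import urllib.request

LOCAL_PORT = 3080  # the starter notebook's port; must be within 3000-3999


def build_vllm_command(model_path, served_name, port, extra_args=()):
    """Build the vLLM OpenAI-compatible server command as an argument list."""
    if not 3000 <= port <= 3999:
        raise ValueError(f"Serverless GPU compute only allows ports 3000-3999, got {port}")
    return [
        sys.executable, "-u", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path, "--served-model-name", served_name,
        "--host", "0.0.0.0", "--port", str(port),
        *extra_args,
    ]


def chat(port, served_name, prompt, timeout=120):
    """Send one non-streaming chat request to the local server and return the JSON reply."""
    body = json.dumps({
        "model": served_name,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"http://localhost:{port}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


command = build_vllm_command(
    "qwen3", "qwen", LOCAL_PORT,
    extra_args=["--dtype", "float16", "--max-model-len", "16384",
                "--gpu-memory-utilization", "0.85"],
)
print(" ".join(command))

# Example usage (requires vLLM, a GPU, and the downloaded weights):
# import subprocess
# server = subprocess.Popen(command)
# ... wait until the server logs that startup is complete, then:
# print(chat(LOCAL_PORT, "qwen", "Hello"))
# server.terminate()
```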
Step 4: Log the model with a custom entrypoint
This step connects your local setup to Model Serving and has the following configuration requirements:
- The
taskmust be"llm/v1/chat". - The entrypoint must launch on port 8080, the port that Model Serving expects.
- The entrypoint command must mirror what you tested in Step 3, with port 8080 instead of your local port.
- The entrypoint launches from the MLflow model artifacts folder, so model paths are relative to that folder.
metadata = {
    "task": "llm/v1/chat",
    "entrypoint": (
        "python -u -m vllm.entrypoints.openai.api_server "
        "--model qwen3 --served-model-name qwen "
        "--host 0.0.0.0 --port 8080 "
        "--dtype float16 --max-model-len 16384 "
        "--gpu-memory-utilization 0.85"
    ),
}
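To keep the entrypoint in sync with the command you tested locally, you could derive one from the other instead of retyping it. The sketch below assumes the local command passes the port as a separate "--port <value>" argument. to_serving_entrypoint is a hypothetical helper, and port 3080 is the local port from Step 3.

```python
import shlex

SERVING_PORT = 8080  # the port Model Serving expects


def to_serving_entrypoint(local_command: str) -> str:
    """Return the local vLLM command with its --port value replaced by 8080."""
    args = shlex.split(local_command)
    if "--port" not in args or args.index("--port") + 1 >= len(args):
        raise ValueError("local command has no '--port <value>' argument")
    args[args.index("--port") + 1] = str(SERVING_PORT)
    return shlex.join(args)


# The command tested locally in Step 3 (port 3080).
local_command = (
    "python -u -m vllm.entrypoints.openai.api_server "
    "--model qwen3 --served-model-name qwen "
    "--host 0.0.0.0 --port 3080 "
    "--dtype float16 --max-model-len 16384 "
    "--gpu-memory-utilization 0.85"
)

metadata = {
    "task": "llm/v1/chat",
    "entrypoint": to_serving_entrypoint(local_command),
}
print(metadata["entrypoint"])
```

Deriving the entrypoint this way means any flag you tuned during local testing carries over unchanged, and only the port differs.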
Step 5: Register the model to Unity Catalog
Register the model to Unity Catalog using mlflow.register_model. Custom LLM serving depends on express deployments, so pass env_pack="databricks_model_serving" to enable them.
For example, add the following to your notebook:
model_version = mlflow.register_model(model_info.model_uri, UC_MODEL_NAME, env_pack="databricks_model_serving")
Step 6: Create a serving endpoint
Create the endpoint from the UI or programmatically with the Databricks SDK. The key decisions are compute type, workload size, and scale-to-zero behavior.
Pick a workload_type based on your model and cloud:
| GPU | Notes |
|---|---|
| 1x T4 (16 GB) | Smallest option. |
| 1x A10 (24 GB) | Default for general inference. Matches the notebook development environment. |
| 1x H100 (80 GB) | Recommended for large LLM workloads. Requires enrollment; see Limitations. |
workload_size (Small, Medium, or Large) controls the number of provisioned replicas behind the endpoint. Use Small for development and low-traffic workloads.
The following example shows a typical configuration:
from databricks.sdk.service.serving import ServedEntityInput, ServingModelWorkloadType

ServedEntityInput(
    entity_name="main.<schema>.<model_name>",
    entity_version="<version>",
    workload_type=ServingModelWorkloadType.GPU_MEDIUM,
    workload_size="Small",
    scale_to_zero_enabled=True,
)
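Put together, creating the endpoint with the Databricks SDK might look like the following sketch. create_endpoint and check_endpoint_args are hypothetical helpers, and the endpoint and model names in the usage comment are placeholders. The SDK is imported lazily because it needs the databricks-sdk package and workspace authentication, and the valid workload_size values are taken from the text above.

```python
VALID_WORKLOAD_SIZES = ("Small", "Medium", "Large")


def check_endpoint_args(entity_name: str, workload_size: str) -> None:
    """Fail early on a malformed Unity Catalog name or an unknown workload size."""
    parts = entity_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("entity_name must look like <catalog>.<schema>.<model_name>")
    if workload_size not in VALID_WORKLOAD_SIZES:
        raise ValueError(f"workload_size must be one of {VALID_WORKLOAD_SIZES}")


def create_endpoint(endpoint_name, entity_name, entity_version,
                    workload_size="Small", scale_to_zero=True):
    """Create a serving endpoint with one served entity on a GPU_MEDIUM workload."""
    check_endpoint_args(entity_name, workload_size)

    # Lazy import: requires databricks-sdk and workspace credentials.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import (
        EndpointCoreConfigInput,
        ServedEntityInput,
        ServingModelWorkloadType,
    )

    w = WorkspaceClient()
    return w.serving_endpoints.create(
        name=endpoint_name,
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    entity_name=entity_name,
                    entity_version=str(entity_version),
                    workload_type=ServingModelWorkloadType.GPU_MEDIUM,
                    workload_size=workload_size,
                    scale_to_zero_enabled=scale_to_zero,
                )
            ]
        ),
    )


# Example usage (placeholder names; run inside an authenticated workspace):
# create_endpoint("my-custom-llm", "main.default.qwen3_chat", 1)
```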
Scale-to-zero and capacity planning
Custom LLM serving in Beta provisions a fixed number of replicas behind your endpoint. Autoscaling between nonzero replica counts is not yet supported, so you must size workload_type and workload_size for your peak traffic. Requests that exceed the capacity of the provisioned replicas are queued.
Set scale_to_zero_enabled=True to let the endpoint scale down to zero replicas when idle, then cold-start the first replica on the next request.
LLM endpoints have long cold-start times. Loading model weights and starting vLLM typically takes one to several minutes, depending on model size and GPU. Use scale_to_zero_enabled=True for development or low-priority workloads where occasional multi-minute first-request latency is acceptable. For latency-sensitive production traffic, set scale_to_zero_enabled=False so the endpoint is always ready.
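If you do enable scale-to-zero, a client that tolerates the cold start can make the first request less brittle. The following sketch retries with exponential backoff until a deadline. It assumes that a request sent during a cold start may fail or time out on the client side, which depends on your client's timeout settings. call_with_cold_start_retry is a hypothetical helper, and send is any zero-argument function that issues the request.

```python
import time


def call_with_cold_start_retry(send, max_wait_s=600.0, initial_delay_s=5.0,
                               max_delay_s=60.0, sleep=time.sleep,
                               clock=time.monotonic):
    """Call send() until it succeeds or max_wait_s elapses, backing off between tries."""
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        try:
            return send()
        except Exception as error:  # narrow this to your client's timeout and HTTP errors
            if clock() + delay > deadline:
                raise TimeoutError(f"endpoint not ready after {max_wait_s} s") from error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)


# Example usage with a hypothetical query function:
# reply = call_with_cold_start_retry(lambda: query_endpoint("Hello"))
```

The ten-minute default is deliberately generous; tune it to the cold-start time you observe for your model and GPU.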
Step 7: Query your endpoint
After the endpoint is ready, it is automatically available in AI Playground, which you can open from the endpoint's page. You can also query it programmatically using the Databricks SDK, the OpenAI SDK, or curl.
Databricks SDK:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()
w.serving_endpoints.query(
    name="<endpoint-name>",
    messages=[ChatMessage(role=ChatMessageRole.USER, content="Hello")],
)

OpenAI SDK:

from openai import OpenAI

# DATABRICKS_TOKEN and DATABRICKS_HOST hold your access token and workspace URL.
client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=f"{DATABRICKS_HOST}/serving-endpoints",
)
client.chat.completions.create(
    model="<endpoint-name>",
    messages=[{"role": "user", "content": "Hello"}],
)

curl:

curl -X POST \
  -u "token:$DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' \
  https://<workspace-url>/serving-endpoints/<endpoint-name>/invocations
Monitor your endpoint
Custom LLM serving uses the same observability infrastructure as standard custom model serving endpoints, with a few vLLM-specific additions described in the following sections.
Live logs
stdout and stderr from your vLLM process are available in real time in the Logs tab of the endpoint page in the Serving UI, and through the logs API.
Persisted logs and metrics
When telemetry is enabled, both logs and metrics are persisted to Unity Catalog Delta tables for long-term retention, SQL querying, and compliance. See Persist custom model serving data to Unity Catalog for full setup instructions, requirements, and table schemas.
For custom LLM serving specifically:
- Logs: stdout and stderr from the vLLM process are captured automatically. No application-side logging code is required.
- Metrics: Databricks automatically scrapes the vLLM server's Prometheus /metrics endpoint and persists the metrics alongside logs. Per-request latency, throughput, token counts, queue depth, and KV-cache utilization are all available by default.
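To get a feel for what the scraped data looks like, you can read the same /metrics output from a local vLLM server during testing. The sketch below parses Prometheus text exposition lines into tuples. parse_prometheus_text is a hypothetical helper that handles only simple samples (no timestamps or escaped quotes in label values), and the sample text is illustrative rather than captured from a real server.

```python
import re

# name{label="value",...} value   (labels are optional)
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Illustrative sample; real output contains many more metrics.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 2.0
vllm:num_requests_waiting{model_name="qwen"} 5.0
"""

for name, labels, value in parse_prometheus_text(sample):
    print(name, labels, value)
```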
Query telemetry data
During Beta, there is no platform UI for visualizing logs or metrics. Query the persisted data directly in Unity Catalog using SQL or a notebook. See the metric and log schemas documented in Persist custom model serving data to Unity Catalog.
The following notebook shows how to parse and visualize the persisted vLLM metrics:
Custom LLM serving metrics notebook
Example notebook
Develop and test the model in a serverless GPU notebook, then log and deploy the same configuration as a serving endpoint. The following notebook contains the complete runnable flow from this guide.
Custom LLM serving starter notebook
Limitations
The following limitations apply during Beta.
- GPU_XLARGE (1xH100) endpoints are available only in us-west-2 and require additional enrollment with your Databricks account team. Enrollment and region availability will expand during Beta.
- No autoscaling between replicas. Scale-to-zero is supported.
- Only the LLM chat task (llm/v1/chat) is supported, including multimodal.
- No route optimization.
- No platform UI for visualizing logs or metrics. Query telemetry directly in Unity Catalog.
Reach out to your Databricks account team with feedback or questions.