Debugging guide for Model Serving

This article describes debugging steps for common issues that you might encounter when working with model serving endpoints. Common issues include endpoints that fail to initialize or start, container build failures, and problems that occur while the model runs on the endpoint.

Access and review logs

Databricks recommends reviewing build logs for debugging and troubleshooting errors in your model serving workloads. See Monitor model quality and endpoint health for information about logs and how to view them.

Note

If your model code returns MlflowException errors, expect the response code to be mapped to a 4xx response. Databricks considers these model code errors to be customer-caused errors, since they can be resolved based on the resulting error message. 5xx error codes are reserved to communicate errors where Databricks is at fault.
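
The mapping described in this note can be sketched as a small triage helper. This is an illustrative sketch only: the function name and the returned labels are hypothetical and are not part of any Databricks or MLflow API.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to the party that most likely needs to act."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        # Model code errors such as MlflowException are surfaced as 4xx.
        return "customer-side error"
    if 500 <= status_code < 600:
        # 5xx codes are reserved for errors where Databricks is at fault.
        return "Databricks-side error"
    return "other"


print(classify_status(400))
print(classify_status(503))
```

A helper like this can be useful when you group failed requests before deciding whether to fix model code or contact support.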

Check the event logs for the model in the workspace UI and check for a successful container build message. If you do not see a build message after an hour, reach out to Databricks support for assistance.

If your build succeeds but you encounter other errors, see Debugging after container build succeeds. If your build fails, see Debugging after container build failure.

Installed library package versions

In your build logs you can confirm the package versions that are installed.

For MLflow versions, if you do not have a version specified, Model Serving uses the latest version.
For custom GPU serving, Model Serving installs the recommended versions of CUDA and cuDNN according to public PyTorch and TensorFlow documentation.
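
If you copy the build log text out of the UI, a short script can collect the installed versions for you. The following is a minimal sketch that assumes the log contains pip's standard `Successfully installed ...` summary line. The sample log line is hypothetical, and your build logs might be formatted differently.

```python
import re


def installed_versions(log_text):
    """Collect {package: version} from pip 'Successfully installed' lines."""
    versions = {}
    for line in log_text.splitlines():
        match = re.search(r"Successfully installed (.+)", line)
        if not match:
            continue
        for item in match.group(1).split():
            # pip prints each package as <name>-<version>; split on the last hyphen.
            name, sep, version = item.rpartition("-")
            if sep:
                versions[name] = version
    return versions


# Hypothetical log line for illustration.
sample_log = "Successfully installed mlflow-2.13.1 scikit-learn-1.3.0 torch-2.0.1+cu118"
print(installed_versions(sample_log))
```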

Log models that require `flash-attn`

If you are logging a model that requires flash-attn, Databricks recommends using a custom wheel version of flash-attn. Otherwise, build errors such as ModuleNotFoundError: No module named 'torch' can result.

To use a custom wheel version of flash-attn, specify all pip requirements as a list and pass the list as the pip_requirements parameter of your mlflow.transformers.log_model function. You must also specify the torch and torchvision versions that are compatible with the CUDA version specified in your flash-attn wheel.

For example, Databricks recommends using the following versions and wheels for CUDA 11.8:

PyTorch
Torch 2.0.1+cu118
Torchvision 0.15.2+cu118
Flash-Attn

Python

import mlflow

# test_pipeline, input_example, and registered_model_name are assumed to be defined earlier.
logged_model = mlflow.transformers.log_model(
    transformers_model=test_pipeline,
    artifact_path="artifact_path",
    pip_requirements=["--extra-index-url https://download.pytorch.org/whl/cu118", "mlflow==2.13.1", "setuptools<70.0.0", "torch==2.0.1+cu118", "accelerate==0.31.0", "astunparse==1.6.3", "bcrypt==3.2.0", "boto3==1.34.39", "configparser==5.2.0", "defusedxml==0.7.1", "dill==0.3.6", "google-cloud-storage==2.10.0", "ipython==8.15.0", "lz4==4.3.2", "nvidia-ml-py==12.555.43", "optree==0.12.1", "pandas==1.5.3", "pyopenssl==23.2.0", "pytesseract==0.3.10", "scikit-learn==1.3.0", "sentencepiece==0.1.99", "torchvision==0.15.2+cu118", "transformers==4.41.2", "https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"],
    input_example=input_example,
    registered_model_name=registered_model_name,
)
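
Because the torch, torchvision, and flash-attn entries must all target the same CUDA version, it can help to check the requirement list before logging the model. The following is a minimal sketch that assumes CUDA versions appear as tags such as `cu118` in the requirement strings. The helper name is hypothetical, and the list is shortened for illustration.

```python
import re


def cuda_tags(requirements):
    """Return the distinct CUDA tags (such as 'cu118') found in pip requirement strings."""
    tags = set()
    for requirement in requirements:
        # Match 'cu' followed by two or three digits, not preceded by a letter.
        tags.update(re.findall(r"(?<![A-Za-z])cu\d{2,3}", requirement))
    return tags


requirements = [
    "--extra-index-url https://download.pytorch.org/whl/cu118",
    "torch==2.0.1+cu118",
    "torchvision==0.15.2+cu118",
    "https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl",
]

tags = cuda_tags(requirements)
if len(tags) > 1:
    print(f"Mismatched CUDA tags: {sorted(tags)}")
else:
    print(f"Consistent CUDA tags: {sorted(tags)}")
```

More than one tag in the result suggests that at least one package targets a different CUDA version than the others.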

Before model deployment validation checks

Databricks recommends applying the guidance in this section before you serve your model. The following checks can catch issues early, before you wait for the endpoint to deploy. To validate your model input specifically, see Validate the model input before deployment.

Test predictions before deployment

Before deploying your model to the serving endpoint, test offline predictions with a virtual environment using mlflow.models.predict and input examples. See MLflow documentation for testing predictions for more detailed guidance.

Python

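import mlflow

# Example chat-style input; adjust it to match your model's signature.
input_example = {
    "messages": [
        {"content": "How many categories of products do we have? Name them.", "role": "user"}
    ]
}

# logged_chain_info is the object returned when you logged the model.
mlflow.models.predict(
    model_uri=logged_chain_info.model_uri,
    input_data=input_example,
)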

Validate the model input before deployment

Model serving endpoints expect a special format of JSON input. To validate that your model input works on a serving endpoint before deployment, you can use validate_serving_input in MLflow.

The following is an example of the auto-generated code in the run's artifacts tab if your model is logged with a valid input example.

Python
from mlflow.models import validate_serving_input

model_uri = 'runs:/<run_id>/<artifact_path>'

serving_payload = """{
 "messages": [
   {
     "content": "How many product categories are there?",
     "role": "user"
   }
 ]
}
"""

# Validate the serving payload works on the model
validate_serving_input(model_uri, serving_payload)

You can also test any input example against the logged model by using the convert_input_example_to_serving_input API to generate a valid JSON serving input.

Python
from mlflow.models import validate_serving_input
from mlflow.models import convert_input_example_to_serving_input

model_uri = 'runs:/<run_id>/<artifact_path>'

# Define INPUT_EXAMPLE with your own input example to the model
# A valid input example is a data instance suitable for pyfunc prediction

serving_payload = convert_input_example_to_serving_input(INPUT_EXAMPLE)

# Validate the serving payload works on the model
validate_serving_input(model_uri, serving_payload)
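
To illustrate what a serving-format payload looks like for tabular input, the following sketch builds a `dataframe_split` payload with only the standard library. The helper name and the column names are hypothetical. For real models, prefer convert_input_example_to_serving_input, which chooses the format for you.

```python
import json


def to_dataframe_split(columns, rows):
    """Build a JSON serving payload in the dataframe_split format."""
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})


# Hypothetical feature names and values for illustration.
serving_payload = to_dataframe_split(
    ["feature_a", "feature_b"],
    [[5.1, 3.5], [4.9, 3.0]],
)
print(serving_payload)
```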

Debugging after container build succeeds

Even if the container builds successfully, there might be issues when you run the model or during the operation of the endpoint itself. The following subsections detail common issues and how to troubleshoot and debug them.

Missing dependency

You might get an error like An error occurred while loading the model. No module named <module-name>. This error might indicate that a dependency is missing from the container. Verify that you declared all the dependencies that should be included in the container build. Pay special attention to custom libraries and ensure that the .whl files are included as artifacts.
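
One way to catch a missing dependency before deployment is to check which modules can be found in an environment built from the model's requirements. The following is a minimal sketch using the standard library. The helper name is hypothetical, and the module list is a placeholder for the modules your model imports.

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent or malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


# Replace this list with the modules that your model code imports.
print(missing_modules(["json", "no_such_module_for_serving_demo"]))
```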

Service logs looping

If the endpoint fails to become ready after the container builds, check the service logs to see whether they loop when the endpoint tries to load the model. If you see this behavior, try the following steps:

Open a notebook and attach to an All-Purpose cluster that uses a Databricks Runtime version, not Databricks Runtime for Machine Learning.
Load the model using MLflow and try debugging from there.

You can also load the model locally on your PC and debug from there. Load your model locally using the following:

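Python
import os
import mlflow

# "PROFILE" is the name of your Databricks CLI profile.
os.environ["MLFLOW_TRACKING_URI"] = "databricks://PROFILE"

# Replace "model_uri" with the URI of your model.
ARTIFACT_URI = "model_uri"
# A dot in the URI indicates a three-level Unity Catalog model name.
if '.' in ARTIFACT_URI:
    mlflow.set_registry_uri('databricks-uc')
local_path = mlflow.artifacts.download_artifacts(ARTIFACT_URI)
print(local_path)

Bash
# Replace local_path and artifact_path with the values from the previous step.
conda env create -f local_path/artifact_path/conda.yaml
conda activate mlflow-env

Python
import mlflow

# Replace the placeholder with the path printed in the first step.
mlflow.pyfunc.load_model("local_path/artifact_path")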
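
To confirm that the service logs are looping, you can count how often each message repeats in log text copied from the UI. The following is a minimal sketch that assumes timestamps were already stripped from each line. The helper name, the threshold, and the sample log lines are hypothetical.

```python
from collections import Counter


def repeated_messages(log_lines, min_repeats=3):
    """Return {message: count} for messages that appear at least min_repeats times."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return {message: n for message, n in counts.items() if n >= min_repeats}


# Hypothetical log lines for illustration.
sample_lines = [
    "Loading model",
    "Error while loading model",
    "Loading model",
    "Error while loading model",
    "Loading model",
    "Error while loading model",
    "Server started",
]
print(repeated_messages(sample_lines))
```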

Model fails or times out when requests are sent to the endpoint

You might receive an error like Encountered an unexpected error while evaluating the model. Verify that the input is compatible with the model for inference. when predict() is called on your model.

This error indicates a code issue in the predict() function. Databricks recommends that you load the model from MLflow in a notebook and call it. Doing so surfaces the issues in predict() and shows where the failure happens within the method.
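
When you call the model in a notebook, capturing the full traceback shows the line inside predict() that fails. The following is a minimal sketch. The BrokenModel class is a hypothetical stand-in for a model loaded with mlflow.pyfunc.load_model, and the helper name is an assumption.

```python
import traceback


def predict_with_traceback(model, model_input):
    """Call model.predict and return (result, None) or (None, formatted traceback)."""
    try:
        return model.predict(model_input), None
    except Exception:
        return None, traceback.format_exc()


class BrokenModel:
    """Hypothetical stand-in for a loaded model whose predict() has a bug."""

    def predict(self, model_input):
        # Fails because the key does not exist in the input.
        return model_input["missing_key"]


result, error = predict_with_traceback(BrokenModel(), {"messages": []})
print(error)
```

The printed traceback names the exception type and the failing line, which is usually enough to locate the bug in the model code.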

Root cause analysis of failed requests

If a request to an endpoint fails, you can perform root cause analysis by using inference tables. Inference tables automatically log all requests and responses to your endpoint in a Unity Catalog table for you to query.

For external models, provisioned throughput endpoints, and AI agents, see Monitor served models using AI Gateway-enabled inference tables.

For custom models, see Inference tables for monitoring and debugging models.

To query inference tables:

In your workspace, go to the Serving tab and select your endpoint name.
In the Inference tables section, find the inference table's fully-qualified name. For example, my-catalog.my-schema.my-table.

Run the following in a Databricks notebook:

SQL
%sql
SELECT * FROM `my-catalog`.`my-schema`.`my-table`

View and filter on columns such as request, response, request_time, and status_code to understand the requests and narrow down results.

SQL
```
%sql
SELECT * FROM `my-catalog`.`my-schema`.`my-table`
WHERE status_code != 200
```

If you enabled agent tracing for AI agents, see the response column to view detailed traces. See Enable inference tables for AI agents.
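
If you build the query in Python, names that contain hyphens must be quoted with backticks. The following is a minimal sketch of a query builder. The helper names are hypothetical, and the sketch assumes that no part of the table name contains a dot.

```python
def quote_table_name(full_name):
    """Backtick-quote each part of a catalog.schema.table name."""
    parts = full_name.split(".")
    # Escape embedded backticks by doubling them.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


def failed_requests_query(full_name):
    """Build a query that returns the requests that did not succeed."""
    return f"SELECT * FROM {quote_table_name(full_name)} WHERE status_code != 200"


query = failed_requests_query("my-catalog.my-schema.my-table")
print(query)
```

In a Databricks notebook, you could pass the resulting string to spark.sql(query) to run it.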

Workspace exceeds provisioned concurrency

You might receive a Workspace exceeded provisioned concurrency quota error.

You can increase concurrency depending on region availability. Reach out to your Databricks account team and provide your workspace ID to request a concurrency increase.

Debugging after container build failure

This section details issues that might occur when your build fails.

`OSError: [Errno 28] No space left on device`

The No space left error can occur when too many large artifacts are logged alongside the model unnecessarily. In MLflow, check that extraneous artifacts are not logged alongside the model, then redeploy the slimmed-down package.
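
To find the artifacts that take up the most space, you can download the model artifacts locally (for example, with mlflow.artifacts.download_artifacts) and list the largest files. The following is a minimal sketch using the standard library. The helper name is hypothetical, and the path in the usage comment is a placeholder.

```python
import os


def largest_files(root, top_n=10):
    """Return the top_n largest files under root as (size_in_bytes, path), largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            sizes.append((os.path.getsize(path), path))
    return sorted(sizes, reverse=True)[:top_n]


# Example usage, where the path is the directory that holds the downloaded artifacts:
# for size, path in largest_files("local_path"):
#     print(f"{size / 1e6:.1f} MB  {path}")
```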

Build failure due to lack of GPU availability

You might see the error: Build could not start due to an internal error - please contact your Databricks representative.

Reach out to your Databricks account team to help resolve the issue.
