Debugging guide for Model Serving
This article describes debugging steps for common issues with model serving endpoints. These include an endpoint that fails to initialize or start, container build failures, and errors that occur while the model is running on the endpoint.
Access and review logs
Databricks recommends reviewing build logs for debugging and troubleshooting errors in your model serving workloads. See Monitor model quality and endpoint health for information about logs and how to view them.
Check the event logs for the model in the workspace UI and check for a successful container build message. If you do not see a build message after an hour, reach out to Databricks support for assistance.
If your build is successful but you encounter other errors, see Debugging after container build succeeds. If your build fails, see Debugging after container build failure.
Debugging after container build succeeds
Even if the container builds successfully, there might be issues when you run the model or during the operation of the endpoint itself. The following subsections detail common issues and how to troubleshoot and debug them.
Missing dependency
You might get an error like: An error occurred while loading the model. No module named <module-name>. This error might indicate that a dependency is missing from the container. Verify that you properly declared all the dependencies that should be included in the container build. Pay special attention to custom libraries and ensure that the .whl files are included as artifacts.
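One way to narrow down a missing dependency is to check which modules can be found in an environment built from the model's requirements. The following is a minimal sketch using only the Python standard library. The helper name find_missing_modules and the module list are hypothetical and are not part of MLflow or Databricks.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent,
            # or when the name is malformed.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Hypothetical list: replace with the modules your model imports.
required = ["json", "my_custom_lib"]
print(find_missing_modules(required))
```

Any name this reports is a candidate for the model's logged dependencies or, for a custom library, for a .whl file included as an artifact.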
Service logs looping
If your container build fails, check the service logs to see if you notice them looping when the endpoint tries to load the model. If you see this behavior, try the following steps:
Open a notebook and attach to an All-Purpose cluster that uses a Databricks Runtime version, not Databricks Runtime for Machine Learning.
Load the model using MLflow and try debugging from there.
You can also load the model locally on your PC and debug from there. Load your model locally using the following:
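import os
import mlflow

# PROFILE is the name of your Databricks CLI profile.
os.environ["MLFLOW_TRACKING_URI"] = "databricks://PROFILE"

# Replace "model_uri" with the URI of your model.
ARTIFACT_URI = "model_uri"
# A dot in the URI indicates a model registered in Unity Catalog.
if '.' in ARTIFACT_URI:
    mlflow.set_registry_uri('databricks-uc')
local_path = mlflow.artifacts.download_artifacts(ARTIFACT_URI)
print(local_path)

Then, in a terminal, create and activate the model's conda environment, replacing local_path/artifact_path with the printed location:
conda env create -f local_path/artifact_path/conda.yaml
conda activate mlflow-env

Finally, load the model in Python from within that environment:
import mlflow
# Replace local_path/artifact_path with the printed location.
mlflow.pyfunc.load_model("local_path/artifact_path")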
Model fails when requests are sent to the endpoint
When predict() is called on your model, you might receive an error like: Encountered an unexpected error while evaluating the model. Verify that the input is compatible with the model for inference.
This error indicates a code issue in the predict() function. Databricks recommends that you load the model from MLflow in a notebook and call it. Doing so surfaces the issue in predict() and shows where within the method the failure happens.
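As an illustration of calling the model directly, the sketch below wraps a predict() call and captures the full traceback so the failing line is visible. The helper debug_predict and the BrokenModel class are hypothetical stand-ins. In practice, the model object would be whatever mlflow.pyfunc.load_model returns, and the sample input would match your model's expected input.

```python
import traceback


def debug_predict(model, sample_input):
    """Call model.predict and return (result, None) or (None, traceback text)."""
    try:
        return model.predict(sample_input), None
    except Exception:
        return None, traceback.format_exc()


class BrokenModel:
    """Hypothetical stand-in for a loaded model with a bug in predict()."""

    def predict(self, model_input):
        # Fails when a row does not contain the "feature_a" key.
        return [row["feature_a"] * 2 for row in model_input]


result, error = debug_predict(BrokenModel(), [{"feature_b": 1}])
print(error)
```

The printed traceback names the line inside predict() that raised the exception, which is usually more specific than the generic message returned by the endpoint.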
Debugging after container build failure
This section details issues that might occur when your build fails.
OSError: [Errno 28] No space left on device
The No space left error can occur when too many large artifacts are logged alongside the model unnecessarily. Check in MLflow that extraneous artifacts are not logged alongside the model, and try to redeploy the slimmed-down package.
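To see which artifacts take up the most space, you can download the model artifacts locally and list the largest files. The following is a minimal sketch using only the Python standard library. The helper largest_files is hypothetical, and the "." path is a placeholder for the directory where you downloaded the model artifacts.

```python
import os


def largest_files(root_dir, top_n=10):
    """Return up to top_n (size_in_bytes, relative_path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Skip broken symlinks and other non-regular files.
            if os.path.isfile(full_path):
                relative_path = os.path.relpath(full_path, root_dir)
                sizes.append((os.path.getsize(full_path), relative_path))
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    # "." is a placeholder: point this at the downloaded model directory.
    for size, path in largest_files("."):
        print(f"{size / 1e6:10.2f} MB  {path}")
```

Files near the top of the list that the model does not need at inference time are candidates to leave out when you log the model again.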