Debug a deployed AI agent

This page explains how to debug common issues when deploying AI agents using Mosaic AI Agent Framework's agents.deploy() API.

Agent Framework deploys to Model Serving endpoints, so you should review the Debugging guide for Model Serving in addition to the agent debugging steps on this page.

Author agents using best practices

Use the following best practices when authoring agents:

  • Improve debugging by using recommended agent authoring interfaces and MLflow tracing: Follow the best practices in Author AI agents in code, such as enabling MLflow trace autologging, to make your agents easier to debug.
  • Document tools clearly: Clear tool and parameter descriptions ensure your agent understands your tools and uses them appropriately. See Improve tool-calling with clear documentation.
  • Add timeouts and token limits to LLM calls: Set timeouts and token limits on LLM calls in your agent code to avoid delays caused by long-running steps.
    • If your agent uses the OpenAI client to query a Databricks LLM serving endpoint, set custom timeouts on serving endpoint calls as needed, as shown in the sketch after this list.
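
The following is a minimal sketch of setting a request timeout and token limit when calling a Databricks LLM serving endpoint through the OpenAI client. The endpoint name, limits, and environment variables are placeholders; adjust them for your workspace.

Python
import os
from openai import OpenAI

# Point the OpenAI client at your workspace's serving endpoints.
# DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to be set in the environment.
client = OpenAI(
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
    api_key=os.environ["DATABRICKS_TOKEN"],
    timeout=30.0,   # fail fast instead of hanging on a slow endpoint
    max_retries=2,  # bound the number of automatic retries
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",  # placeholder endpoint name
    messages=[{"role": "user", "content": "Summarize the latest support ticket."}],
    max_tokens=512,  # cap output length to avoid long-running generations
)
print(response.choices[0].message.content)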

Debug slow or failed requests to deployed agents

If you enabled MLflow trace autologging while authoring your agent, traces are automatically logged in inference tables. These traces can help identify agent components that are slow or failing.
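
If autologging is not yet enabled in your agent code, the following is a minimal sketch of turning it on. This example assumes a LangChain- or LangGraph-based agent; use the autolog flavor that matches your agent's framework.

Python
import mlflow

# Enable MLflow trace autologging for a LangChain/LangGraph-based agent.
# Use the autolog flavor that matches your framework, e.g. mlflow.openai.autolog().
mlflow.langchain.autolog()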

Identify problematic requests

Follow these steps to find problematic requests:

  1. In your workspace, go to the Serving tab and select your deployment name.
  2. In the Inference tables section, find the inference table's fully-qualified name. For example, my-catalog.my-schema.my-table.
  3. Run the following in a Databricks notebook:
    Python
    %sql
    SELECT * FROM `my-catalog`.`my-schema`.`my-table`
  4. Inspect the Response column for detailed trace information.
  5. Filter on request_time, databricks_request_id, or status_code to narrow down the results. For example:
    Python
    %sql
    SELECT * FROM `my-catalog`.`my-schema`.`my-table`
    WHERE status_code != 200
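
You can also query the inference table from Python. The following sketch pulls recent failed requests with spark.sql; the table name and time window are placeholders, and the column names follow the inference table schema referenced above.

Python
# Pull recent failed requests and their traces from the inference table.
# Replace the table name and adjust the time window for your deployment.
failed_requests = spark.sql("""
    SELECT request_time, databricks_request_id, status_code, response
    FROM `my-catalog`.`my-schema`.`my-table`
    WHERE status_code != 200
      AND request_time >= current_timestamp() - INTERVAL 1 DAY
    ORDER BY request_time DESC
""")
display(failed_requests)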

Analyze root causes

After identifying failing or slow requests, use the mlflow.models.validate_serving_input API to invoke your agent against the failed input request. Then, view the resulting trace and perform root cause analysis on the failed response.
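
The following is a minimal sketch of replaying a failed request with validate_serving_input. The model URI and request payload are placeholders; in practice, copy the failed payload from the inference table.

Python
import mlflow

# Replace with the model URI of your deployed agent.
model_uri = "models:/my-catalog.my-schema.my-agent/1"

# Placeholder: paste the failed request payload captured in the inference table.
serving_input = """{"messages": [{"role": "user", "content": "The request that failed"}]}"""

# Invokes the agent against the failed input; inspect the resulting MLflow trace
# to find the failing or slow component.
response = mlflow.models.validate_serving_input(model_uri, serving_input)
print(response)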

For a faster development loop, you can update your agent code directly and iterate by invoking your agent against the failed input example.

Debug authentication errors

If your deployed agent encounters authentication errors while accessing resources such as vector search indexes or LLM endpoints, check that the agent was logged with the necessary resources for automatic authentication passthrough. See Automatic authentication passthrough.

To inspect the logged resources, run the following in a notebook:

Python
%pip install -U mlflow[databricks]
%restart_python

import mlflow
mlflow.set_registry_uri("databricks-uc")

# Replace with the model name and version of your deployed agent
agent_registered_model_name = ...
agent_model_version = ...

model_uri = f"models:/{agent_registered_model_name}/{agent_model_version}"
agent_info = mlflow.models.Model.load(model_uri)
print(f"Resources logged for agent model {model_uri}:", agent_info.resources)

To add missing resources or fix incorrect ones, you must log the agent again with the correct resources and redeploy it.
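
The following is a minimal sketch of re-logging and redeploying with resources declared, assuming the agent uses an LLM serving endpoint and a vector search index. The code path, endpoint name, index name, and model name are placeholders.

Python
import mlflow
from mlflow.models.resources import DatabricksServingEndpoint, DatabricksVectorSearchIndex
from databricks import agents

mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        python_model="agent.py",  # placeholder path to your agent code
        artifact_path="agent",
        resources=[
            DatabricksServingEndpoint(endpoint_name="databricks-meta-llama-3-3-70b-instruct"),
            DatabricksVectorSearchIndex(index_name="my-catalog.my-schema.my-index"),
        ],
        registered_model_name="my-catalog.my-schema.my-agent",
    )

# Deploy the newly registered model version.
agents.deploy("my-catalog.my-schema.my-agent", logged_agent_info.registered_model_version)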

If you’re using manual authentication for resources, verify that environment variables are correctly set. Manual settings override any automatic authentication configurations. See Manual authentication.
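
Manual-authentication environment variables are typically supplied when you deploy the agent. The following sketch assumes the environment_vars argument to agents.deploy and a Databricks secret that holds the token; the workspace URL, model name, and secret scope and key are placeholders.

Python
from databricks import agents

# Pass manual-authentication environment variables as Databricks secret references.
# These values override automatic authentication passthrough for the agent.
agents.deploy(
    "my-catalog.my-schema.my-agent",  # placeholder model name
    model_version=1,
    environment_vars={
        "DATABRICKS_HOST": "https://my-workspace.cloud.databricks.com",  # placeholder
        "DATABRICKS_TOKEN": "{{secrets/my-scope/my-token}}",             # placeholder secret reference
    },
)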