Deploy agents with tracing

MLflow Tracing provides comprehensive observability for production gen AI agents and apps by capturing execution details that you can view in the MLflow UI or analyze as tables.

Databricks provides a fully managed, production‑ready MLflow Tracking service in every workspace. When you set your tracking URI to databricks, traces are securely stored and served by Databricks, with no separate trace database or server to deploy or operate.
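
For example, pointing an application at the managed tracking service is a one-line configuration; inside Databricks notebooks this is typically already the default.

Python
import mlflow

# Send traces to the managed MLflow Tracking service in your Databricks workspace.
mlflow.set_tracking_uri("databricks")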

MLflow production tracing overview

How production tracing works:

  1. Your app generates traces for each API call. Your app may run in Model Serving (this guide) or may be an external application.

  2. Traces are logged in real time to an experiment in your Databricks MLflow tracking server, and optionally to Delta tables.

    1. For development, use traces logged to MLflow experiments.
    2. For production, use MLflow experiment logging and/or logging to Delta tables.
  3. Analyze and monitor traces using the MLflow UI, production monitoring, or custom evaluation.

When you deploy GenAI applications or agents that have been instrumented with MLflow Tracing through the Mosaic AI Agent Framework, MLflow Tracing works automatically without any additional configuration. This is the recommended deployment method. Traces are automatically stored in the agent's MLflow experiment. Optionally, traces can also be copied to Delta tables using Production monitoring.

Production tracing works for gen AI apps deployed inside or outside of Databricks. This section covers tracing apps deployed using Databricks Model Serving. For externally deployed apps, see Trace agents deployed outside of Databricks.

Steps for deployment

First, set up the storage location(s) for traces:

  1. If you plan to use Production Monitoring to store traces in Delta tables, then ensure it is enabled for your workspace.
  2. Create an MLflow Experiment for storing your app's production traces.
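
For instance, you can create the experiment from a notebook; mlflow.set_experiment creates the experiment if it does not already exist. The path below is an illustrative placeholder.

Python
import mlflow

# Create (or reuse) the experiment that will receive production traces.
# Replace the placeholder path with a workspace path of your choice.
mlflow.set_experiment("/Shared/my-agent-prod-traces")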

Next, in your Python notebook, instrument your agent with MLflow Tracing, and use Agent Framework to deploy your agent:

  1. Install mlflow[databricks] in your Python environment. Use the latest version.
  2. Connect to the MLflow Experiment using mlflow.set_experiment(...).
  3. Wrap your agent's code using Agent Framework's authoring interfaces. In your agent code, enable MLflow Tracing using automatic or manual instrumentation.
  4. Log your agent as an MLflow model, and register it to Unity Catalog.
  5. Ensure that mlflow is in the model's Python dependencies, with the same package version used in your notebook environment.
  6. Use agents.deploy(...) to deploy the Unity Catalog model (agent) to a Model Serving endpoint.

Traces from your agent now appear in the MLflow experiment in real time.
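
As a rough sketch, steps 2 through 6 above might look like the following (MLflow 3 is assumed; the experiment path, Unity Catalog model name, and agent.py file are illustrative placeholders, and agent.py is where your Agent Framework authoring code and tracing instrumentation live).

Python
import mlflow
from databricks import agents

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/my-agent-prod-traces")  # placeholder experiment path
mlflow.set_registry_uri("databricks-uc")               # register models to Unity Catalog

UC_MODEL_NAME = "main.default.my_agent"  # placeholder catalog.schema.model

# Log the agent as code. "agent.py" is a hypothetical file that wraps your agent with an
# Agent Framework authoring interface (for example, ChatAgent) and enables MLflow Tracing
# via automatic or manual instrumentation (for example, mlflow.openai.autolog()).
with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        name="agent",
        python_model="agent.py",
        pip_requirements=["mlflow[databricks]"],  # keep the mlflow version in sync with your notebook
        registered_model_name=UC_MODEL_NAME,
    )

# Deploy the registered Unity Catalog model (agent) to a Model Serving endpoint.
agents.deploy(UC_MODEL_NAME, logged_agent.registered_model_version)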

Example notebook

The following notebook shows how to deploy a simple gen AI app with Agent Framework and log traces to an MLflow experiment, following the steps outlined above.

Agent Framework and MLflow Tracing notebook


Deploy with custom CPU serving (alternative)

If you cannot deploy your agent using Agent Framework, this section explains how to deploy it on Databricks using custom CPU Model Serving. Otherwise, skip to the next section.

First, set up the storage location(s) for traces:

  1. If you plan to use Production Monitoring to store traces in Delta tables, then ensure it is enabled for your workspace.
  2. Create an MLflow Experiment for storing your app's production traces.

Next, in your Python notebook, instrument your agent with MLflow Tracing, and use the Model Serving UI or APIs to deploy your agent:

  1. Log your agent as an MLflow model. In your agent's code, enable MLflow Tracing using automatic or manual instrumentation.
  2. Deploy the model to CPU serving.
  3. Provision a Service Principal or Personal Access Token (PAT) with CAN_EDIT access to the MLflow experiment.
  4. On the CPU serving endpoint page, go to "Edit endpoint." For each deployed model you want to trace, add the following environment variables (see the sketch after this list):

    • ENABLE_MLFLOW_TRACING=true
    • MLFLOW_EXPERIMENT_ID=<ID of the experiment you created>
    • If you provisioned a Service Principal, set DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET. If you provisioned a PAT, set DATABRICKS_HOST and DATABRICKS_TOKEN.
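
If you prefer to set these variables programmatically rather than through the endpoint UI, a sketch using the Databricks Python SDK might look like the following; the endpoint, model, and secret names are placeholders, and you should confirm the ServedEntityInput fields against your SDK version.

Python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedEntityInput

w = WorkspaceClient()

# Placeholder names; match these to your endpoint and served model.
w.serving_endpoints.update_config(
    name="my-agent-endpoint",
    served_entities=[
        ServedEntityInput(
            entity_name="main.default.my_agent",
            entity_version="1",
            workload_size="Small",
            scale_to_zero_enabled=True,
            environment_vars={
                "ENABLE_MLFLOW_TRACING": "true",
                "MLFLOW_EXPERIMENT_ID": "<experiment-id>",
                # Service Principal auth; for a PAT, set DATABRICKS_HOST and DATABRICKS_TOKEN instead.
                "DATABRICKS_CLIENT_ID": "{{secrets/my_scope/sp_client_id}}",
                "DATABRICKS_CLIENT_SECRET": "{{secrets/my_scope/sp_client_secret}}",
            },
        )
    ],
)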

View production traces

After your agent is deployed, you can view its traces in the MLflow Experiments UI, just like traces from development. These production traces provide valuable insights into:

  • Real user queries and agent responses - See exactly what users are asking and how your agent responds
  • Quality insights from user feedback - View thumbs up/down ratings, comments, and other feedback attached to production traces
  • Error rates and failure patterns - Identify when and why your agent fails
  • Behavioral patterns - Understand how users interact with your agent and identify improvement opportunities
  • Latency and performance metrics - Monitor response times and system performance in production
  • Resource usage and costs - Track token consumption and associated costs

(Screenshot: Production Traces UI)
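
Beyond browsing traces in the UI, you can pull them into a notebook for ad hoc analysis. A minimal sketch using mlflow.search_traces follows; the experiment path is a placeholder, and the exact DataFrame columns depend on your MLflow version.

Python
import mlflow

mlflow.set_tracking_uri("databricks")

# Placeholder path; use the experiment that receives your production traces.
experiment = mlflow.get_experiment_by_name("/Shared/my-agent-prod-traces")

# Returns a pandas DataFrame with one row per trace (status, timing, request/response data, ...).
traces_df = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    max_results=100,
)
print(traces_df.head())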

Log traces to Delta tables

Once your agent is deployed, you can optionally log traces to Delta tables, in addition to your MLflow experiment. This logging is supported in two ways:

  • Production monitoring tables (recommended): Enable this by going to the Monitoring tab in the MLflow experiment and selecting a Unity Catalog schema. The job that syncs traces to a Delta table runs approximately every 15 minutes. You do not need to enable any monitoring metrics for this to work, and traces are not subject to size limits. See the query sketch after this list.
  • AI Gateway-enabled inference tables: Enable by editing the AI Gateway settings on the Model Serving endpoint page. Be aware of the limitations on trace sizes and delays in syncing traces to tables.
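
Once traces have been synced by production monitoring, the resulting Delta table can be queried like any other table. The sketch below assumes a placeholder table name; check the Monitoring tab for the table that is actually created in your selected Unity Catalog schema.

Python
# Placeholder table name; replace with the table created by production monitoring.
traces = spark.table("main.agent_monitoring.trace_logs")

# Inspect the schema, then filter by time window, status, or other columns as needed.
traces.printSchema()
display(traces.limit(10))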

Add metadata to traces

After basic tracing works, add metadata or context for better debugging and insights. MLflow provides standardized tags and attributes to capture important contextual information, including:

  • Request tracking - Link traces to specific API calls for end-to-end debugging
  • User sessions - Group related interactions to understand user journeys
  • Environment data - Track which deployment, version, or region generated each trace
  • User feedback - Collect quality ratings and link them to specific interactions

Get started at Add metadata and user feedback to traces.
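
For example, inside an instrumented request handler you might attach user and session context to the active trace. The sketch below assumes MLflow's conventional tag keys for users and sessions; the app_version tag and the function signature are illustrative.

Python
import mlflow

@mlflow.trace
def handle_request(question: str, user_id: str, session_id: str) -> str:
    # Attach contextual metadata to the trace for this request.
    mlflow.update_current_trace(
        tags={
            "mlflow.trace.user": user_id,        # conventional key for user tracking
            "mlflow.trace.session": session_id,  # conventional key for session grouping
            "app_version": "1.4.0",              # illustrative custom tag
        }
    )
    # ... call your agent here ...
    return "response"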

Track token usage and cost

In both development and production, MLflow Tracing can track token usage of LLM calls, which you can use to compute costs. Tracing uses token counts returned by LLM provider APIs.

MLflow Tracing natively supports token usage tracking for Databricks Foundation Model APIs called via the OpenAI client, as well as for many other LLM providers and frameworks, such as OpenAI, LangChain, and LangGraph. Once captured, token usage can be queried programmatically, as in the example below.

Python
import mlflow

# Retrieve a trace by ID (for example, copied from the MLflow UI or returned by search_traces()).
trace = mlflow.get_trace("<trace-id>")

# Get aggregated token usage (if available)
token_usage = trace.info.token_usage
if token_usage:
    print(f"Input tokens: {token_usage.get('input_tokens')}")
    print(f"Output tokens: {token_usage.get('output_tokens')}")
    print(f"Total tokens: {token_usage.get('total_tokens')}")

See Token usage information for more details.

MLflow Tracing allows you to instrument a specific agent or application. For monitoring usage across your AI platform, AI Gateway provides governance of shared serving endpoints. See AI Gateway usage tracking for logging platform-level token usage to system tables.

You can use these token counts to compute costs, based on the LLM provider's pricing schedule. Remember that many providers charge different rates for input and output tokens.
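
For example, a back-of-the-envelope cost estimate from the aggregated counts might look like the following; the per-token prices are made-up placeholders, so substitute your provider's current rates.

Python
# Hypothetical prices in USD per one million tokens; replace with your provider's rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def estimate_cost(token_usage: dict) -> float:
    """Estimate the cost of a single trace from its aggregated token usage."""
    input_tokens = token_usage.get("input_tokens") or 0
    output_tokens = token_usage.get("output_tokens") or 0
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )

# For example: estimate_cost(trace.info.token_usage) for a trace retrieved as shown above.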

Limitations

Logging traces to MLflow experiments and to production monitoring tables comes with limits on the number of traces and peak load. If you need to store more than 100K traces per experiment or have a peak load of > 60 queries per second (QPS), use this form to request an increase.

Next steps

Feature references

For details on concepts and features in this guide, see: