Databricks Assistant for agent observability and evaluation
Databricks Assistant provides a natural language interface for understanding, debugging, and improving your GenAI applications within MLflow. It has read access to everything in your experiment, from traces, prompts, and datasets to evaluation runs, scorers, and labeling sessions — so you can explore your observability and evaluation data conversationally instead of writing queries or navigating multiple UI pages.
To get started, click the Databricks Assistant icon in the top-right of your workspace while viewing an experiment.

Capabilities
Databricks Assistant can help you with a wide range of observability and evaluation tasks, including:
- Trace analysis and debugging: Investigate failing traces, find errors, examine span trees, pinpoint root causes, analyze latency, and identify bottlenecks in your agent's execution flow. Deep-dive into any trace to inspect its full span hierarchy, including inputs, outputs, metadata, and token usage at every step.
- Metrics and performance: Compute latency percentiles (P50/P95/P99), track error rates and throughput over time, analyze token usage patterns and costs, and compare performance across different time periods or filters.
- Quality and evaluations: Review assessment scores from human feedback, LLM judges, and programmatic checks. Inspect evaluation datasets, check registered scorers and their configurations, and get help setting up mlflow.genai.evaluate() with the right scorers.
- Labeling and review: View labeling sessions and who's assigned to review traces, and inspect labeling schemas to understand feedback criteria such as ratings, comments, and expectations.
- Prompt registry: Browse prompts in Unity Catalog, view templates, versions, and aliases.
- Instrumentation guidance: Get help adding tracing to your code with autolog(), @mlflow.trace, or manual spans, with runnable code snippets you can paste directly into Databricks notebooks.
Example questions
Here are some things you can ask Databricks Assistant:
- "Help me discover issues with my agent's tool calling in the traces for this experiment over the last 3 hours"
- "Identify cases where users get frustrated in the conversations with my agent"
- "Which sessions have the lowest user feedback scores, and what went wrong in those conversations?"
- "What are the most common failure patterns in my traces this past week, and what scorers should I add to catch them?"
- "Which spans consume the most tokens across all my traces?"
- "Find traces where the retriever returned no results but the agent still tried to answer"
- "Help me set up evaluation for my RAG agent with the right scorers"
Next steps
- MLflow Tracing - GenAI observability — Learn about MLflow Tracing for end-to-end observability.
- Evaluate and monitor AI agents — Set up evaluation and monitoring for your GenAI agents.
- Get started: MLflow 3 for GenAI — Get started with MLflow 3 for GenAI.