Mosaic AI Agent Evaluation tutorial notebook (MLflow 2)
Important
Databricks recommends using MLflow 3. In MLflow 3, Agent Evaluation APIs are included in the mlflow
package. For MLflow 3 information on this topic, see Evaluate & Monitor.
This page describes Agent Evaluation using MLflow 2.
The following notebook demonstrates how to evaluate a gen AI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers the following:
- How to load production logs (traces) into an evaluation dataset.
- How to run an evaluation and do root cause analysis.
- How to create custom metrics to automatically detect quality issues.
- How to send production logs to subject matter experts (SMEs) for labeling, and how to use those labels to evolve the evaluation dataset.
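
The workflow in this list can be illustrated with a minimal, plain-Python sketch. It is not the notebook's code. The log records, the `question` and `answer` field names, and the helper functions `logs_to_eval_rows`, `response_is_empty`, and `run_custom_metric` are all hypothetical and chosen for illustration; only the `request` and `response` row keys are meant to mirror the evaluation dataset's columns. In a Databricks notebook, rows like these would typically be passed as a DataFrame to `mlflow.evaluate(..., model_type="databricks-agent")`, and a custom metric would be registered through the Agent Evaluation metric decorator rather than called by hand as it is here.

```python
from typing import Any, Callable

# Hypothetical production log records; real traces carry far more detail.
SAMPLE_LOGS: list[dict[str, Any]] = [
    {"request_id": "r1", "question": "How do I create a cluster?",
     "answer": "Open Compute and click Create."},
    {"request_id": "r2", "question": "What is Unity Catalog?", "answer": ""},
    {"request_id": "r3", "question": "", "answer": "Hello!"},
]


def logs_to_eval_rows(logs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map log records to request/response rows, skipping logs with no question."""
    rows = []
    for log in logs:
        question = log.get("question", "").strip()
        if not question:
            continue  # nothing to evaluate without a user request
        rows.append({
            "request_id": log.get("request_id"),
            "request": question,
            "response": log.get("answer", "").strip(),
        })
    return rows


def response_is_empty(row: dict[str, Any]) -> bool:
    """Illustrative custom metric: flag rows where the app returned no text."""
    return not row["response"]


def run_custom_metric(
    rows: list[dict[str, Any]],
    metric_fn: Callable[[dict[str, Any]], bool],
) -> dict[str, bool]:
    """Apply one metric function to every row, keyed by request ID."""
    return {row["request_id"]: metric_fn(row) for row in rows}


eval_rows = logs_to_eval_rows(SAMPLE_LOGS)
flags = run_custom_metric(eval_rows, response_is_empty)

# Flagged requests are candidates to send to SMEs for labeling.
needs_review = [request_id for request_id, flagged in flags.items() if flagged]
print(needs_review)
```

Keeping the metric as a plain function over one row makes it easy to unit test before wiring it into an evaluation run.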
To get your agent ready for pre-production, see the Mosaic AI agent demo notebook. For general information, see Mosaic AI Agent Evaluation (MLflow 2).