Mosaic AI Agent Evaluation tutorial notebook (MLflow 2)
Important
Databricks recommends using MLflow 3 for evaluating and monitoring GenAI apps. This page describes MLflow 2 Agent Evaluation.
- For an introduction to evaluation and monitoring on MLflow 3, see Evaluation and monitoring.
- For information about migrating to MLflow 3, see Migrate to MLflow 3 from Agent Evaluation.
- For MLflow 3 information on this topic, see Evaluation and monitoring.
The following notebook shows how to evaluate a GenAI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers the following:
- How to load production logs (traces) into an evaluation dataset.
- How to run an evaluation and do root cause analysis.
- How to create custom metrics to automatically detect quality issues.
- How to send production logs to subject matter experts (SMEs) for labeling, and how to use those labels to evolve the evaluation dataset.
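The workflow in the list above can be pictured with a small, library-free sketch. It is not the notebook's code and does not call Agent Evaluation: in the notebook, evaluation runs through `mlflow.evaluate` with `model_type="databricks-agent"`. Everything below is an assumption made for illustration: the log records, the `request_id`/`request`/`response` field names as used here, the helpers `build_eval_dataset`, `run_custom_metric`, and `select_for_labeling`, and the `mentions_ai_disclaimer` check.

```python
from typing import Any, Dict, List

# Hypothetical production logs; real traces would come from your own logging store.
production_logs: List[Dict[str, Any]] = [
    {"request_id": "r1", "request": "What is MLflow?",
     "response": "MLflow is an open source platform for the ML lifecycle."},
    {"request_id": "r2", "request": "How do I reset my password?",
     "response": "As an AI language model, I cannot help with that."},
    {"request_id": "r3", "request": "", "response": "No question was asked."},
]


def build_eval_dataset(logs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one row per log that has both a non-empty request and response."""
    rows = []
    for log in logs:
        request = (log.get("request") or "").strip()
        response = (log.get("response") or "").strip()
        if request and response:
            rows.append({
                "request_id": log.get("request_id"),
                "request": request,
                "response": response,
            })
    return rows


def mentions_ai_disclaimer(response: str) -> bool:
    """Hypothetical custom metric: flag boilerplate AI disclaimers."""
    return "as an ai language model" in response.lower()


def run_custom_metric(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the custom metric result to a copy of every evaluation row."""
    return [
        {**row, "mentions_ai_disclaimer": mentions_ai_disclaimer(row["response"])}
        for row in rows
    ]


def select_for_labeling(results: List[Dict[str, Any]]) -> List[str]:
    """Return the request IDs of flagged rows, to route to SMEs for labels."""
    return [r["request_id"] for r in results if r["mentions_ai_disclaimer"]]


if __name__ == "__main__":
    eval_rows = build_eval_dataset(production_logs)
    results = run_custom_metric(eval_rows)
    print(select_for_labeling(results))
```

A deterministic check like this one is cheap to run on every logged request, which is why it is a reasonable first filter for deciding which rows to send for expert labeling. The flagged rows, once labeled, can then be added back to the evaluation dataset.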
To get your agent ready for pre-production, see the Mosaic AI agent demo notebook. For general information, see Mosaic AI Agent Evaluation (MLflow 2).