Mosaic AI Agent Evaluation tutorial notebook (MLflow 2)
MLflow 2
This page describes usage of Agent Evaluation version 0.22 with MLflow 2. Databricks recommends using MLflow 3, which is integrated with Agent Evaluation >1.0. In MLflow 3, Agent Evaluation APIs are now part of the mlflow package.
For information on this topic in MLflow 3, see Evaluate & Monitor.
The following notebook demonstrates how to evaluate a gen AI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers the following:
- How to load production logs (traces) into an evaluation dataset.
- How to run an evaluation and do root cause analysis.
- How to create custom metrics to automatically detect quality issues.
- How to send production logs to subject matter experts (SMEs) for labeling, and use those labels to evolve the evaluation dataset.
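
The workflow in the list above can be sketched in plain Python, without any Databricks dependency. This is a minimal illustration, not the notebook's code and not the Agent Evaluation API: the trace records, the helper names (`traces_to_eval_rows`, `response_is_useful`, `evaluate_rows`, `rows_for_expert_review`), and the refusal markers are all hypothetical. The `request_id`, `request`, and `response` field names are assumed here for the evaluation rows.

```python
# Hypothetical, simplified production traces. Real traces carry much more
# detail (spans, retrieved context, timings); this shape is an assumption.
raw_traces = [
    {"trace_id": "t1", "request": "How do I reset my password?",
     "response": "Go to Settings > Security and choose Reset password."},
    {"trace_id": "t2", "request": "What is the refund window?",
     "response": ""},
    {"trace_id": "t3", "request": "Can I export my data?",
     "response": "I'm sorry, I can't help with that."},
]


def traces_to_eval_rows(traces):
    """Step 1: turn logged traces into evaluation dataset rows."""
    return [
        {"request_id": t["trace_id"], "request": t["request"], "response": t["response"]}
        for t in traces
    ]


# Hypothetical phrases that suggest the app declined to answer.
REFUSAL_MARKERS = ("i'm sorry", "i can't help", "i cannot help")


def response_is_useful(row):
    """Step 3: a custom metric that flags empty or refusing responses."""
    text = row["response"].strip().lower()
    if not text:
        return False
    return not any(marker in text for marker in REFUSAL_MARKERS)


def evaluate_rows(rows, metric):
    """Step 2: apply a metric to every row and keep the verdict next to it."""
    return [{**row, "passed": metric(row)} for row in rows]


def rows_for_expert_review(results):
    """Step 4: collect the failing rows that a domain expert should label."""
    return [r["request_id"] for r in results if not r["passed"]]


eval_rows = traces_to_eval_rows(raw_traces)
results = evaluate_rows(eval_rows, response_is_useful)
needs_review = rows_for_expert_review(results)
pass_rate = sum(r["passed"] for r in results) / len(results)

print(f"pass rate: {pass_rate:.2f}")
print("send to SMEs:", needs_review)
```

In the notebook itself, the LLM judges and the labeling workflow are provided by Agent Evaluation rather than by hand-written helpers like these. The sketch only shows how the four steps relate: traces become dataset rows, a metric scores each row, and failing rows are routed to experts whose labels feed back into the dataset.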
To get your agent ready for pre-production, see the Mosaic AI agent demo notebook. For general information, see Mosaic AI Agent Evaluation (MLflow 2).