MLflow for GenAI apps and agents
Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality.
MLflow solves this by combining AI-powered quality metrics with comprehensive trace observability, so you can measure, improve, and monitor quality throughout the entire application lifecycle.
How MLflow helps you measure and improve the quality of GenAI apps and agents
MLflow helps you orchestrate a continuous improvement cycle that incorporates both user feedback and domain expert judgment. From development through production, you use consistent quality metrics (scorers) that are tuned to align with human expertise, ensuring your automated evaluation reflects real-world quality standards.
The continuous improvement cycle
- 🚀 Production App: Your deployed GenAI app serves users and generates traces that capture every step, input, and output of each interaction (see the tracing sketch after this list)
- 👍 👎 User Feedback: End users provide feedback (thumbs up/down, ratings) that gets attached to each trace, helping you identify quality issues (see the feedback sketch after this list)
- 🔍 Monitor & Score: Production monitoring automatically runs LLM-judge-based scorers on traces to assess quality, attaching the resulting feedback to each trace
- ⚠️ Identify Issues: You use the Trace UI to find patterns in low-scoring traces, guided by end-user and LLM-judge feedback
- 👥 Domain Expert Review: Optionally, you send a sample of traces to domain experts via the Review App for detailed labeling and quality assessment
- 📋 Build Eval Dataset: You curate both problematic traces and high-quality traces into evaluation datasets so you can fix the bad while preserving the good (see the dataset sketch after this list)
- 🎯 Tune Scorers: Optionally, you use expert feedback to align your scorers and LLM judges with human judgment, so automated evaluation reflects your experts' standards
- 🧪 Evaluate New Versions: You use the evaluation harness to test improved app versions against your evaluation datasets, applying the same scorers used in monitoring to determine whether quality improved or regressed (see the evaluation sketch after this list). Optionally, you use version and prompt management to track your work.
- 📈 Compare Results: You use the evaluation runs generated by the evaluation harness to compare versions and identify the top performers
- ✅ Deploy or Iterate: If quality improves without regressions, deploy; otherwise, iterate and re-evaluate
Why this approach works
- Human-aligned metrics: Scorers are tuned to match domain expert judgment, ensuring automated evaluation reflects human quality standards
- Consistent metrics: The same scorers work in both development and production
- Real-world data: Production traces become test cases, ensuring you fix actual user issues
- Systematic validation: Every change is tested against regression datasets before deployment
- Continuous learning: Each cycle improves both your app and your evaluation datasets
Next steps
- Follow a quickstart guide to set up tracing, run your first evaluation, collect feedback from domain experts, or enable production monitoring.
- Gain a conceptual understanding of the key abstractions that power MLflow - from Traces to Evaluation Datasets to Scorers.