MLflow for GenAI apps and agents
Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality.
MLflow solves this by combining AI-powered quality metrics with comprehensive trace observability, so you can measure, improve, and monitor quality throughout the entire application lifecycle.
How MLflow helps you measure and improve the quality of GenAI apps and agents
MLflow helps you orchestrate a continuous improvement cycle that incorporates both user feedback and domain expert judgment. From development through production, you use consistent quality metrics (scorers) that are tuned to align with human expertise, ensuring your automated evaluation reflects real-world quality standards.
The continuous improvement cycle
- 🚀 Production App: Your deployed GenAI app serves users and generates traces that capture every step, input, and output of each interaction (see the tracing sketch after this list)
- 👍 👎 User Feedback: End users provide feedback (thumbs up/down, ratings) that gets attached to each trace, helping you identify quality issues (see the feedback sketch after this list)
- 🔍 Monitor & Score: Production monitoring automatically runs LLM-judge-based scorers on traces to assess quality, attaching the resulting feedback to each trace
- ⚠️ Identify Issues: You use the Trace UI to find patterns in low-scoring traces, guided by end-user and LLM-judge feedback
- 👥 Domain Expert Review: Optionally, you send a sample of traces to domain experts via the Review App for detailed labeling and quality assessment
- 📋 Build Eval Dataset: You curate both problematic traces and high-quality traces into evaluation datasets so you can fix the bad while preserving the good (see the dataset sketch after this list)
- 🎯 Tune Scorers: Optionally, you use expert feedback to align your scorers and LLM judges with human judgment, so automated evaluation reflects your experts' standards
- 🧪 Evaluate New Versions: You use the evaluation harness to test improved app versions against your evaluation datasets, applying the same scorers used in monitoring to determine whether quality improved or regressed (see the evaluation sketch after this list). Optionally, you use version and prompt management to track your work.
- 📈 Compare Results: You use the evaluation runs generated by the evaluation harness to compare versions and identify the top performers
- ✅ Deploy or Iterate: If quality improves without regressions, deploy; otherwise, iterate and re-evaluate
Why this approach works
- Human-aligned metrics: Scorers are tuned to match domain expert judgment, ensuring automated evaluation reflects human quality standards
- Consistent metrics: The same scorers work in both development and production
- Real-world data: Production traces become test cases, ensuring you fix actual user issues
- Systematic validation: Every change is tested against regression datasets before deployment
- Continuous learning: Each cycle improves both your app and your evaluation datasets
Next steps
- Follow a quickstart guide to set up tracing, run your first evaluation, collect feedback from domain experts, or enable production monitoring.
- Gain a conceptual understanding of the key abstractions that power MLflow - from Traces to Evaluation Datasets to Scorers.