Human feedback
Human feedback is essential for building high-quality GenAI applications that meet user expectations. MLflow provides tools and a data model to collect, manage, and utilize feedback from developers, end-users, and domain experts.
Data model overview
MLflow stores human feedback as Assessments, attached to individual MLflow Traces. This links feedback directly to a specific user query and your GenAI app's outputs and logic.
There are two assessment types:
- Feedback: Evaluates your app's actual outputs or intermediate steps. For example, it answers questions like, "Was the agent's response good?". Feedback captures what the app actually produced, in the form of ratings, comments, or other qualitative signals.
- Expectation: Defines the desired or correct outcome (ground truth) that your app should have produced. For example, this could be the ideal response to a user's query. For a given input, the Expectation is always the same. Expectations define what the app should generate and are useful for creating evaluation datasets.
Assessments can be attached to the entire Trace or a specific span within the Trace.
For more detail about the data model, see Tracing Data Model.
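The sketch below illustrates the two assessment types in code. It is a minimal example, assuming MLflow 3's `mlflow.log_feedback` and `mlflow.log_expectation` APIs and a placeholder `trace_id` for a trace you have already captured; check the Tracing Data Model docs for the exact fields your MLflow version supports.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Placeholder: the ID of a trace you have already captured (hypothetical value).
trace_id = "tr-1234567890abcdef"

# Feedback: evaluates what the app actually produced.
mlflow.log_feedback(
    trace_id=trace_id,
    name="response_quality",
    value=True,  # e.g. a thumbs-up rating
    rationale="The answer was accurate and cited the right document.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",
    ),
)

# Expectation: defines the ground-truth output the app should have produced.
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="Our return policy allows refunds within 30 days of purchase.",
)
```

Both calls accept a `span_id` argument if you want to attach the assessment to a specific span rather than the whole trace.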
How to collect feedback
MLflow helps you collect feedback from three main sources. Each source is tailored to a different stage of your GenAI app's lifecycle. While feedback comes from different personas, the underlying data model is the same for all of them.
Developer feedback
During development, you can directly annotate traces. This is useful for tracking quality notes as you build and for marking specific examples for future reference or regression testing.
To learn how to annotate feedback during development, see Label during development.
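As a sketch of what this can look like in a development loop, the example below assumes MLflow 3's tracing APIs (including `mlflow.get_last_active_trace_id`) and a hypothetical `my_agent` function standing in for your app:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType


@mlflow.trace  # capture a trace for each invocation
def my_agent(question: str) -> str:
    # Hypothetical stand-in for your real app logic.
    return f"Answer to: {question}"


my_agent("How do I reset my password?")

# Annotate the trace that was just captured with a developer note.
trace_id = mlflow.get_last_active_trace_id()
mlflow.log_feedback(
    trace_id=trace_id,
    name="dev_note",
    value="Good candidate for the regression suite.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="developer@example.com",
    ),
)
```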
Domain expert feedback
Engage subject matter experts to provide structured feedback on your app's outputs and define expectations for correct responses. Their detailed evaluations help define what high-quality responses look like for your specific use case and are invaluable for aligning LLM judges with nuanced business requirements.

MLflow provides two approaches for collecting domain expert feedback using the Review App:
Interactive testing with Chat UI: Experts interact with your deployed app in real-time through a chat interface, providing immediate feedback on responses as they test conversational flows. This approach is ideal for "vibe checks" and qualitative validation before production deployment. To learn more, see Test an app version with the Chat UI.
Labeling existing traces: Experts systematically review and label traces that have already been captured from your app. This approach is ideal for structured evaluation sessions where experts assess specific examples and define ground truth expectations. To learn more, see Label existing traces.
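For example, you might gather the traces to hand to experts with `mlflow.search_traces` before adding them to a Review App labeling session as described in the linked guide. The experiment ID and filter below are placeholders:

```python
import mlflow

# Placeholder experiment ID and filter; adjust to your own setup.
traces = mlflow.search_traces(
    experiment_ids=["1234"],
    filter_string="attributes.status = 'OK'",
    max_results=50,
)
# `traces` is a DataFrame of candidate traces that can be added to a
# Review App labeling session for expert annotation.
```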
End-user feedback
In production, capture feedback from users interacting with your live application. This provides crucial insights into real-world performance, helping you identify problematic queries that need fixing and highlight successful interactions to preserve during future updates. MLflow provides tools to capture, store, and analyze feedback directly from the users of your deployed applications.
To learn how to collect end-user feedback, refer to the collect end user feedback guide in the tracing section.
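A minimal sketch of a backend handler that records end-user feedback against the trace of the request it refers to is shown below. The function name and parameters are hypothetical; wire it into your own API layer, and return the trace ID to the client along with the original response so it can be sent back with the user's rating.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType


def record_user_feedback(
    trace_id: str, user_id: str, thumbs_up: bool, comment: str | None = None
) -> None:
    """Attach an end user's rating (and optional comment) to the trace of their request."""
    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_rating",
        value=thumbs_up,
        rationale=comment,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id=user_id,
        ),
    )


# Example call from your API layer after the user clicks thumbs-up/down:
record_user_feedback(
    trace_id="tr-1234567890abcdef",  # returned to the client with the original response
    user_id="end-user-42",
    thumbs_up=False,
    comment="The answer didn't address my question.",
)
```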
Next steps
- Get started with collecting human feedback - Step through this holistic tutorial demonstrating common ways to collect human feedback.
- Label during development - Start annotating traces to track quality during development.
- Test an app version with the Chat UI - Test your app interactively using a live chat interface.
- Label existing traces - Set up systematic expert review processes.