Evaluate & Monitor

MLflow's evaluation and monitoring capabilities help you systematically measure, improve, and maintain the quality of your GenAI applications throughout their lifecycle. From development through production, use the same quality scorers to ensure your applications deliver accurate, reliable responses while managing cost and latency.

This page gives an overview of the core evaluation and monitoring workflows and concepts, with links to more detailed documentation.

note

Agent Evaluation is integrated with Managed MLflow 3. The Agent Evaluation SDK methods are now exposed through the mlflow[databricks]>=3.1 SDK. See the migration guide to update your legacy Agent Evaluation code to MLflow 3 SDKs.

Evaluation during development

Test and improve your GenAI app iteratively by running evaluations against curated evaluation datasets using pre-built and custom scorers. MLflow's evaluation harness helps you test new versions of your app and prompts in order to:

  • Determine if your changes improved quality
  • Identify root causes of quality issues
  • Compare different versions of your app side-by-side
  • Verify that changes did not cause regressions
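
For example, a minimal evaluation run against a small in-memory dataset could look like the sketch below. It assumes mlflow[databricks]>=3.1 and uses built-in scorers; `my_app` and the sample record are placeholders for your own application and curated evaluation data.

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# A tiny evaluation dataset; in practice, curate this from real traffic.
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for managing the ML lifecycle."
        },
    },
]

# Placeholder app: replace with a call into your real application or agent.
def my_app(question: str) -> str:
    return "MLflow is an open source platform for the end-to-end ML lifecycle."

# Run the evaluation harness with built-in scorers.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness(), RelevanceToQuery()],
)
```

Each run logs its scores to MLflow, so you can compare app or prompt versions side by side and catch regressions before they ship.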

Monitoring in production

Monitoring is in Beta.

Continuously track your deployed app's performance and quality. MLflow's monitoring capabilities enable you to:

  • Automatically assess quality using the same scorers as development
  • Track operational metrics (latency, cost, errors)
  • Identify underperforming queries to create evaluation datasets
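
Monitoring builds on traces logged from your deployed app. The sketch below shows one way to wire this up, assuming an OpenAI-based app and a hypothetical experiment path.

```python
import mlflow

# Capture traces from the deployed app automatically.
# (Assumes an OpenAI-based app; other libraries have their own autolog flavors.)
mlflow.openai.autolog()
mlflow.set_experiment("/Shared/my-genai-app")  # hypothetical experiment path

# Later, pull recent production traces into a DataFrame to review quality and
# latency, and to curate underperforming queries into an evaluation dataset.
traces = mlflow.search_traces(max_results=500)
```

From there, the same scorers used in development can be applied to the captured traces, and problem queries can be folded back into your evaluation datasets.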

Getting started

Start with the Evaluation Quickstart to evaluate your first GenAI app in minutes.

Next steps

Continue your journey with the tutorials and reference guides linked below.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.

  • Scorers - Understand how scorers assess GenAI applications
  • LLM judges - Learn about using LLMs as evaluators
  • Evaluation Harness - Explore how MLflow orchestrates evaluations