Evaluation and monitoring
MLflow's evaluation and monitoring capabilities help you systematically measure, improve, and maintain the quality of your GenAI applications throughout their lifecycle from development through production.
Generative AI applications are complex and involve many different components. Evaluating their performance is not as straightforward as evaluating traditional ML models, and the qualitative and quantitative metrics used to assess quality are inherently more complex.
The evaluation and monitoring component of MLflow 3 is designed to help you identify quality issues and the root cause of those issues. It is built on MLflow Tracing, which provides real-time trace logging in the development, testing, and production phases. It also includes built-in LLM-based scorers and an integrated review app for collecting human feedback. As shown in the diagram, the same LLM-based scorers are used in development and production, ensuring consistent evaluation throughout the application lifecycle.
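As a minimal sketch of how tracing fits in, any function decorated with `@mlflow.trace` is logged as a trace; the `answer_question` function below is a hypothetical app entry point standing in for your real application logic:

```python
import mlflow

# Decorated functions are logged as traces, capturing inputs, outputs,
# latency, and any nested spans. `answer_question` is a placeholder for
# your application's real entry point.
@mlflow.trace
def answer_question(question: str) -> str:
    # ... call your retrieval chain, agent, or LLM here ...
    return "placeholder answer"

answer_question("What does MLflow Tracing capture?")
```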
The diagram shows the high-level iterative workflow.
During development, you test the app against an evaluation dataset. You can also deploy a version to the Review App so that your domain experts can try it, and their interactions can be added to the evaluation dataset. You can use MLflow's pre-built scorers or your own custom scorers to evaluate the app's performance on the dataset.
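The sketch below shows what such an evaluation run can look like, assuming the `mlflow.genai.evaluate` harness and the built-in `Correctness`, `RelevanceToQuery`, and `Safety` scorers; `answer_question` and the dataset contents are placeholders:

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# Hypothetical app entry point; in practice this calls your model or agent.
@mlflow.trace
def answer_question(question: str) -> str:
    return "MLflow Tracing logs each request as a trace."

# A tiny evaluation dataset. Reference-based scorers such as Correctness read
# the optional `expectations` field; the others only need the inputs and the
# app's outputs.
eval_data = [
    {
        "inputs": {"question": "What does MLflow Tracing capture?"},
        "expectations": {
            "expected_response": "Each request is logged as a trace with inputs, outputs, and latency."
        },
    },
]

# Run the evaluation harness: each record's inputs are passed to predict_fn,
# and every scorer is applied to the resulting trace.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=answer_question,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)
```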
After you deploy the app to production, the same scorers are used to monitor its performance. You can save MLflow traces from production queries and add them to the evaluation dataset for future iterative app development.
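A sketch of that loop, assuming the `mlflow.search_traces` API and the managed `mlflow.genai.datasets` helpers; the experiment ID and Unity Catalog table name are placeholders:

```python
import mlflow
from mlflow.genai import datasets

# Pull a sample of recent production traces; the experiment ID is a placeholder.
traces = mlflow.search_traces(
    experiment_ids=["<production-experiment-id>"],
    max_results=50,
)

# Create (or reuse) an evaluation dataset backed by a Unity Catalog table and
# merge the sampled traces into it for future evaluation runs. The table name
# is a placeholder.
eval_dataset = datasets.create_dataset(uc_table_name="catalog.schema.eval_dataset")
eval_dataset.merge_records(traces)
```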
Feature | Description |
---|---|
Quickstart example notebook | An example notebook takes you through creating and tracing a simple GenAI application, defining evaluation criteria, running the evaluation, reviewing the results, and modifying the prompt and re-evaluating. |
Evaluation workflow | Step through the complete evaluation workflow. Learn how to use evaluation datasets to evaluate quality, identify issues, and iteratively improve your app. Create evaluation datasets from real usage. Use the evaluation harness to evaluate quality using pre-built and custom scorers. View results to help identify the root causes of quality issues. Compare versions to determine whether your changes improved quality without causing regressions. |
Production monitoring | Automatically run scorers on your production GenAI application traces to continuously monitor quality. You can schedule any scorer to automatically evaluate a sample of your production traffic. |
Pre-built LLM scorers | Built-in LLM-based scorers are the easiest way to get started. |
Custom LLM scorers | As your application becomes more complex, you can create custom LLM-based scorers to tune the evaluation criteria to the specific business requirements of your use case and to align with the judgment of your domain experts. |
Custom code-based scorers | Custom code-based scorers provide the flexibility to define evaluation metrics tailored to your specific business use case (see the sketch after this table). |
Evaluation datasets | Build evaluation datasets to systematically test and improve your GenAI application's quality. Add traces from testing or production queries. |
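As referenced in the table, here is a minimal custom code-based scorer sketch, assuming the `@scorer` decorator from `mlflow.genai.scorers` and the `Feedback` entity; the 100-word conciseness check is purely illustrative:

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

# A code-based scorer: a function decorated with @scorer that returns a
# number, boolean, or Feedback can be passed to the evaluation harness
# alongside the built-in scorers. The 100-word limit is illustrative.
@scorer
def concise_answer(outputs: str) -> Feedback:
    is_concise = len(outputs.split()) <= 100
    return Feedback(
        value=is_concise,
        rationale=(
            "Answer is within the 100-word limit."
            if is_concise
            else "Answer exceeds the 100-word limit."
        ),
    )

# Usage (placeholders): mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[concise_answer])
```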
Agent Evaluation is integrated with managed MLflow 3. The Agent Evaluation SDK methods are now available in the `mlflow[databricks]>=3.1` SDK. See Migrate to MLflow 3 from Agent Evaluation to update your MLflow 2 Agent Evaluation code to use MLflow 3.