Evaluation and monitoring
MLflow's evaluation and monitoring capabilities help you systematically measure, improve, and maintain the quality of your GenAI applications throughout their lifecycle from development through production.
Generative AI applications are complex and involve many different components. Evaluating their performance is not as straightforward as evaluating traditional ML models, and the qualitative and quantitative metrics used to assess quality are inherently more complex.
The evaluation and monitoring component of MLflow 3 is designed to help you identify quality issues and the root cause of those issues. It is built on MLflow Tracing, which provides real-time trace logging in the development, testing, and production phases. It also includes built-in LLM-based scorers and an integrated review app for collecting human feedback. As shown in the diagram, the same LLM-based scorers are used in development and production, ensuring consistent evaluation throughout the application lifecycle.
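As a minimal sketch of how tracing fits in, any function decorated with `@mlflow.trace` is logged as a trace; the `answer_question` function below is a hypothetical app entry point standing in for your real application logic:

```python
import mlflow

# Decorated functions are logged as traces, capturing inputs, outputs,
# latency, and any nested spans. `answer_question` is a placeholder for
# your application's real entry point.
@mlflow.trace
def answer_question(question: str) -> str:
    # ... call your retrieval chain, agent, or LLM here ...
    return "placeholder answer"

answer_question("What does MLflow Tracing capture?")
```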
The diagram shows the high-level iterative workflow.
During development, you test the app against an evaluation dataset. You can also deploy a version to the Review App so that your domain experts can try it, and their interactions can be added to the evaluation dataset. You can use MLflow's pre-built scorers or your own custom scorers to evaluate the app's performance on the dataset.
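The sketch below shows what such an evaluation run can look like, assuming the `mlflow.genai.evaluate` harness and the built-in `Correctness`, `RelevanceToQuery`, and `Safety` scorers; `answer_question` and the dataset contents are placeholders:

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# Hypothetical app entry point; in practice this calls your model or agent.
@mlflow.trace
def answer_question(question: str) -> str:
    return "MLflow Tracing logs each request as a trace."

# A tiny evaluation dataset. Reference-based scorers such as Correctness read
# the optional `expectations` field; the others only need the inputs and the
# app's outputs.
eval_data = [
    {
        "inputs": {"question": "What does MLflow Tracing capture?"},
        "expectations": {
            "expected_response": "Each request is logged as a trace with inputs, outputs, and latency."
        },
    },
]

# Run the evaluation harness: each record's inputs are passed to predict_fn,
# and every scorer is applied to the resulting trace.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=answer_question,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)
```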
After you deploy the app to production, the same scorers are used to monitor its performance. You can save MLflow traces from production queries and add them to the evaluation dataset for future iterative app development.
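A sketch of that loop, assuming the `mlflow.search_traces` API and the managed `mlflow.genai.datasets` helpers; the experiment ID and Unity Catalog table name are placeholders:

```python
import mlflow
from mlflow.genai import datasets

# Pull a sample of recent production traces; the experiment ID is a placeholder.
traces = mlflow.search_traces(
    experiment_ids=["<production-experiment-id>"],
    max_results=50,
)

# Create (or reuse) an evaluation dataset backed by a Unity Catalog table and
# merge the sampled traces into it for future evaluation runs. The table name
# is a placeholder.
eval_dataset = datasets.create_dataset(uc_table_name="catalog.schema.eval_dataset")
eval_dataset.merge_records(traces)
```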
Feature | Description |
---|---|
Quickstart example notebook | An example notebook takes you through creating and tracing a simple GenAI application, defining evaluation criteria, running the evaluation, reviewing the results, and modifying the prompt and re-evaluating. |
Evaluation workflow | Step through the complete evaluation workflow. Learn how to use evaluation datasets to evaluate quality, identify issues, and iteratively improve your app. Create evaluation datasets from real usage. Use the evaluation harness to evaluate quality using pre-built and custom scorers. View results to help identify the root causes of quality issues. Compare versions to determine whether your changes improved quality without causing regressions. |
Production monitoring | Automatically run scorers on your production GenAI application traces to continuously monitor quality. You can schedule any scorer to automatically evaluate a sample of your production traffic. |
Pre-built LLM scorers | Built-in LLM-based scorers are the easiest way to get started. |
Custom LLM scorers | As your application becomes more complex, you can create custom LLM-based scorers to tune the evaluation criteria to the specific business requirements of your use case and to align with the judgment of your domain experts. |
Custom code-based scorers | Custom code-based scorers provide the flexibility to define evaluation metrics tailored to your specific business use case (see the sketch after this table). |
Evaluation datasets | Build evaluation datasets to systematically test and improve your GenAI application's quality. Add traces from testing or production queries. |
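As referenced in the table, here is a minimal custom code-based scorer sketch, assuming the `@scorer` decorator from `mlflow.genai.scorers` and the `Feedback` entity; the 100-word conciseness check is purely illustrative:

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

# A code-based scorer: a function decorated with @scorer that returns a
# number, boolean, or Feedback can be passed to the evaluation harness
# alongside the built-in scorers. The 100-word limit is illustrative.
@scorer
def concise_answer(outputs: str) -> Feedback:
    is_concise = len(outputs.split()) <= 100
    return Feedback(
        value=is_concise,
        rationale=(
            "Answer is within the 100-word limit."
            if is_concise
            else "Answer exceeds the 100-word limit."
        ),
    )

# Usage (placeholders): mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[concise_answer])
```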
Agent Evaluation is integrated with managed MLflow 3. The Agent Evaluation SDK methods are now available in the `mlflow[databricks]>=3.1` SDK. See Migrate to MLflow 3 from Agent Evaluation to update your MLflow 2 Agent Evaluation code to use MLflow 3.