Step 6. Make & evaluate quality fixes on the AI agent

This article walks you through implementing, evaluating, and iterating on quality fixes for your generative AI agent, based on your root cause analysis.

POC workflow diagram, iteration step

For more information about evaluating an AI agent, see What is Mosaic AI Agent Evaluation?.

Requirements

  1. Based on your root cause analysis, you have identified potential fixes to either retrieval or generation to implement and evaluate.

  2. Your POC application (or another baseline chain) is logged to an MLflow run, with its Agent Evaluation results stored in the same run.

See the GitHub repository for the sample code in this section.

Expected outcome in Agent Evaluation

Animated GIF showing output of an agent evaluation run in Databricks MLflow.

The preceding image shows the Agent Evaluation output in MLflow.

How to fix, evaluate, and iterate on the AI agent

For all types of fixes, use the B_quality_iteration/02_evaluate_fixes notebook to evaluate each resulting chain against your baseline configuration (your POC) and pick a “winner”. The notebook helps you select the winning experiment and deploy it to the review app or to a production-ready, scalable REST API.

  1. In Databricks, open the B_quality_iteration/02_evaluate_fixes notebook.

  2. Follow the instructions for the type of fix you are implementing:

    • For data pipeline fixes:

    • For chain configuration fixes:

      • Follow the instructions in the Chain configuration section of the 02_evaluate_fixes notebook to add chain configuration fixes to the CHAIN_CONFIG_FIXES variable.

    • For chain code fixes:

      • Create a modified chain code file and save it to the B_quality_iteration/chain_code_fixes folder. Alternatively, select one of the provided chain code fixes from that folder.

      • Follow the instructions in the Chain code section of the 02_evaluate_fixes notebook to add the chain code file and any additional chain configuration that is required to the CHAIN_CODE_FIXES variable.

  3. Run the notebook from the Run evaluation cell. The notebook then does the following:

    • Evaluates each fix.

    • Determines the fix with the best quality/cost/latency metrics.

    • Deploys the best fix to the review app and a production-ready REST API to get stakeholder feedback.
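
The selection step above can be sketched in plain Python. This is a minimal illustration of one possible way to trade off quality, cost, and latency, not the notebook's actual logic, which is not shown here. The `RunMetrics` class, the `pick_winner` function, the ratio thresholds, the fix names, and all metric values are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one evaluated chain (hypothetical structure)."""
    name: str
    quality: float      # share of answers judged correct; higher is better
    cost_tokens: float  # average tokens per request; lower is better
    latency_s: float    # average seconds per request; lower is better


def pick_winner(
    baseline: RunMetrics,
    candidates: List[RunMetrics],
    max_cost_ratio: float = 1.5,     # assumed budget: at most 1.5x baseline cost
    max_latency_ratio: float = 1.5,  # assumed budget: at most 1.5x baseline latency
) -> RunMetrics:
    """Return the best fix that beats the baseline within the budgets.

    Falls back to the baseline when no candidate qualifies.
    """
    eligible = [
        c for c in candidates
        if c.quality > baseline.quality
        and c.cost_tokens <= baseline.cost_tokens * max_cost_ratio
        and c.latency_s <= baseline.latency_s * max_latency_ratio
    ]
    if not eligible:
        return baseline
    # Prefer higher quality; break ties with lower latency, then lower cost.
    return max(eligible, key=lambda c: (c.quality, -c.latency_s, -c.cost_tokens))


# Placeholder numbers for illustration only; these are not measured results.
baseline = RunMetrics("poc_baseline", quality=0.60, cost_tokens=1000, latency_s=2.0)
candidates = [
    RunMetrics("config_fix_a", quality=0.70, cost_tokens=1200, latency_s=2.2),
    RunMetrics("code_fix_b", quality=0.75, cost_tokens=1800, latency_s=3.5),
    RunMetrics("config_fix_c", quality=0.58, cost_tokens=900, latency_s=1.8),
]

winner = pick_winner(baseline, candidates)
print(f"Winner: {winner.name}")
```

In this sketch, `code_fix_b` has the highest quality but exceeds both assumed budgets, and `config_fix_c` is cheaper but scores below the baseline, so `config_fix_a` is selected. Treating cost and latency as hard limits rather than folding them into a weighted score keeps the decision easy to explain to stakeholders; a weighted score is an equally reasonable alternative.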