What is Mosaic AI Agent Evaluation?

Preview

This feature is in Public Preview.

This article gives an overview of how to work with Mosaic AI Agent Evaluation. Agent Evaluation helps developers evaluate the quality, cost, and latency of agentic AI applications, including RAG applications and chains. Agent Evaluation is designed to both identify quality issues and determine the root cause of those issues. The capabilities of Agent Evaluation are unified across the development, staging, and production phases of the MLOps life cycle, and all evaluation metrics and data are logged to MLflow Runs.

Agent Evaluation packages research-backed evaluation techniques in an SDK and UI that work with your lakehouse, MLflow, and the other Databricks Data Intelligence Platform components. Developed in collaboration with Mosaic AI research, this proprietary technology offers a comprehensive approach to analyzing and enhancing agent performance.

LLMOps diagram showing evaluation

Agentic AI applications are complex and involve many different components. Evaluating their performance is not as straightforward as evaluating traditional ML models, because the qualitative and quantitative metrics used to assess quality are inherently more complex. Agent Evaluation includes proprietary LLM judges and agent metrics to evaluate retrieval and request quality, as well as overall performance metrics like latency and token cost.

How do I use Agent Evaluation?

The following code shows how to call and test Agent Evaluation on previously generated outputs. It returns a dataframe with evaluation scores calculated by LLM judges that are part of Agent Evaluation.

You can copy and paste the following into your existing Databricks notebook:

%pip install mlflow databricks-agents
dbutils.library.restartPython()

import mlflow
import pandas as pd

examples = {
    "request": [
        # Recommended `messages` format
        {
            "messages": [{
                "role": "user",
                "content": "Spark is a data analytics framework."
            }],
        },
        # Primitive string format
        # Note: Using a primitive string is discouraged. The string will be wrapped in the
        # OpenAI messages format before being passed to your agent.
        "How do I convert a Spark DataFrame to Pandas?",
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
    ],
    "retrieved_context": [ # Optional, needed for judging groundedness.
        [{"doc_uri": "doc1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}],
        [{"doc_uri": "doc2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}],
    ],
    "expected_response": [ # Optional, needed for judging correctness.
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
    ]
}

result = mlflow.evaluate(
    data=pd.DataFrame(examples),    # Your evaluation set
    # model=logged_model.model_uri, # If you have an MLflow model. `retrieved_context` and `response` will be obtained from calling the model.
    model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
)

# Review the evaluation results in the MLflow UI (see console output), or access them in place:
display(result.tables['eval_results'])

Alternatively, you can import and run the following notebook in your Databricks workspace:

Mosaic AI Agent Evaluation example notebook


Agent Evaluation inputs and outputs

The following diagram shows an overview of the inputs accepted by Agent Evaluation and the corresponding outputs produced by Agent Evaluation.

Diagram of Agent Evaluation inputs and outputs

Inputs

For details of the expected input for Agent Evaluation, including field names and data types, see the input schema. Some of the fields are the following:

  • User’s query (request): Input to the agent (user’s question or query). For example, “What is RAG?”.

  • Agent’s response (response): Response generated by the agent. For example, “Retrieval augmented generation is …”.

  • Expected response (expected_response): (Optional) A ground truth (correct) response.

  • MLflow trace (trace): (Optional) The agent’s MLflow trace, from which Agent Evaluation extracts intermediate outputs such as the retrieved context or tool calls. Alternatively, you can provide these intermediate outputs directly.
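
As a minimal sketch, a single evaluation row built from the fields listed above might look like the following. The field names come from this article; the helper `check_row` and the choice of which fields it treats as required are hypothetical and are not part of the Agent Evaluation SDK. The input schema remains the authoritative definition.

```python
# Field names are taken from this article. Treating only `request` as
# required is an assumption made for illustration.
REQUIRED_FIELDS = ("request",)
OPTIONAL_FIELDS = ("response", "expected_response", "trace", "retrieved_context")


def check_row(row):
    """Return a list of problems found in one evaluation row (hypothetical helper)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not row.get(name):
            problems.append(f"missing required field: {name}")
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for name in row:
        if name not in known:
            problems.append(f"unknown field: {name}")
    return problems


row = {
    "request": "What is RAG?",
    "response": "Retrieval augmented generation is ...",
}
problems = check_row(row)
print(problems)
```

A check like this can catch misspelled column names before an evaluation run is started.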

Outputs

Based on these inputs, Agent Evaluation produces two types of outputs:

  1. Evaluation Results (per row): For each row provided as input, Agent Evaluation produces a corresponding output row that contains a detailed assessment of your agent’s quality, cost, and latency.

    • LLM judges check different aspects of quality, such as correctness or groundedness, outputting a yes/no score and written rationale for that score. For details, see How quality, cost, and latency are assessed by Agent Evaluation.

    • The LLM judges’ assessments are combined to produce an overall score that indicates whether that row “passes” (is high quality) or “fails” (has a quality issue).

      • For any failing rows, a root cause is identified. Each root cause corresponds to a specific LLM judge’s assessment, allowing you to use the judge’s rationale to identify potential fixes.

    • Cost and latency are extracted from the MLflow trace. For details, see How cost and latency are assessed.

  2. Metrics (aggregate scores): Aggregated scores that summarize the quality, cost, and latency of your agent across all input rows. These include metrics such as the percentage of correct answers, average token count, average latency, and more. For details, see How cost and latency are assessed and How metrics are aggregated at the level of an MLflow run for quality, cost, and latency.
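
The relationship between per-row results and aggregate metrics can be sketched as follows. This is a simplified illustration, not the actual Agent Evaluation logic: the rule that a row passes only when every judge answers yes, the choice of the first failing judge as the root cause, the sample values, and the names `assess_row` and `aggregate` are all assumptions.

```python
def assess_row(judgments):
    """Combine yes/no judge results for one row (hypothetical rule).

    `judgments` maps a judge name to a (score, rationale) pair,
    where score is "yes" or "no".
    """
    failing = [name for name, (score, _) in judgments.items() if score == "no"]
    return {
        "passes": not failing,
        # Assumption: report the first failing judge as the root cause.
        "root_cause": failing[0] if failing else None,
        "rationale": judgments[failing[0]][1] if failing else None,
    }


def aggregate(rows):
    """Summarize per-row results into run-level metrics."""
    n = len(rows)
    return {
        "pass_rate": sum(r["passes"] for r in rows) / n,
        "avg_latency_seconds": sum(r["latency_seconds"] for r in rows) / n,
        "avg_token_count": sum(r["token_count"] for r in rows) / n,
    }


# Illustrative values only; real cost and latency come from the MLflow trace.
row_1 = assess_row({
    "correctness": ("yes", "Matches the expected response."),
    "groundedness": ("yes", "Supported by the retrieved context."),
})
row_1.update(latency_seconds=1.0, token_count=100)

row_2 = assess_row({
    "correctness": ("no", "Contradicts the expected response."),
    "groundedness": ("no", "Not supported by the retrieved context."),
})
row_2.update(latency_seconds=3.0, token_count=300)

summary = aggregate([row_1, row_2])
print(row_2["root_cause"], summary)
```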

Development (offline evaluation) and production (online monitoring)

Agent Evaluation is designed to be consistent between your development (offline) and production (online) environments. This design enables a smooth transition from development to production, allowing you to quickly iterate, evaluate, deploy, and monitor high-quality agentic applications.

The main difference between development and production is that in production, you do not have ground-truth labels, while in development, you may optionally use ground-truth labels. Using ground-truth labels allows Agent Evaluation to compute additional quality metrics.
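
To illustrate how ground-truth labels unlock additional metrics, the sketch below maps the fields present in a row to the judges that could run on it. Only the two dependencies noted in the code example earlier in this article are used (correctness needs `expected_response`, groundedness needs `retrieved_context`). The function `applicable_judges` is hypothetical, and the real set of judges and their requirements is larger.

```python
# Dependencies as noted in the comments of the example earlier in this article.
JUDGE_NEEDS = {
    "correctness": "expected_response",
    "groundedness": "retrieved_context",
}


def applicable_judges(row):
    """Return the judges whose required field is present in the row (hypothetical helper)."""
    return sorted(judge for judge, field in JUDGE_NEEDS.items() if row.get(field))


context = [{"doc_uri": "doc2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}]

# Development row: a ground-truth label is available.
dev_row = {
    "request": "How do I convert a Spark DataFrame to Pandas?",
    "response": "Use toPandas().",
    "retrieved_context": context,
    "expected_response": "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
}

# Production row: same request and response, but no ground-truth label.
prod_row = {k: v for k, v in dev_row.items() if k != "expected_response"}

print(applicable_judges(dev_row))
print(applicable_judges(prod_row))
```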

Development (offline)

Diagram of Agent Evaluation in development

In development, the request and expected_response values come from an evaluation set. An evaluation set is a collection of representative inputs that your agent should be able to handle accurately. For more information about evaluation sets, see Evaluation sets.

To get response and trace, Agent Evaluation can call your agent’s code to generate these outputs for each row in the evaluation set. Alternatively, you can generate these outputs yourself and pass them to Agent Evaluation. See How to provide input to an evaluation run for more information.

Production (online)

Diagram of Agent Evaluation in production

In production, all inputs to Agent Evaluation come from your production logs.

If you use Mosaic AI Agent Framework to deploy your AI application, Agent Evaluation can be configured to automatically collect these inputs from the Agent-enhanced inference tables and continually update a monitoring dashboard. For more details, see How to monitor the quality of your agent on production traffic.

If you deploy your agent outside of Databricks, you can ETL your logs to the required input schema and similarly configure a monitoring dashboard.
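
A minimal sketch of that mapping step, assuming a hypothetical JSON log format. The log keys (`prompt`, `completion`, `docs`, `uri`, `text`) are invented for illustration. The target field names (`request`, `response`, and `retrieved_context` with `doc_uri` and `content`) are the ones used in this article.

```python
import json


def log_to_eval_row(log_line):
    """Convert one JSON log line into an evaluation row (hypothetical log format)."""
    record = json.loads(log_line)
    row = {
        "request": record["prompt"],       # hypothetical log key
        "response": record["completion"],  # hypothetical log key
    }
    docs = record.get("docs") or []        # hypothetical log key
    if docs:
        row["retrieved_context"] = [
            {"doc_uri": d["uri"], "content": d["text"]} for d in docs
        ]
    return row


log_line = json.dumps({
    "prompt": "How do I convert a Spark DataFrame to Pandas?",
    "completion": "Use toPandas().",
    "docs": [{"uri": "doc2.txt", "text": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}],
})
row = log_to_eval_row(log_line)
print(row)
```

Rows produced this way can be collected into a DataFrame and evaluated in the same way as rows from an evaluation set.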

Establish a quality benchmark with an evaluation set

To measure the quality of an AI application in development (offline), you need to define an evaluation set, that is, a set of representative questions and optional ground-truth answers. If the application involves a retrieval step, like in RAG workflows, then you can optionally provide supporting documents that you expect the response to be based on.

For details about evaluation sets, including metric dependencies and best practices, see Evaluation sets. For the required schema, see Agent Evaluation input schema.

Evaluation runs

For details about how to run an evaluation, see How to run an evaluation and view the results. Agent Evaluation supports two options for providing output from the application:

  • You can run the application as part of the evaluation run. The application generates results for each input in the evaluation set.

  • You can provide output from a previous run of the application.

For details and explanation of when to use each option, see How to provide input to an evaluation run.
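
The two options can be sketched around the `mlflow.evaluate` call shown earlier in this article. The helpers `evaluation_kwargs` and `run_evaluation` are hypothetical, as is the model URI in the usage lines. The `mlflow` import is deferred so that the logic that chooses between the two options can be read and tested without a Databricks environment.

```python
def evaluation_kwargs(data, model_uri=None):
    """Build keyword arguments for mlflow.evaluate (hypothetical helper).

    `data` is the evaluation set, for example a pandas DataFrame.
    """
    kwargs = {"data": data, "model_type": "databricks-agent"}
    if model_uri is not None:
        # Option 1: the application runs as part of the evaluation run.
        kwargs["model"] = model_uri
    elif "response" not in data:
        # Option 2 needs previously generated output in the data itself.
        raise ValueError("Provide a model URI or a `response` column.")
    return kwargs


def run_evaluation(data, model_uri=None):
    """Run Agent Evaluation; requires mlflow and databricks-agents to be installed."""
    import mlflow

    return mlflow.evaluate(**evaluation_kwargs(data, model_uri))


# Option 2: output from a previous run is already in the data.
precomputed = {"request": ["What is RAG?"], "response": ["Retrieval augmented generation is ..."]}
print(evaluation_kwargs(precomputed))

# Option 1: only requests are supplied and the agent is called to answer them.
# The URI below is a placeholder, not a real model.
print(evaluation_kwargs({"request": ["What is RAG?"]}, model_uri="models:/my_agent/1"))
```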

Get human feedback about the quality of a GenAI application

The Databricks review app makes it easy to gather feedback about the quality of an AI application from human reviewers. For details, see Get feedback about the quality of an agentic application.

Geo availability of Agent Evaluation

Mosaic AI Agent Evaluation is a Designated Service that uses Geos to manage data residency when processing customer content. To learn more about the availability of Agent Evaluation in different geographic areas, see Databricks Designated Services.

Pricing

For pricing information, see Mosaic AI Agent Evaluation pricing.

Limitation

Agent Evaluation is not available in HIPAA-enabled workspaces.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.

  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.

  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.

  • Disabling Partner-powered AI assistive features prevents the LLM judge from calling Partner-powered models.

  • Data sent to the LLM judge is not used for any model training.

  • LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.