Introduction to Mosaic AI Agent Evaluation


This feature is in Public Preview.

This article describes Mosaic AI Agent Evaluation. Agent Evaluation enables developers to quickly and reliably evaluate the quality, latency, and cost of agentic generative AI applications, including simpler forms such as RAG applications and chains. The capabilities of Agent Evaluation are unified across the development, staging, and production phases of the MLOps life cycle, and all evaluation metrics and data are logged to MLflow Runs.

Agentic applications are complex and involve many different components. Evaluating their performance is not as straightforward as evaluating the performance of traditional ML models, and the qualitative and quantitative metrics used to evaluate quality are inherently more complex. This article gives an overview of how to work with Agent Evaluation and includes links to articles with more detail.

Agent Evaluation example

The following code shows how to call and test Agent Evaluation on previously generated outputs. It returns a dataframe with evaluation scores calculated by LLM judges that are part of Agent Evaluation.

You can copy and paste the following into your existing Databricks notebook:

%pip install mlflow databricks-agents

import mlflow
import pandas as pd

examples = {
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
    ],
    "retrieved_context": [ # Optional, needed for judging groundedness.
        [{"doc_uri": "doc1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}],
        [{"doc_uri": "doc2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}],
    ],
    "expected_response": [ # Optional, needed for judging correctness.
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
    ],
}

result = mlflow.evaluate(
    data=pd.DataFrame(examples),    # Your evaluation set
    # model=logged_model.model_uri, # If you have an MLflow model. `retrieved_context` and `response` will be obtained from calling the model.
    model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
)

# Review the evaluation results in the MLflow UI (see console output), or access them in place:
display(result.tables['eval_results'])

Alternatively, you can import and run the following notebook in your Databricks workspace:

Mosaic AI Agent Evaluation example notebook


Establish ground truth with an evaluation set

To measure the quality of an agentic application, you need to define what a high-quality, accurate response looks like. To do that, you create an evaluation set, which is a set of representative questions and ground-truth answers. If the application involves a retrieval step, like in RAG workflows, then you can optionally provide supporting documents that you expect the response to be based on.
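As a rough illustration of the paragraph above, an evaluation set can be sketched as a list of records, each pairing a representative question with its ground-truth answer. The field names `request` and `expected_response` match the example earlier in this article. The `expected_docs` field and the `validate_eval_set` helper are hypothetical names used only for this sketch; they are not part of Agent Evaluation, and the actual schema may differ.

```python
from typing import Any

# Illustrative evaluation set: representative questions with ground-truth answers.
eval_set: list[dict[str, Any]] = [
    {
        "request": "What is Spark?",
        "expected_response": "Spark is a data analytics framework.",
        # Optional supporting documents the response should be based on.
        # "expected_docs" is a hypothetical field name for this sketch.
        "expected_docs": ["doc1.txt"],
    },
    {
        "request": "How do I convert a Spark DataFrame to Pandas?",
        "expected_response": "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
    },
]


def validate_eval_set(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found; an empty list means the set looks usable."""
    problems = []
    for i, row in enumerate(rows):
        request = row.get("request")
        if not isinstance(request, str) or not request.strip():
            problems.append(f"row {i}: missing or empty 'request'")
        if "expected_response" not in row:
            problems.append(f"row {i}: no 'expected_response' (ground truth)")
    return problems


problems = validate_eval_set(eval_set)
```

A quick check of this kind before running an evaluation can catch rows that have no question or no ground-truth answer to compare against.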

For details about evaluation sets, including the schema, metric dependencies, and best practices, see Evaluation sets.

Assess performance with the right metrics

Evaluating an AI application requires several sets of metrics, including:

  • Response metrics, which measure whether the response is accurate, consistent with the retrieved context (if any), and relevant to the input request.

  • Retrieval metrics, which measure whether the retrieval steps (if any) returned chunks that are relevant to the input request.

  • Performance metrics, which measure the number of tokens across all LLM generation calls and the latency in seconds for the trace.
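To make the performance metrics above concrete, the following minimal sketch sums tokens across all LLM generation calls in a trace and computes the trace latency in seconds. The `LLMCall` and `Trace` classes and the `performance_metrics` function are hypothetical stand-ins invented for this illustration. They do not reflect how Agent Evaluation or MLflow represent traces internally.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """One LLM generation call inside a trace (hypothetical structure)."""
    input_tokens: int
    output_tokens: int


@dataclass
class Trace:
    """The trace for a single request (hypothetical structure)."""
    start_time_s: float
    end_time_s: float
    llm_calls: list[LLMCall]


def performance_metrics(trace: Trace) -> dict[str, float]:
    """Sum token counts over all LLM calls and compute end-to-end latency."""
    total_input = sum(call.input_tokens for call in trace.llm_calls)
    total_output = sum(call.output_tokens for call in trace.llm_calls)
    return {
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "total_tokens": total_input + total_output,
        "latency_seconds": trace.end_time_s - trace.start_time_s,
    }


# Example: a trace with two generation calls that took 2.5 seconds overall.
trace = Trace(
    start_time_s=10.0,
    end_time_s=12.5,
    llm_calls=[LLMCall(120, 30), LLMCall(200, 50)],
)
metrics = performance_metrics(trace)
```

Here `metrics["total_tokens"]` is 400 and `metrics["latency_seconds"]` is 2.5.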

For details about metrics and LLM judges, see Use agent metrics & LLM judges to evaluate app performance.

Evaluation runs

For details about how to run an evaluation, see How to run an evaluation and view the results. Agent Evaluation supports two options for providing output from the application:

  • You can run the application as part of the evaluation run. The application generates results for each input in the evaluation set.

  • You can provide output from a previous run of the application.
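The difference between the two options can be sketched in plain Python. In the example earlier in this article, the commented-out `model` argument corresponds to the first option, and the `response` column in the data corresponds to the second. The `collect_responses` function and the `toy_app` callable below are hypothetical and only mimic that choice; they are not part of the Agent Evaluation API.

```python
from typing import Callable, Optional


def collect_responses(
    rows: list[dict],
    app: Optional[Callable[[str], str]] = None,
) -> list[dict]:
    """Return rows with a 'response' for each request.

    If `app` is given, call it for every request (option 1).
    Otherwise reuse the 'response' already stored in each row (option 2).
    """
    results = []
    for i, row in enumerate(rows):
        if app is not None:
            response = app(row["request"])
        elif "response" in row:
            response = row["response"]
        else:
            raise ValueError(f"row {i}: no application given and no stored 'response'")
        results.append({**row, "response": response})
    return results


def toy_app(request: str) -> str:
    """Hypothetical stand-in for an agentic application."""
    return f"Answer to: {request}"


# Option 1: run the application as part of the evaluation run.
generated = collect_responses([{"request": "What is Spark?"}], app=toy_app)

# Option 2: provide output from a previous run of the application.
precomputed = collect_responses(
    [{"request": "What is Spark?", "response": "Spark is a data analytics framework."}]
)
```

Running the application during evaluation keeps responses in sync with the current version of the code, while reusing stored output avoids calling the application again.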

For details and explanation of when to use each option, see How to provide input to an evaluation run.

Get human feedback about the quality of a GenAI application

The Databricks review app makes it easy to gather feedback about the quality of an agentic application from human reviewers. For details, see Get feedback about the quality of an agentic application.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.

  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.

  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.

  • Disabling Partner-Powered AI assistive features will prevent the LLM judge from calling Partner-Powered models.

  • Data sent to the LLM judge is not used for any model training.

  • LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.