Agent Evaluation input schema (MLflow 2)

important

Databricks recommends using MLflow 3 for evaluating and monitoring GenAI apps. This page describes MLflow 2 Agent Evaluation.

For an introduction to evaluation and monitoring on MLflow 3, see Evaluate and monitor AI agents.
For information about migrating to MLflow 3, see Migrate to MLflow 3 from Agent Evaluation.
For MLflow 3 information on this topic, see Building MLflow evaluation datasets.

This article explains the input schema required by Agent Evaluation to assess your application's quality, cost, and latency.

During development, evaluation takes place offline, and an evaluation set is a required input to Agent Evaluation.
When an application is in production, all inputs to Agent Evaluation come from your inference tables or production logs.

The input schema is identical for both online and offline evaluations.

For general information about evaluation sets, see Evaluation sets (MLflow 2).

Evaluation input schema

The following table shows Agent Evaluation's input schema. The last two columns of the table refer to how input is provided to the mlflow.evaluate() call. See Provide inputs to an evaluation run for details.

Column	Data type	Description	Application passed as input argument	Previously generated outputs provided
request_id	string	Unique identifier of request.	Optional	Optional
request	See Schema for request.	Input to the application to evaluate, user's question or query. For example, `{'messages': [{"role": "user", "content": "What is RAG"}]}` or “What is RAG?”. When `request` is provided as a string, it will be transformed to `messages` before it is passed to your agent.	Required	Required
response	See Schema for response.	Response generated by the application being evaluated.	Generated by Agent Evaluation	Optional. If not provided then derived from the Trace. Either `response` or `trace` is required.
expected_facts	array of string	A list of facts that are expected in the model output. See `expected_facts` guidelines.	Optional	Optional
expected_response	string	Ground-truth (correct) answer for the input request. See `expected_response` guidelines.	Optional	Optional
guidelines	`guidelines` guidelines	A named dict or list of guidelines that the model's output is expected to adhere to. See `guidelines` guidelines.	Optional	Optional
expected_retrieved_context	array	Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). Array schema	Optional	Optional
retrieved_context	array	Retrieval results generated by the retriever in the application being evaluated. If multiple retrieval steps are in the application, this is the retrieval results from the last step (chronologically in the trace). Array schema	Generated by Agent Evaluation	Optional. If not provided then derived from the provided trace.
trace	JSON string of MLflow Trace	MLflow Trace of the application's execution on the corresponding request.	Generated by Agent Evaluation	Optional. Either `response` or `trace` is required.

`expected_facts` guidelines

The expected_facts field specifies the list of facts that is expected to appear in any correct model response for the specific input request. That is, a model response is deemed correct if it contains these facts, regardless of how the response is phrased.

Including only the required facts, and leaving out facts that are not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

You can specify at most one of expected_facts and expected_response. If you specify both, an error will be reported. Databricks recommends using expected_facts, as it is a more specific guideline that helps Agent Evaluation judge more effectively the quality of generated responses.

`guidelines` guidelines

The guidelines field specifies a set guidelines that any correct model response must adhere to. guidelines can be expressed in two formats:

List of guidelines (List[str]) provides a single set of guidelines.
Named guidelines (Dict[str, List[str]]) provides a mapping a name of guideline to an array of guidelines for that name. Named guidelines requires databricks-agents >= 0.16.0.

Guidelines can refer to various traits of the response, including stylistic or content-related elements. For the most robust signal on guideline adherence, Databricks recommends using the following language:

“The response must …”
“The response must not …”
“The response may optionally …”

Specifically, you should refer to the request and response directly and leave as little ambiguity as possible in the guidelines. For guidelines that apply to your entire evaluation set, such as ensuring the responses have a professional tone or are always in English, use the global_guidelines parameter in the evaluator configuration as follows:

Python
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        # Note: You can also just pass an array to `guidelines`.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]

mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            # Note: You can also just pass an array to `guidelines`.
            "global_guidelines": {
                "english": ["The response must be in English"],
                "clarity": ["The response must be clear, coherent, and concise"],
            }
        }
    }
)

`expected_response` guidelines

The expected_response field contains a fully formed response that represents a reference for correct model responses. That is, a model response is deemed correct if it matches the information content in expected_response. In contrast, expected_facts lists only the facts that are required to appear in a correct response and is not a fully formed reference response.

Similar to expected_facts, expected_response should contain only the minimal set of facts that is required for a correct response. Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

Schema for request

The request schema can be one of the following:

An arbitrary serializable dictionary (for example, Dict[str, Any])
If the agent supports the OpenAI chat completion schema, you can pass a plain string. This format supports single-turn conversations only. Plain strings are converted to the messages format with "role": "user" before being passed to your agent. For example, a plain string "What is MLflow?" is converted to {"messages": [{"role": "user", "content": "What is MLflow?"}]} before being passed to your agent.

Note that the built-in judges work best with any format using a OpenAI chat completion schema. The OpenAI chat completion schema must have an array of objects as a messages parameter. The messages field can encode the full conversation.

The following example shows a few possible options in the same request column of the evaluation dataset:

Python
import pandas as pd

data = {
  "request": [

      # Plain string. Plain strings are transformed to the `messages` format before being passed to your agent.
      "What is the difference between reduceByKey and groupByKey in Spark?",

      # OpenAI chat completion schema. Use the `messages` field for a single- or multi-turn chat.
      {
          "messages": [
              {
                  "role": "user",
                  "content": "How can you minimize data shuffling in Spark?"
              }
          ]
      },

      # SplitChatMessagesRequest. Use the `query` and `history` fields for a single- or multi-turn chat.
      {
          "query": "Explain broadcast variables in Spark. How do they enhance performance?",
          "history": [
              {
                  "role": "user",
                  "content": "What are broadcast variables?"
              },
              {
                  "role": "assistant",
                  "content": "Broadcast variables allow the programmer to keep a read-only variable cached on each machine."
              }
          ]
      },

      # Arbitrary format. These must be JSON-serializable and are passed directly to your agent.
      {
        "message_history": [
            {
                "user_0": "What are broadcast variables?",
                "assistant_0": "Broadcast variables allow the programmer to keep a read-only variable cached on each machine.",
            }
        ],
        "last_user_request": "How can you minimize data shuffling in Spark?"
      },
  ],

  "expected_response": [
    "expected response for first question",
    "expected response for second question",
    "expected response for third question",
    "expected response for fourth question",
  ]
}

eval_dataset = pd.DataFrame(data)

Schema for response

The response schema, similar to the request schema, can be one of the following:

An arbitrary serializable dictionary (for example, Dict[str, Any]).
If the agent supports the OpenAI chat completion schema, you can pass a plain string. This format supports single-turn conversations only. Plain strings are converted to the choices format. For example, a plain string "MLFlow is a framework." is converted to {"choices": [{"message": {"content": "MLFlow is a framework."}}]}.

Schema for arrays in evaluation input

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

Column	Data type	Description	Application passed as input argument	Previously generated outputs provided
content	string	Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown.	Optional	Optional
doc_uri	string	Unique identifier (URI) of the parent document where the chunk came from.	Required	Required

Computed metrics

The columns in the following table indicate the data included in the input, and ✓ indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see How quality, cost, and latency are assessed by Agent Evaluation (MLflow 2).

Calculated metrics	`request`	`request` and `expected_response`	`request`, `expected_response`, `expected_retrieved_context`, and `guidelines`	`request` and `expected_retrieved_context`	`request` and `guidelines`
`response/llm_judged/relevance_to_query/rating`	✓	✓	✓
`response/llm_judged/safety/rating`	✓	✓	✓
`response/llm_judged/groundedness/rating`	✓	✓	✓
`retrieval/llm_judged/chunk_relevance_precision`	✓	✓	✓
`agent/total_token_count`	✓	✓	✓
`agent/input_token_count`	✓	✓	✓
`agent/output_token_count`	✓	✓	✓
`response/llm_judged/correctness/rating`		✓	✓
`retrieval/llm_judged/context_sufficiency/rating`		✓	✓
`retrieval/ground_truth/document_recall`			✓	✓
`response/llm_judged/guideline_adherence/rating`			✓		✓

Evaluation input schema​

expected_facts guidelines​

guidelines guidelines​

expected_response guidelines​

Schema for request​

Schema for response​

Schema for arrays in evaluation input​

Computed metrics​