Evaluation harness

The evaluation harness serves as a testing framework for GenAI applications. Instead of manually running your app and checking outputs one by one, the harness provides a structured way to feed in test data, run your app, and automatically score the results. This makes it easier to compare versions, track improvements, and share results across teams.

With MLflow, the evaluation harness connects offline testing with production monitoring. That means the same evaluation logic you use in development can also run in production, giving you a consistent view of quality across the entire AI lifecycle.

The mlflow.genai.evaluate() function systematically tests GenAI app quality by running the app against test data (evaluation datasets) and applying scorers to the results.

If you're new to evaluation, start with Tutorial: Evaluate and improve a GenAI application.

When to use

  • Nightly or weekly checks of your app against curated evaluation datasets
  • Validating prompt or model changes across app versions
  • Before a release or PR to prevent quality regressions

Process workflow

The evaluation harness does the following:

  1. Runs your app on test inputs, capturing traces.
  2. Applies scorers to assess quality, creating feedback.
  3. Stores results in an Evaluation Run.

Quick reference

For API details, see Parameters for mlflow.genai.evaluate() or the MLflow documentation.

The mlflow.genai.evaluate() function runs your GenAI app against an evaluation dataset using specified scorers and optionally a prediction function or model ID, returning an EvaluationResult.

Python
def mlflow.genai.evaluate(
    data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],  # Test data.
    scorers: list[mlflow.genai.scorers.Scorer],  # Quality metrics, built-in or custom.
    predict_fn: Optional[Callable[..., Any]] = None,  # App wrapper. Used for direct evaluation only.
    model_id: Optional[str] = None,  # Optional version tracking.
) -> mlflow.models.evaluation.base.EvaluationResult:

Prerequisites

  1. Install MLflow and required packages.

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
  2. Create an MLflow experiment by following the Set up your environment quickstart.
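
If you prefer to set up the experiment in code, the following is a minimal sketch; it assumes Databricks authentication is already configured, and the experiment path is a placeholder.

Python
import mlflow

# Point the MLflow client at your Databricks workspace
# (assumes DATABRICKS_HOST/DATABRICKS_TOKEN or a configured profile).
mlflow.set_tracking_uri("databricks")

# Create the experiment if it does not exist and make it active.
# The path below is a placeholder; use a path in your own workspace.
mlflow.set_experiment("/Users/someone@example.com/genai-evaluation")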

(Optional) Configure parallelization

By default, MLflow uses a background thread pool to speed up evaluation. To configure the number of workers, set the environment variable MLFLOW_GENAI_EVAL_MAX_WORKERS.

Bash
export MLFLOW_GENAI_EVAL_MAX_WORKERS=10

Evaluation modes

There are two evaluation modes: direct evaluation and answer sheet evaluation.

Direct evaluation

MLflow calls your GenAI app directly to generate and evaluate traces. You can either pass your application's entry point wrapped in a Python function (predict_fn) or, if your app is deployed as a Databricks Model Serving endpoint, pass that endpoint wrapped in to_predict_fn.

Because this mode calls your app directly, the traces it produces are identical to the traces generated in production, so the scorers you define for offline evaluation can be reused for production monitoring.

As shown in the diagram, data, your app, and selected scorers are provided as inputs to mlflow.genai.evaluate(), which runs the app and scorers in parallel and records output as traces and feedback.

How evaluate works with tracing
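
If your app is deployed as a Databricks Model Serving endpoint, the following is a minimal sketch of wrapping it with to_predict_fn; the endpoint name is a placeholder, and the inputs must match your endpoint's request schema.

Python
import mlflow
from mlflow.genai.scorers import Safety

# Wrap a Model Serving endpoint so the harness can call it like a local function.
# "my-chat-endpoint" is a placeholder; replace it with your endpoint name.
predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-chat-endpoint")

results = mlflow.genai.evaluate(
    data=[
        # Each inputs dict is passed to the endpoint, so it must match the
        # endpoint's request schema (a chat-style payload in this sketch).
        {"inputs": {"messages": [{"role": "user", "content": "What is MLflow?"}]}},
    ],
    predict_fn=predict_fn,
    scorers=[Safety()],
)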

Data formats for direct evaluation

For schema details, see Evaluation dataset reference.

| Field | Data type | Required | Description |
|---|---|---|---|
| inputs | dict[Any, Any] | Yes | Dictionary passed to your predict_fn |
| expectations | dict[str, Any] | No | Optional ground truth for scorers |

Example using direct evaluation

The following code shows an example of how to run the evaluation:

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app with MLflow tracing
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    if "MLflow" in question:
        response = "MLflow is an open-source platform for managing ML and GenAI workflows."
    else:
        response = "I can help you with MLflow questions."

    return {"response": response}

# Evaluate your app
results = mlflow.genai.evaluate(
    data=[
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "How do I get started?"}},
    ],
    predict_fn=my_chatbot_app,
    scorers=[RelevanceToQuery(), Safety()],
)

You can then view the results in the UI:

Evaluation results
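
You can also inspect the results programmatically. Continuing the example above, this is a minimal sketch; it assumes the returned EvaluationResult exposes aggregate metrics and, in recent MLflow versions, the ID of the evaluation run.

Python
# Aggregate scores per scorer, for example mean relevance and safety.
print(results.metrics)

# Fetch the evaluated traces, including the feedback attached by each scorer.
# Assumes the result object exposes the evaluation run ID as `run_id`.
traces = mlflow.search_traces(run_id=results.run_id)
print(traces.head())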

Answer sheet evaluation

Use this mode when you can't, or don't want to, run your GenAI app directly during evaluation, for example when you already have outputs from external systems, historical traces, or batch jobs and you just want to score them. You provide the inputs and outputs, and the harness runs the scorers and logs an evaluation run.

important

If the answer-sheet traces differ from the traces your app produces in production, you may need to rewrite your scorer functions before reusing them for production monitoring.

As shown in the diagram, you provide evaluation data and selected scorers as inputs to mlflow.genai.evaluate(). Evaluation data can consist of existing traces, or of inputs and pre-computed outputs. If inputs and pre-computed outputs are provided, mlflow.genai.evaluate() constructs traces from the inputs and outputs. For both input options, mlflow.genai.evaluate() runs the scorers on the traces and outputs feedback from the scorers.

How evaluate works with answer sheet

Data formats for answer sheet evaluation

For schema details, see Evaluation dataset reference.

If inputs and outputs are provided

| Field | Data type | Required | Description |
|---|---|---|---|
| inputs | dict[Any, Any] | Yes | Original inputs to your GenAI app |
| outputs | dict[Any, Any] | Yes | Pre-computed outputs from your app |
| expectations | dict[str, Any] | No | Optional ground truth for scorers |

If existing traces are provided

| Field | Data type | Required | Description |
|---|---|---|---|
| trace | mlflow.entities.Trace | Yes | MLflow Trace objects with inputs/outputs |
| expectations | dict[str, Any] | No | Optional ground truth for scorers |

Example using inputs and outputs

The following code shows an example of how to run the evaluation:

Python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Pre-computed results from your GenAI app
results_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform for managing machine learning workflows, including tracking experiments, packaging code, and deploying models."},
    },
    {
        "inputs": {"question": "How do I get started?"},
        "outputs": {"response": "To get started with MLflow, install it using 'pip install mlflow' and then run 'mlflow ui' to launch the web interface."},
    },
]

# Evaluate pre-computed outputs
evaluation = mlflow.genai.evaluate(
    data=results_data,
    scorers=[Safety(), RelevanceToQuery()],
)

You can then view the results in the UI:

Evaluation results

Example using existing traces

The following code shows how to run the evaluation using existing traces:

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Retrieve completed traces from production
traces = mlflow.search_traces(
    filter_string="trace.status = 'OK'",
)

# Evaluate the retrieved traces
evaluation = mlflow.genai.evaluate(
    data=traces,
    scorers=[Safety(), RelevanceToQuery()],
)

Parameters for mlflow.genai.evaluate()

This section describes each of the parameters used by mlflow.genai.evaluate().

Python
def mlflow.genai.evaluate(
    data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],  # Test data.
    scorers: list[mlflow.genai.scorers.Scorer],  # Quality metrics, built-in or custom.
    predict_fn: Optional[Callable[..., Any]] = None,  # App wrapper. Used for direct evaluation only.
    model_id: Optional[str] = None,  # Optional version tracking.
) -> mlflow.models.evaluation.base.EvaluationResult:

data

The evaluation dataset must be in one of the following formats:

  • EvaluationDataset (recommended).
  • List of dictionaries, Pandas DataFrame, or Spark DataFrame.

If the data argument is provided as a DataFrame or list of dictionaries, it must use the schema below, which matches the schema used by EvaluationDataset. Databricks recommends using an EvaluationDataset because it enforces schema validation and tracks the lineage of each record.

| Field | Data type | Description | Use with direct evaluation | Use with answer sheet |
|---|---|---|---|---|
| inputs | dict[Any, Any] | A dict that is passed to your predict_fn using **kwargs. Must be JSON serializable. Each key must correspond to a named argument in predict_fn. | Required | Either inputs + outputs or trace is required; cannot pass both. Derived from the trace if not provided. |
| outputs | dict[Any, Any] | A dict with the outputs of your GenAI app for the corresponding input. Must be JSON serializable. | Must not be provided; generated by MLflow from the trace. | Either inputs + outputs or trace is required; cannot pass both. Derived from the trace if not provided. |
| expectations | dict[str, Any] | A dict with ground-truth labels for the corresponding input. Used by scorers to check quality. Must be JSON serializable and each key must be a str. | Optional | Optional |
| trace | mlflow.entities.Trace | The trace object for the request. If a trace is provided, expectations can be attached to the trace as Assessments rather than passed in a separate field. | Must not be provided; generated by MLflow when it runs your app. | Either inputs + outputs or trace is required; cannot pass both. |
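
The following is a minimal sketch of creating a managed EvaluationDataset and merging records into it; the Unity Catalog table name is a placeholder, and the create_dataset / merge_records calls reflect the Databricks-managed MLflow 3 API.

Python
import mlflow
import mlflow.genai.datasets
from mlflow.genai.scorers import Safety

# Create (or connect to) an evaluation dataset backed by a Unity Catalog table.
# The three-level table name is a placeholder.
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="main.my_schema.chatbot_eval",
)

# Merge records that follow the schema above (inputs required, expectations optional).
dataset.merge_records([
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is an open-source platform"]},
    },
])

# Pass the dataset as the data argument, together with your app's predict_fn.
# `my_chatbot_app` refers to the traced app defined in the earlier example.
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_chatbot_app,
    scorers=[Safety()],
)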

scorers

List of quality metrics to apply. You can provide built-in scorers, custom scorers, or both.

See Scorers for more details.
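
For illustration, the following is a minimal sketch that mixes a built-in scorer with a custom scorer defined through the @scorer decorator; the scorer name and its pass/fail logic are purely illustrative.

Python
import mlflow
from mlflow.genai.scorers import Safety, scorer

# A custom scorer: a decorated function that receives fields such as `outputs`
# and returns a number, boolean, string, or Feedback. The check below is illustrative.
@scorer
def mentions_mlflow(outputs: dict) -> bool:
    return "MLflow" in outputs.get("response", "")

results = mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": {"response": "MLflow is an open-source platform."},
        },
    ],
    scorers=[Safety(), mentions_mlflow],
)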

predict_fn

The GenAI app's entry point. This parameter is only used with direct evaluation. predict_fn must meet the following requirements:

  • Accept the keys from the inputs dictionary in data as keyword arguments.
  • Return a JSON-serializable dictionary.
  • Be instrumented with MLflow Tracing.
  • Emit exactly one trace per call.
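
For instance, the following sketch shows a predict_fn with two named arguments; each key of the inputs dictionary maps to one of them. The function body is illustrative only.

Python
import mlflow
from mlflow.genai.scorers import Safety

@mlflow.trace  # instrumented with MLflow Tracing, emitting one trace per call
def answer_with_context(question: str, context: str) -> dict:
    # Illustrative logic only; a real app would call an LLM here.
    return {"response": f"Answering '{question}' using the provided context: {context}"}

results = mlflow.genai.evaluate(
    data=[
        # The keys "question" and "context" match the named arguments of predict_fn.
        {"inputs": {"question": "What is MLflow?", "context": "Excerpt from the MLflow docs."}},
    ],
    predict_fn=answer_with_context,
    scorers=[Safety()],
)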

model_id

Optional model identifier to link results to your app version (for example, "models:/my-app/1"). See Version Tracking for more details.
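
For instance, continuing the direct evaluation example above, the sketch below links the evaluation run to an app version; the model URI is illustrative.

Python
from mlflow.genai.scorers import RelevanceToQuery, Safety

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_chatbot_app,  # the traced app from the direct evaluation example
    scorers=[RelevanceToQuery(), Safety()],
    model_id="models:/my-app/1",  # illustrative URI linking results to this app version
)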

Next steps