Evaluation harness

The mlflow.genai.evaluate() function systematically tests the quality of your GenAI app by running it against test data (evaluation datasets) and applying scorers.

Quick reference

For details, see mlflow.genai.evaluate().

| Parameter | Type | Description |
| --- | --- | --- |
| data | MLflow EvaluationDataset, List[Dict], Pandas DataFrame, or Spark DataFrame | Test data |
| predict_fn | Callable | Your app (direct evaluation only) |
| scorers | List[Scorer] | Quality metrics |
| model_id | str | Optional version tracking |

How it works

  1. Runs your app on test inputs, capturing traces.
  2. Applies scorers to assess quality, creating Feedback.
  3. Stores results in an Evaluation Run.
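A minimal sketch of that flow (the app and data below are illustrative):

Python
import mlflow
from mlflow.genai.scorers import Safety

@mlflow.trace  # step 1: MLflow runs this app on each test input and captures a trace
def my_app(question: str) -> dict:
    return {"response": f"You asked: {question}"}

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[Safety()],  # step 2: each scorer assesses the trace and records Feedback
)

# Step 3: aggregate metrics from the Evaluation Run
print(results.metrics)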

Prerequisites

  1. Install MLflow and required packages.

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
  2. Create an MLflow experiment by following the set up your environment quickstart (a minimal setup is sketched after this list).
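If you are running outside a Databricks notebook, a minimal setup might look like the following; the experiment path is a placeholder for a path in your own workspace:

Python
import mlflow

# Point MLflow at your Databricks workspace (uses your configured credentials)
mlflow.set_tracking_uri("databricks")

# Create the experiment if it doesn't exist, then set it as the active experiment
mlflow.set_experiment("/Users/someone@example.com/genai-evaluation")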

Evaluation modes

There are two evaluation modes: direct evaluation and answer sheet evaluation.

Direct evaluation

MLflow calls your GenAI app directly to generate and evaluate traces. You can either pass your application's entry point wrapped in a Python function (predict_fn) or, if your app is deployed as a Databricks Model Serving endpoint, pass that endpoint wrapped in to_predict_fn.

Because MLflow calls your app directly, this mode lets you reuse the scorers defined for offline evaluation in production monitoring, since the resulting traces are identical.

How evaluate works with tracing

The following code shows an example of how to run the evaluation:

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app with MLflow tracing
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    if "MLflow" in question:
        response = "MLflow is an open-source platform for managing ML and GenAI workflows."
    else:
        response = "I can help you with MLflow questions."

    return {"response": response}

# Evaluate your app
results = mlflow.genai.evaluate(
    data=[
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "How do I get started?"}},
    ],
    predict_fn=my_chatbot_app,
    scorers=[RelevanceToQuery(), Safety()],
)

You can then view the results in the UI:

Evaluation results

Answer sheet evaluation

When you can't run your GenAI app directly, you can provide existing traces or pre-computed outputs for evaluation. Example use cases include testing outputs from external systems, evaluating historical traces, and comparing outputs across different platforms.

important

If you use an answer sheet whose traces differ from the traces your production environment generates, you may need to rewrite your scorer functions before you can reuse them for production monitoring.

How evaluate works with an answer sheet

Example using inputs and outputs

The following code shows an example of how to run the evaluation:

Python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Pre-computed results from your GenAI app
results_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform for managing machine learning workflows, including tracking experiments, packaging code, and deploying models."},
    },
    {
        "inputs": {"question": "How do I get started?"},
        "outputs": {"response": "To get started with MLflow, install it using 'pip install mlflow' and then run 'mlflow ui' to launch the web interface."},
    },
]

# Evaluate pre-computed outputs
evaluation = mlflow.genai.evaluate(
    data=results_data,
    scorers=[Safety(), RelevanceToQuery()],
)

You can then view the results in the UI:

Evaluation results

Example using existing traces

The following code shows how to run the evaluation using existing traces:

Python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Retrieve completed traces from production
traces = mlflow.search_traces(
    filter_string="trace.status = 'OK'",
)

# Evaluate the retrieved traces
evaluation = mlflow.genai.evaluate(
    data=traces,
    scorers=[Safety(), RelevanceToQuery()],
)

Key parameters

Python
def mlflow.genai.evaluate(
    data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],
    scorers: list[mlflow.genai.scorers.Scorer],
    predict_fn: Optional[Callable[..., Any]] = None,
    model_id: Optional[str] = None,
) -> mlflow.models.evaluation.base.EvaluationResult:

data

The evaluation dataset must be in one of the following formats:

  • EvaluationDataset (recommended).
  • List of dictionaries, Pandas DataFrame, or Spark DataFrame.

If the data argument is provided as a DataFrame or list of dictionaries, it must follow the schema below, which is consistent with the schema used by EvaluationDataset. Databricks recommends using an EvaluationDataset because it enforces schema validation and tracks the lineage of each record.

| Field | Data type | Description | Use with direct evaluation | Use with answer sheet |
| --- | --- | --- | --- | --- |
| inputs | dict[Any, Any] | A dict that is passed to your predict_fn using **kwargs. Must be JSON serializable. Each key must correspond to a named argument in predict_fn. | Required | Either inputs + outputs or trace is required. Cannot pass both. Derived from trace if not provided. |
| outputs | dict[Any, Any] | A dict with the outputs of your GenAI app for the corresponding input. Must be JSON serializable. | Must not be provided; generated by MLflow from the Trace. | Either inputs + outputs or trace is required. Cannot pass both. Derived from trace if not provided. |
| expectations | dict[str, Any] | A dict with ground-truth labels corresponding to the input. Used by scorers to check quality. Must be JSON serializable, and each key must be a str. | Optional | Optional |
| trace | mlflow.entities.Trace | The trace object for the request. If the trace is provided, expectations can be provided as Assessments on the trace rather than as a separate column. | Must not be provided; generated by MLflow. | Either inputs + outputs or trace is required. Cannot pass both. |

predict_fn

The GenAI app's entry point. This parameter is used only with direct evaluation. predict_fn must meet the following requirements (see the sketch after this list):

  • Accept the keys from the inputs dictionary in data as keyword arguments.
  • Return a JSON-serializable dictionary.
  • Be instrumented with MLflow Tracing.
  • Emit exactly one trace per call.
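A minimal sketch of a conforming predict_fn (the function name and logic are illustrative):

Python
import mlflow

@mlflow.trace  # instrumented with MLflow Tracing; emits exactly one trace per call
def my_predict_fn(question: str) -> dict:
    # "question" must match a key in the "inputs" dict of each evaluation record
    answer = f"You asked: {question}"
    # Return a JSON-serializable dictionary
    return {"response": answer}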

scorers

List of quality metrics to apply. You can provide predefined scorers, custom scorers, or a mix of both.

See Scorers for more details.
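As a sketch of combining the two (the custom scorer below is hypothetical and assumes the scorer decorator from mlflow.genai.scorers; see Scorers for the exact interface):

Python
from mlflow.genai.scorers import Safety, RelevanceToQuery, scorer

# Hypothetical custom scorer: passes if the response stays reasonably concise
@scorer
def is_concise(outputs: dict) -> bool:
    return len(outputs.get("response", "")) <= 500

# Mix predefined and custom scorers in the same evaluation
my_scorers = [Safety(), RelevanceToQuery(), is_concise]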

model_id

Optional model identifier to link results to your app version (for example, "models:/my-app/1"). See Version Tracking for more details.
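A minimal sketch (the model URI and app are placeholders):

Python
import mlflow
from mlflow.genai.scorers import Safety

@mlflow.trace
def my_app(question: str) -> dict:
    return {"response": f"You asked: {question}"}

# Link the evaluation results to app version 1 (placeholder URI)
results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[Safety()],
    model_id="models:/my-app/1",
)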

Data formats

For direct evaluation

| Field | Required | Description |
| --- | --- | --- |
| inputs | Yes | Dictionary passed to your predict_fn |
| expectations | Optional | Ground truth for scorers |

For answer sheet evaluation

Option A - Provide inputs and outputs:

| Field | Required | Description |
| --- | --- | --- |
| inputs | Yes | Original inputs to your GenAI app |
| outputs | Yes | Pre-computed outputs from your app |
| expectations | Optional | Ground truth for scorers |

Option B - Provide existing traces:

| Field | Required | Description |
| --- | --- | --- |
| trace | Yes | MLflow Trace objects with inputs/outputs |
| expectations | Optional | Ground truth for scorers |

Common data input patterns

Evaluate using an MLflow evaluation dataset

MLflow evaluation datasets provide versioning, lineage tracking, and integration with Unity Catalog for production-ready evaluation. They are useful when you need version control and lineage tracking for your evaluation data, or when you need to convert traces to evaluation records.

Python
import mlflow
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Load versioned evaluation dataset
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.eval_dataset_name")

# Run evaluation
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

To create datasets from traces or from scratch, see Build evaluation datasets.
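A minimal sketch, assuming the create_dataset and merge_records APIs from mlflow.genai.datasets (the catalog, schema, and table names are placeholders; see Build evaluation datasets for the exact signatures):

Python
import mlflow.genai.datasets

# Assumed API: create a Unity Catalog-backed evaluation dataset
dataset = mlflow.genai.datasets.create_dataset("catalog.schema.eval_dataset_name")

# Add records that follow the evaluation dataset schema
dataset.merge_records([
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["open-source platform"]},
    },
])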

Evaluate using a list of dictionaries

Use a simple list of dictionaries for quick prototyping without creating a formal evaluation dataset. This approach is useful for small datasets (fewer than 100 examples) and informal development testing.

Python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Define test data as a list of dictionaries
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["open-source platform", "ML lifecycle management"]},
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {"expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]},
    },
    {
        "inputs": {"question": "What are MLflow's main components?"},
        "expectations": {"expected_facts": ["Tracking", "Projects", "Models", "Registry"]},
    },
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

For production, convert to an MLflow Evaluation Dataset.

Evaluate using a Pandas DataFrame

Use Pandas DataFrames for evaluation when working with CSV files or existing data science workflows. This approach is useful for quick prototyping, small datasets (fewer than 100 examples), and informal development testing.

Python
import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Create evaluation data as a Pandas DataFrame
eval_df = pd.DataFrame([
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open-source platform for ML lifecycle management"},
    },
    {
        "inputs": {"question": "How do I log metrics?"},
        "expectations": {"expected_response": "Use mlflow.log_metric() to log metrics"},
    },
])

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

Evaluate using a Spark DataFrame

Use Spark DataFrames for large-scale evaluations, when the data already exists in Delta Lake or Unity Catalog, or when you need to filter the records of an MLflow evaluation dataset before running the evaluation.

The DataFrame must comply with the evaluation dataset schema.

Python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Load evaluation data from a Delta table in Unity Catalog
eval_df = spark.table("catalog.schema.evaluation_data")

# Or load from any Spark-compatible source
# eval_df = spark.read.parquet("path/to/evaluation/data")

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Safety(), RelevanceToQuery()],
)

Common predict_fn patterns

Call your app directly

Pass your app directly as predict_fn when its parameter names match the input keys in your evaluation dataset.

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app that accepts 'question' as a parameter
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    response = f"I can help you with: {question}"
    return {"response": response}

# Evaluation data with 'question' key matching the function parameter
eval_data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "How do I track experiments?"}},
]

# Pass your app directly since parameter names match
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_chatbot_app,  # Direct reference, no wrapper needed
    scorers=[RelevanceToQuery(), Safety()],
)

Wrap your app in a callable

If your app expects different parameter names or data structures than your evaluation dataset's inputs, wrap it in a callable function. This is useful when your app's parameter names differ from the evaluation dataset's input keys (for example, user_input vs. question), or when data format conversions are required (for example, string to list or JSON parsing).

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your existing GenAI app with different parameter names
@mlflow.trace
def customer_support_bot(user_message: str, chat_history: list = None) -> dict:
    # Your app logic here
    context = f"History: {chat_history}" if chat_history else "New conversation"
    return {
        "bot_response": f"Helping with: {user_message}. {context}",
        "confidence": 0.95,
    }

# Wrapper function to translate evaluation data to your app's interface
def evaluate_support_bot(question: str, history: str = None) -> dict:
    # Convert evaluation dataset format to your app's expected format
    chat_history = history.split("|") if history else []

    # Call your app with the translated parameters
    result = customer_support_bot(
        user_message=question,
        chat_history=chat_history,
    )

    # Translate output to standard format if needed
    return {
        "response": result["bot_response"],
        "confidence_score": result["confidence"],
    }

# Evaluation data with different key names
eval_data = [
    {"inputs": {"question": "Reset password", "history": "logged in|forgot email"}},
    {"inputs": {"question": "Track my order"}},
]

# Use the wrapper function for evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=evaluate_support_bot,  # Wrapper handles translation
    scorers=[RelevanceToQuery(), Safety()],
)

Evaluate a deployed endpoint

Use the to_predict_fn function to evaluate endpoints built with Mosaic AI Agent Framework, Model Serving chat endpoints, and custom endpoints.

This function creates a predict function that's compatible with those endpoints and automatically extracts traces from tracing-enabled endpoints for full observability.

note

The to_predict_fn function performs a kwargs pass-through directly to your endpoint. Your evaluation data must match the input format that your endpoint expects. If the formats don't match, the evaluation fails with an error message about unrecognized input keys.

Model Serving chat endpoints require data that is formatted with the messages key.

Python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

# Create predict function for a chat endpoint
predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-chatbot-endpoint")

# Evaluate the chat endpoint
results = mlflow.genai.evaluate(
    data=[{"inputs": {"messages": [{"role": "user", "content": "How does MLflow work?"}]}}],
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery()],
)

Evaluate a logged model

Wrap logged MLflow models to translate between evaluation's named parameters and the model's single-parameter interface.

Most logged models (such as those using PyFunc or logging flavors like LangChain) accept a single input parameter (for example, model_inputs for PyFunc), while predict_fn expects named parameters that correspond to the keys in your evaluation dataset.

Python
import mlflow
from mlflow.genai.scorers import Safety

# Load your logged model outside of predict_fn so MLflow loads it only once
model = mlflow.pyfunc.load_model("models:/chatbot/staging")

def evaluate_model(question: str) -> dict:
    return model.predict({"question": question})

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "Tell me about MLflow"}}],
    predict_fn=evaluate_model,
    scorers=[Safety()],
)

Next steps