Evaluation harness
The mlflow.genai.evaluate() function systematically tests GenAI app quality by running the app against test data (evaluation datasets) and applying scorers.
Quick reference
For details, see mlflow.genai.evaluate().
Parameter | Type | Description |
---|---|---|
data | MLflow EvaluationDataset, List[Dict], Pandas DataFrame, or Spark DataFrame | Test data |
predict_fn | Callable | Your app (direct evaluation only) |
scorers | List[Scorer] | Quality metrics |
model_id | str | Optional version tracking |
How it works
- Runs your app on test inputs, capturing traces.
- Applies scorers to assess quality, creating Feedback.
- Stores results in an Evaluation Run.
Prerequisites
- Install MLflow and the required packages:
  pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
- Create an MLflow experiment by following the set up your environment quickstart.
Evaluation modes
There are two evaluation modes:
- (Recommended) Direct evaluation. MLflow calls your app directly to generate traces for evaluation.
- Answer sheet evaluation. You provide pre-computed outputs or existing traces for evaluation.
Direct evaluation (recommended)
MLflow calls your GenAI app directly to generate and evaluate traces. You can either pass your application's entry point wrapped in a Python function (predict_fn) or, if your app is deployed as a Databricks Model Serving endpoint, pass that endpoint wrapped in to_predict_fn.
Because your app is called directly, the resulting traces are identical to the traces generated in production, so you can reuse the scorers you define for offline evaluation in production monitoring.
The following code shows an example of how to run the evaluation:
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety
# Your GenAI app with MLflow tracing
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
# Your app logic here
if "MLflow" in question:
response = "MLflow is an open-source platform for managing ML and GenAI workflows."
else:
response = "I can help you with MLflow questions."
return {"response": response}
# Evaluate your app
results = mlflow.genai.evaluate(
data=[
{"inputs": {"question": "What is MLflow?"}},
{"inputs": {"question": "How do I get started?"}}
],
predict_fn=my_chatbot_app,
scorers=[RelevanceToQuery(), Safety()]
)
You can then view the results in the MLflow UI.
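You can also inspect the returned result object programmatically. The following is a minimal sketch; the metrics and run_id attributes come from the EvaluationResult return type shown in Key parameters below.
# Aggregate scorer metrics, keyed by scorer name
print(results.metrics)

# ID of the MLflow run that stores the evaluation results
print(results.run_id)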
Answer sheet evaluation
When you can't run your GenAI app directly, you can provide existing traces or pre-computed outputs for evaluation. Example use cases include testing outputs from external systems, evaluating historical traces, and comparing outputs across different platforms.
If the answer sheet produces traces that differ from the traces your production environment generates, you may need to rewrite your scorer functions before reusing them for production monitoring.
Example using inputs and outputs
The following code shows an example of how to run the evaluation:
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
# Pre-computed results from your GenAI app
results_data = [
{
"inputs": {"question": "What is MLflow?"},
"outputs": {"response": "MLflow is an open-source platform for managing machine learning workflows, including tracking experiments, packaging code, and deploying models."},
},
{
"inputs": {"question": "How do I get started?"},
"outputs": {"response": "To get started with MLflow, install it using 'pip install mlflow' and then run 'mlflow ui' to launch the web interface."},
}
]
# Evaluate pre-computed outputs
evaluation = mlflow.genai.evaluate(
data=results_data,
scorers=[Safety(), RelevanceToQuery()]
)
You can then view the results in the MLflow UI.
Example using existing traces
The following code shows how to run the evaluation using existing traces:
import mlflow
# Retrieve traces from production
traces = mlflow.search_traces(
filter_string="trace.status = 'OK'",
)
# Evaluate the retrieved traces
evaluation = mlflow.genai.evaluate(
data=traces,
scorers=[Safety(), RelevanceToQuery()]
)
Key parameters
def mlflow.genai.evaluate(
data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],
scorers: list[mlflow.genai.scorers.Scorer],
predict_fn: Optional[Callable[..., Any]] = None,
model_id: Optional[str] = None,
) -> mlflow.models.evaluation.base.EvaluationResult:
data
The evaluation dataset must be in one of the following formats:
- EvaluationDataset (recommended)
- List of dictionaries, Pandas DataFrame, or Spark DataFrame
If the data argument is provided as a DataFrame or a list of dictionaries, it must conform to the following schema, which is the same schema used by EvaluationDataset. Databricks recommends using an EvaluationDataset because it enforces schema validation and tracks the lineage of each record.
Field | Data type | Description | Use with direct evaluation | Use with answer sheet |
---|---|---|---|---|
inputs | A dict[str, Any] | Inputs to your GenAI app (for example, a user question). | Required | Either provide inputs and outputs, or provide trace; derived from trace if omitted. |
outputs | A dict[str, Any] | Outputs produced by your GenAI app. | Must not be provided; generated by MLflow from the trace. | Either provide inputs and outputs, or provide trace; derived from trace if omitted. |
expectations | A dict[str, Any] | Ground truth labels for the request, used by scorers. | Optional | Optional |
trace | An MLflow Trace | The trace object for the request. If the trace is provided, inputs and outputs are derived from it. | Must not be provided; generated by MLflow from the trace. | Either provide trace, or provide inputs and outputs. |
predict_fn
The GenAI app's entry point. This parameter is only used with direct evaluation. predict_fn must meet the following requirements:
- Accept the keys from the inputs dictionary in data as keyword arguments.
- Return a JSON-serializable dictionary.
- Be instrumented with MLflow Tracing.
- Emit exactly one trace per call.
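For illustration, a minimal function that satisfies all four requirements might look like the following sketch (the function name and body are placeholders):
import mlflow

@mlflow.trace  # instrumented with MLflow Tracing; one trace is emitted per call
def my_predict_fn(question: str) -> dict:
    # `question` matches a key in the `inputs` dictionaries of your data
    answer = f"You asked: {question}"  # placeholder app logic
    return {"response": answer}  # JSON-serializable dictionary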
scorers
List of quality metrics to apply. You can provide built-in scorers, custom scorers, or a mix of both. See Scorers for more details.
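For example, a custom scorer can be defined with the scorer decorator and passed alongside built-in scorers. The sketch below assumes a scorer that only inspects the app's outputs; the check itself is hypothetical, and my_chatbot_app is the traced app from the direct evaluation example above.
import mlflow
from mlflow.genai.scorers import Safety, scorer

@scorer
def mentions_mlflow(outputs) -> bool:
    # Hypothetical pass/fail check on the app's response
    return "MLflow" in outputs.get("response", "")

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_chatbot_app,  # traced app from the direct evaluation example
    scorers=[mentions_mlflow, Safety()],
)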
model_id
Optional model identifier to link results to your app version (for example, "models:/my-app/1"). See Version Tracking for more details.
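For example, to attribute an evaluation run to a specific app version (the model name and version below are placeholders, and my_chatbot_app is the traced app from the direct evaluation example above):
import mlflow
from mlflow.genai.scorers import Safety

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_chatbot_app,  # traced app from the direct evaluation example
    scorers=[Safety()],
    model_id="models:/my-app/1",  # placeholder version identifier
)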
Data formats
For direct evaluation
Field | Required | Description |
---|---|---|
inputs | ✅ | Dictionary passed to your predict_fn as keyword arguments |
expectations | Optional | Ground truth for scorers |
For answer sheet evaluation
Option A - Provide inputs and outputs:
Field | Required | Description |
---|---|---|
inputs | ✅ | Original inputs to your GenAI app |
outputs | ✅ | Pre-computed outputs from your app |
expectations | Optional | Ground truth for scorers |
Option B - Provide existing traces:
Field | Required | Description |
---|---|---|
trace | ✅ | MLflow Trace objects with inputs/outputs |
expectations | Optional | Ground truth for scorers |
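For example, the following sketch attaches expectations to traces retrieved with mlflow.search_traces before evaluating them. It assumes search_traces returns a pandas DataFrame with a trace column, and the ground-truth values are hypothetical.
import mlflow
from mlflow.genai.scorers import Correctness

# Retrieve existing traces as a pandas DataFrame
traces_df = mlflow.search_traces(filter_string="trace.status = 'OK'", max_results=50)

# Attach optional ground truth for scorers that need it (hypothetical values)
traces_df["expectations"] = [
    {"expected_facts": ["MLflow is an open-source platform"]}
    for _ in range(len(traces_df))
]

evaluation = mlflow.genai.evaluate(
    data=traces_df,
    scorers=[Correctness()],
)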
Common data input patterns
Evaluate using an MLflow Evaluation Dataset (recommended)
MLflow Evaluation Datasets provide versioning, lineage tracking, and integration with Unity Catalog for production-ready evaluation. They are useful when you need version control and lineage tracking for your evaluation data, and when you need to convert traces to evaluation records.
import mlflow
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent # Your GenAI app with tracing
# Load versioned evaluation dataset
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.eval_dataset_name")
# Run evaluation
results = mlflow.genai.evaluate(
data=dataset,
predict_fn=agent,
scorers=[Correctness(), Safety()],
)
To create datasets from traces or from scratch, see Build evaluation datasets.
Evaluate using a list of dictionaries
Use a simple list of dictionaries for quick prototyping without creating a formal evaluation dataset. This approach works well for small datasets (fewer than 100 examples) and informal development testing.
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from my_app import agent # Your GenAI app with tracing
# Define test data as a list of dictionaries
eval_data = [
{
"inputs": {"question": "What is MLflow?"},
"expectations": {"expected_facts": ["open-source platform", "ML lifecycle management"]}
},
{
"inputs": {"question": "How do I track experiments?"},
"expectations": {"expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]}
},
{
"inputs": {"question": "What are MLflow's main components?"},
"expectations": {"expected_facts": ["Tracking", "Projects", "Models", "Registry"]}
}
]
# Run evaluation
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=agent,
scorers=[Correctness(), RelevanceToQuery()],
)
For production, convert to an MLflow Evaluation Dataset.
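The following is a minimal sketch of that conversion, assuming the create_dataset and merge_records helpers in mlflow.genai.datasets; see Build evaluation datasets for the authoritative workflow. The Unity Catalog table name is a placeholder, and eval_data is the list defined above.
import mlflow.genai.datasets

# Create a Unity Catalog-backed evaluation dataset (table name is a placeholder)
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_dataset_name",
)

# Merge the ad-hoc records defined above into the managed dataset
dataset.merge_records(eval_data)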
Evaluate using a Pandas DataFrame
Use Pandas DataFrames for evaluation when working with CSV files or existing data science workflows. This is useful for quick prototyping, small datasets (fewer than 100 examples), and informal development testing.
import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent # Your GenAI app with tracing
# Create evaluation data as a Pandas DataFrame
eval_df = pd.DataFrame([
{
"inputs": {"question": "What is MLflow?"},
"expectations": {"expected_response": "MLflow is an open-source platform for ML lifecycle management"}
},
{
"inputs": {"question": "How do I log metrics?"},
"expectations": {"expected_response": "Use mlflow.log_metric() to log metrics"}
}
])
# Run evaluation
results = mlflow.genai.evaluate(
data=eval_df,
predict_fn=agent,
scorers=[Correctness(), Safety()],
)
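If your test cases start as a flat CSV file, convert the columns into the nested inputs and expectations dictionaries before evaluating. The following is a sketch; the file path and column names (question, expected_response) are hypothetical.
import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness
from my_app import agent  # Your GenAI app with tracing

# Flat CSV with hypothetical columns: question, expected_response
raw_df = pd.read_csv("eval_cases.csv")

# Build the nested schema expected by mlflow.genai.evaluate()
eval_df = pd.DataFrame({
    "inputs": raw_df["question"].map(lambda q: {"question": q}),
    "expectations": raw_df["expected_response"].map(lambda r: {"expected_response": r}),
})

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Correctness()],
)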
Evaluate using a Spark DataFrame
Use Spark DataFrames for large-scale evaluations or when data is already in Delta Lake or Unity Catalog. This is useful when the data already exists in Delta Lake or Unity Catalog, or if you need to filter the records in an MLflow Evaluation Dataset before running the evaluation.
The DataFrame must comply with the evaluation dataset schema.
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
from my_app import agent # Your GenAI app with tracing
# Load evaluation data from a Delta table in Unity Catalog
eval_df = spark.table("catalog.schema.evaluation_data")
# Or load from any Spark-compatible source
# eval_df = spark.read.parquet("path/to/evaluation/data")
# Run evaluation
results = mlflow.genai.evaluate(
data=eval_df,
predict_fn=agent,
scorers=[Safety(), RelevanceToQuery()],
)
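To evaluate only a slice of a large table, filter the Spark DataFrame before passing it to evaluate(). The following sketch assumes the table stores the inputs and expectations columns of the evaluation schema; the topic column used in the filter is hypothetical.
import mlflow
from mlflow.genai.scorers import Safety
from my_app import agent  # Your GenAI app with tracing

# Filter to the records you want to evaluate (hypothetical `topic` column),
# then keep only the evaluation schema columns
eval_df = (
    spark.table("catalog.schema.evaluation_data")
    .where("topic = 'billing'")
    .select("inputs", "expectations")
)

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Safety()],
)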
Common predict_fn patterns
Call your app directly
Pass your app directly as predict_fn when its parameter names match the keys of the inputs dictionaries in your evaluation dataset.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety
# Your GenAI app that accepts 'question' as a parameter
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
# Your app logic here
response = f"I can help you with: {question}"
return {"response": response}
# Evaluation data with 'question' key matching the function parameter
eval_data = [
{"inputs": {"question": "What is MLflow?"}},
{"inputs": {"question": "How do I track experiments?"}}
]
# Pass your app directly since parameter names match
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_chatbot_app, # Direct reference, no wrapper needed
scorers=[RelevanceToQuery(), Safety()]
)
Wrap your app in a callable
If your app expects different parameter names or data structures than your evaluation dataset's inputs, wrap it in a callable function. This is useful when there are parameter name mismatches between your app's parameters and the evaluation dataset's input keys (for example, user_input vs. question), or when data format conversions are required (for example, string to list or JSON parsing).
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety
# Your existing GenAI app with different parameter names
@mlflow.trace
def customer_support_bot(user_message: str, chat_history: list = None) -> dict:
# Your app logic here
context = f"History: {chat_history}" if chat_history else "New conversation"
return {
"bot_response": f"Helping with: {user_message}. {context}",
"confidence": 0.95
}
# Wrapper function to translate evaluation data to your app's interface
def evaluate_support_bot(question: str, history: str = None) -> dict:
# Convert evaluation dataset format to your app's expected format
chat_history = history.split("|") if history else []
# Call your app with the translated parameters
result = customer_support_bot(
user_message=question,
chat_history=chat_history
)
# Translate output to standard format if needed
return {
"response": result["bot_response"],
"confidence_score": result["confidence"]
}
# Evaluation data with different key names
eval_data = [
{"inputs": {"question": "Reset password", "history": "logged in|forgot email"}},
{"inputs": {"question": "Track my order"}}
]
# Use the wrapper function for evaluation
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=evaluate_support_bot, # Wrapper handles translation
scorers=[RelevanceToQuery(), Safety()]
)
Evaluate a deployed endpoint
Use the to_predict_fn function to evaluate Mosaic AI Agent Framework endpoints, Model Serving chat endpoints, and custom endpoints.
This function creates a predict function that's compatible with those endpoints and automatically extracts traces from tracing-enabled endpoints for full observability.
The to_predict_fn function performs a kwargs pass-through directly to your endpoint, so your evaluation data must match the input format that your endpoint expects. If the formats don't match, the evaluation fails with an error message about unrecognized input keys.
- Model Serving chat
- Agent Framework
- Custom endpoint
Model Serving chat endpoints require data formatted with the messages key.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery
# Create predict function for a chat endpoint
predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-chatbot-endpoint")
# Evaluate the chat endpoint
results = mlflow.genai.evaluate(
data=[{"inputs": {"messages": [{"role": "user", "content": "How does MLflow work?"}]}}],
predict_fn=predict_fn,
scorers=[RelevanceToQuery()]
)
Agent Framework endpoints can have different input interfaces. The following example uses an input key:
import mlflow
from mlflow.genai.scorers import RelevanceToQuery
# Create a predict function for a Knowledge Assistant agent endpoint
predict_fn = mlflow.genai.to_predict_fn("endpoints:/ka-56a301ab-endpoint")
# Evaluate the agent endpoint
results = mlflow.genai.evaluate(
data=[{"inputs": {"input": [{"role": "user", "content": "How do I use the Models from Code feature in MLflow?"}]}}],
predict_fn=predict_fn,
scorers=[RelevanceToQuery()]
)
Custom endpoints might have entirely different access patterns for submitting data. Ensure that the data input format is compatible with the endpoint being used for evaluation.
If your evaluation data format is incompatible with your endpoint, wrap the model's interface. A translation layer can ensure that the correct payload is submitted to the evaluation endpoint.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery
# Create the endpoint predict function once, outside the wrapper
endpoint_predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-custom-endpoint")

def custom_predict_fn(messages: list, context: str = "") -> dict:
    # Transform the evaluation inputs to match your endpoint's expected format.
    # For example, if your endpoint expects a 'query' key instead of 'messages':
    transformed_inputs = {
        "query": messages[0]["content"],
        "context": context,
    }
    # Call your endpoint with the transformed inputs passed as keyword arguments
    return endpoint_predict_fn(**transformed_inputs)
# Use your wrapper function for evaluation
results = mlflow.genai.evaluate(
data=[{"inputs": {"messages": [{"role": "user", "content": "What is machine learning?"}], "context": "technical documentation"}}],
predict_fn=custom_predict_fn,
scorers=[RelevanceToQuery()]
)
Evaluate a logged model
Wrap logged MLflow models to translate between the evaluation harness's named parameters and the model's single-parameter interface.
Most logged models (such as PyFunc models or models logged with flavors like LangChain) accept a single input parameter (for example, model_inputs for PyFunc), while predict_fn expects named parameters that correspond to the keys in your evaluation dataset.
import mlflow
from mlflow.genai.scorers import Safety
# Make sure to load your logged model outside of the predict_fn so MLflow only loads it once!
model = mlflow.pyfunc.load_model("models:/chatbot/staging")
def evaluate_model(question: str) -> dict:
return model.predict({"question": question})
results = mlflow.genai.evaluate(
data=[{"inputs": {"question": "Tell me about MLflow"}}],
predict_fn=evaluate_model,
scorers=[Safety()]
)
Next steps
- Evaluate your app - Step-by-step guide to running your first evaluation.
- Build evaluation datasets - Create structured test data from production logs or scratch.
- Define custom scorers - Build metrics tailored to your specific use case.