Evaluate conversations

Conversation evaluation enables you to assess entire conversation sessions rather than individual turns. This is essential for evaluating conversational AI systems where quality emerges over multiple interactions, such as user frustration patterns, conversation completeness, or overall dialogue coherence.

Multi-turn judges can be used both for offline evaluation during development (as described on this page) and for continuous monitoring in production.

note

Multi-turn evaluation is experimental. The API and behavior might change in future releases.

Prerequisites

Install MLflow 3.10.0 or later:

Bash
pip install --upgrade 'mlflow[databricks]>=3.10'

Your agent must be instrumented to track session IDs on traces. See Track users and sessions for how to set session IDs on your traces.

Two approaches

MLflow supports two approaches for evaluating conversations:

Evaluate pre-generated conversations: Evaluate existing conversations that have already been traced. Use this approach when you have:
- Production conversation data to analyze
- Pre-recorded test conversations from QA or user studies
- Conversations from a previous agent version for comparison
Simulate conversations during evaluation: Generate new conversations by simulating user interactions with your agent. Use this approach when you want to:
- Test a new agent version systematically with consistent scenarios
- Generate diverse test scenarios at scale
- Stress-test your agent with specific user behaviors and edge cases

Overview

Traditional single-turn evaluation assesses each agent response independently. However, conversational agents require evaluation at the session level to capture:

User frustration: Did the user become frustrated? Was it resolved?
Conversation completeness: Were all user questions answered by the end of the conversation?
Knowledge retention: Does the agent remember information from earlier in the conversation?
Dialogue coherence: Does the conversation flow naturally?

Multi-turn evaluation addresses these needs by grouping traces into conversation sessions and applying judges that analyze the entire conversation history.

Evaluate pre-generated conversations

Evaluate conversations that have already been traced. This is useful for evaluating production data or pre-recorded test conversations.

Step 1: Tag traces with session IDs

When building your agent, set session IDs on traces to group them into conversations:

Python
import mlflow

@mlflow.trace
def my_chatbot(question, session_id):
    mlflow.update_current_trace(
        tags={"mlflow.trace.session": session_id}
    )
    # ... your chatbot logic

For complete documentation on tracking sessions, see Track users and sessions.

Step 2: Retrieve and evaluate sessions

Get traces from your experiment and pass them to mlflow.genai.evaluate. MLflow automatically groups traces by session ID:

Python
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

# Get traces from your experiment
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    return_type="list",
)

# Evaluate the conversations
# MLflow automatically groups traces by their session ID tag
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        ConversationCompleteness(),  # Did the agent answer all questions?
        UserFrustration(),           # Did the user become frustrated?
    ],
)

You can also retrieve complete sessions directly using mlflow.search_sessions:

Python
import mlflow

# Get complete sessions (each session is a list of traces)
sessions = mlflow.search_sessions(
    locations=["<your-experiment-id>"],
    max_results=50,
)

# Flatten for evaluation
all_traces = [trace for session in sessions for trace in session]

results = mlflow.genai.evaluate(
    data=all_traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)

Simulate conversations during evaluation

Generate new conversations by simulating user interactions. This enables testing different agent versions with consistent goals and personas.

Python
import mlflow
from mlflow.genai.simulators import ConversationSimulator
from mlflow.genai.scorers import ConversationCompleteness, Safety

# Define test scenarios
simulator = ConversationSimulator(
    test_cases=[
        {"goal": "Successfully set up experiment tracking"},
        {"goal": "Identify the root cause of a deployment error"},
        {
            "goal": "Understand how to implement model versioning",
            "persona": "You are a beginner who needs detailed explanations",
        },
    ],
    max_turns=5,
)


# Your agent's predict function
def predict_fn(input: list[dict], **kwargs) -> str:
    # input is the conversation history
    response = your_agent.chat(input)
    return response


# Simulate conversations and evaluate
results = mlflow.genai.evaluate(
    data=simulator,
    predict_fn=predict_fn,
    scorers=[
        ConversationCompleteness(),
        Safety(),
    ],
)

For complete documentation on conversation simulation, including test case definition, predict function interfaces, and configuration options, see Conversation simulation.

Multi-turn judges

Built-in judges

MLflow provides built-in multi-turn judges for evaluating conversation quality. For the complete list and detailed documentation, see the MLflow predefined scorers documentation and the Scorers and LLM judges page.

Custom judges

Create custom multi-turn judges using make_judge. Use the {{ conversation }} template variable to access the full conversation history:

Python
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a custom multi-turn judge
politeness_judge = make_judge(
    name="politeness",
    instructions=(
        "Evaluate whether the assistant maintained a polite and professional "
        "tone throughout this conversation:\n\n{{ conversation }}\n\n"
        "Rate as 'consistently_polite', 'mostly_polite', or 'impolite'."
    ),
    feedback_value_type=Literal["consistently_polite", "mostly_polite", "impolite"],
)

# Get traces from your experiment
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    return_type="list",
)

# Use in evaluation
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[politeness_judge],
)

The {{ conversation }} variable injects the complete conversation history in a readable format for the judge to analyze.

note

The {{ conversation }} variable can only be used with {{ expectations }}, not with {{ inputs }}, {{ outputs }}, or {{ trace }}.

How assessments are stored

Multi-turn assessments are stored on the first trace (chronologically) in each session. This design ensures:

Assessments remain stable even as new turns are added to a conversation
You can easily find conversation-level assessments by looking at session start traces
The Sessions UI can efficiently display conversation metrics

Assessments include metadata identifying them as conversation-level:

session_id: The session ID linking the assessment to the full conversation

Working with specific sessions

To evaluate a specific session, use mlflow.search_traces with a filter string:

Python
import mlflow
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

# Get traces for a specific session using filter
traces = mlflow.search_traces(
    filter_string="tags.`mlflow.trace.session` = '<your-session-id>'",
    return_type="list",
)

# Evaluate the session
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness(), UserFrustration()],
)

Next steps

Monitor conversations in production - Use multi-turn judges for continuous production monitoring.
Conversation Simulation - Generate synthetic conversations to test your agent with diverse scenarios and user behaviors.
Predefined scorers - Complete reference for all built-in single-turn and multi-turn scorers.
Custom judges - Build custom LLM judges using make_judge to evaluate conversation-specific criteria.
Track users and sessions - Learn how to instrument your agent with session IDs.

Prerequisites​

Two approaches​

Overview​

Evaluate pre-generated conversations​

Step 1: Tag traces with session IDs​

Step 2: Retrieve and evaluate sessions​

Simulate conversations during evaluation​

Multi-turn judges​

Built-in judges​

Custom judges​

How assessments are stored​

Working with specific sessions​

Next steps​