DeepEval scorers

DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety evaluation. MLflow integrates with DeepEval so that you can use DeepEval metrics as scorers.

Requirements

Install the deepeval package:

Python
%pip install deepeval

Quick start

To call a DeepEval scorer directly:

Python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini")
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source AI engineering platform for agents and LLMs.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # e.g., 0.85
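The "yes"/"no" value is derived from the numeric score and the threshold you pass to the scorer. As a plain-Python sketch of that mapping (this helper is illustrative only, not part of the MLflow or DeepEval API; it assumes DeepEval's usual score-at-or-above-threshold pass convention):

```python
def to_pass_fail(score: float, threshold: float) -> str:
    # A score at or above the threshold counts as a pass ("yes"),
    # anything below it as a fail ("no").
    return "yes" if score >= threshold else "no"

print(to_pass_fail(0.85, 0.7))  # "yes"
print(to_pass_fail(0.60, 0.7))  # "no"
```

With threshold=0.7, the score of 0.85 above therefore yields a value of "yes".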

To run DeepEval scorers over a dataset with mlflow.genai.evaluate():

Python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source AI engineering platform for agents and LLMs.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini"),
        Faithfulness(threshold=0.8, model="databricks:/databricks-gpt-5-mini"),
    ],
)

Available DeepEval scorers

RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| AnswerRelevancy | Is the output relevant to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
| ContextualRecall | Does the retrieval context contain all the necessary information? | Link |
| ContextualPrecision | Are relevant nodes ranked higher than irrelevant ones? | Link |
| ContextualRelevancy | Is the retrieval context relevant to the query? | Link |

Agentic metrics

These scorers evaluate AI agent behavior, including task completion and tool usage.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| TaskCompletion | Does the agent successfully complete its assigned task? | Link |
| ToolCorrectness | Does the agent use the correct tools? | Link |
| ArgumentCorrectness | Are tool arguments correct? | Link |
| StepEfficiency | Does the agent take an optimal path? | Link |
| PlanAdherence | Does the agent follow its plan? | Link |
| PlanQuality | Is the agent's plan well-structured? | Link |

Conversational metrics

These scorers evaluate multi-turn conversational AI quality.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| TurnRelevancy | Is each turn relevant to the conversation? | Link |
| RoleAdherence | Does the assistant maintain its assigned role? | Link |
| KnowledgeRetention | Does the agent retain information across turns? | Link |
| ConversationCompleteness | Are all user questions addressed? | Link |
| GoalAccuracy | Does the conversation achieve its goal? | Link |
| ToolUse | Does the agent use tools appropriately in conversation? | Link |
| TopicAdherence | Does the conversation stay on topic? | Link |

Safety metrics

These scorers evaluate the safety and responsibility of model outputs.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| Bias | Does the output contain biased content? | Link |
| Toxicity | Does the output contain toxic language? | Link |
| NonAdvice | Does the model inappropriately provide advice in restricted domains? | Link |
| Misuse | Could the output be used for harmful purposes? | Link |
| PIILeakage | Does the output leak personally identifiable information? | Link |
| RoleViolation | Does the assistant break out of its assigned role? | Link |

Other metrics

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| Hallucination | Does the LLM fabricate information not in the context? | Link |
| Summarization | Is the summary accurate and complete? | Link |
| JsonCorrectness | Does the JSON output match the expected schema? | Link |
| PromptAlignment | Does the output align with prompt instructions? | Link |

Non-LLM metrics

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| ExactMatch | Does the output exactly match the expected output? | Link |
| PatternMatch | Does the output match a regex pattern? | Link |
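As a rough intuition for these two deterministic scorers, ExactMatch is a string-equality check and PatternMatch is a regex test. In plain Python terms (illustrative only, not the scorers' actual implementation; whether PatternMatch requires a full-string match or a substring match is an assumption here):

```python
import re


def exact_match(output: str, expected: str) -> bool:
    # Strict string equality between the output and the expected output.
    return output == expected


def pattern_match(output: str, pattern: str) -> bool:
    # True if the regex matches the entire output (assumed semantics;
    # the real scorer may use substring matching instead).
    return re.fullmatch(pattern, output) is not None


print(exact_match("MLflow", "MLflow"))      # True
print(pattern_match("run_42", r"run_\d+"))  # True
```

Because neither check calls an LLM, these scorers need no model parameter and run deterministically.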

Create a scorer by name

You can dynamically create a scorer using get_scorer by passing the metric name as a string:

Python
from mlflow.genai.scorers.deepeval import get_scorer

scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)

Configuration

DeepEval scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a model parameter.

Python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# LLM-based metric with common parameters
scorer = AnswerRelevancy(
    model="databricks:/databricks-gpt-5-mini",
    threshold=0.7,
    include_reason=True,
)

# Metric-specific parameters
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,
    strict_mode=True,
)

For metric-specific parameters and advanced usage options, see the DeepEval documentation.