DeepEval scorers

DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety evaluation. MLflow integrates with DeepEval so that you can use DeepEval metrics as scorers.

Requirements

Install the deepeval package:

Python
%pip install deepeval

Quick start

To call a DeepEval scorer directly:

Python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini")
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source AI engineering platform for agents and LLMs.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # e.g., 0.85
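The "yes"/"no" value is derived from the numeric score and the threshold you pass to the scorer. As a plain-Python sketch of that mapping (this helper is illustrative only, not part of the MLflow or DeepEval API; it assumes DeepEval's usual score-at-or-above-threshold pass convention):

```python
def to_pass_fail(score: float, threshold: float) -> str:
    # A score at or above the threshold counts as a pass ("yes"),
    # anything below it as a fail ("no").
    return "yes" if score >= threshold else "no"

print(to_pass_fail(0.85, 0.7))  # "yes"
print(to_pass_fail(0.60, 0.7))  # "no"
```

With threshold=0.7, the score of 0.85 above therefore yields a value of "yes".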

To run DeepEval scorers over a dataset with mlflow.genai.evaluate():

Python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source AI engineering platform for agents and LLMs.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini"),
        Faithfulness(threshold=0.8, model="databricks:/databricks-gpt-5-mini"),
    ],
)

Available DeepEval scorers

RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| AnswerRelevancy | Is the output relevant to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
| ContextualRecall | Does the retrieval context contain all the necessary information? | Link |
| ContextualPrecision | Are relevant nodes ranked higher than irrelevant ones? | Link |
| ContextualRelevancy | Is the retrieval context relevant to the query? | Link |

Agentic metrics

These scorers evaluate AI agent behavior, including task completion and tool usage.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| TaskCompletion | Does the agent successfully complete its assigned task? | Link |
| ToolCorrectness | Does the agent use the correct tools? | Link |
| ArgumentCorrectness | Are tool arguments correct? | Link |
| StepEfficiency | Does the agent take an optimal path? | Link |
| PlanAdherence | Does the agent follow its plan? | Link |
| PlanQuality | Is the agent's plan well-structured? | Link |

Conversational metrics

These scorers evaluate multi-turn conversational AI quality.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| TurnRelevancy | Is each turn relevant to the conversation? | Link |
| RoleAdherence | Does the assistant maintain its assigned role? | Link |
| KnowledgeRetention | Does the agent retain information across turns? | Link |
| ConversationCompleteness | Are all user questions addressed? | Link |
| GoalAccuracy | Does the conversation achieve its goal? | Link |
| ToolUse | Does the agent use tools appropriately in conversation? | Link |
| TopicAdherence | Does the conversation stay on topic? | Link |

Safety metrics

These scorers evaluate the safety and responsibility of model outputs.

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| Bias | Does the output contain biased content? | Link |
| Toxicity | Does the output contain toxic language? | Link |
| NonAdvice | Does the model inappropriately provide advice in restricted domains? | Link |
| Misuse | Could the output be used for harmful purposes? | Link |
| PIILeakage | Does the output leak personally identifiable information? | Link |
| RoleViolation | Does the assistant break out of its assigned role? | Link |

Other metrics

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| Hallucination | Does the LLM fabricate information not in the context? | Link |
| Summarization | Is the summary accurate and complete? | Link |
| JsonCorrectness | Does the JSON output match the expected schema? | Link |
| PromptAlignment | Does the output align with prompt instructions? | Link |

Non-LLM metrics

| Scorer | What does it evaluate? | DeepEval Docs |
| --- | --- | --- |
| ExactMatch | Does the output exactly match the expected output? | Link |
| PatternMatch | Does the output match a regex pattern? | Link |
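As a rough intuition for these two deterministic scorers, ExactMatch is a string-equality check and PatternMatch is a regex test. In plain Python terms (illustrative only, not the scorers' actual implementation; whether PatternMatch requires a full-string match or a substring match is an assumption here):

```python
import re


def exact_match(output: str, expected: str) -> bool:
    # Strict string equality between the output and the expected output.
    return output == expected


def pattern_match(output: str, pattern: str) -> bool:
    # True if the regex matches the entire output (assumed semantics;
    # the real scorer may use substring matching instead).
    return re.fullmatch(pattern, output) is not None


print(exact_match("MLflow", "MLflow"))      # True
print(pattern_match("run_42", r"run_\d+"))  # True
```

Because neither check calls an LLM, these scorers need no model parameter and run deterministically.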

Create a scorer by name

You can dynamically create a scorer using get_scorer by passing the metric name as a string:

Python
from mlflow.genai.scorers.deepeval import get_scorer

scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)

Configuration

DeepEval scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a model parameter.

Python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# LLM-based metric with common parameters
scorer = AnswerRelevancy(
    model="databricks:/databricks-gpt-5-mini",
    threshold=0.7,
    include_reason=True,
)

# Metric-specific parameters
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,
    strict_mode=True,
)

For metric-specific parameters and advanced usage options, see the DeepEval documentation.