# DeepEval scorers
DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety evaluation. MLflow integrates with DeepEval so that you can use DeepEval metrics as scorers.
## Requirements

Install the `deepeval` package:

```bash
%pip install deepeval
```
## Quick start

To call a DeepEval scorer directly:

```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini")

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source AI engineering platform for agents and LLMs.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # e.g. 0.85
```
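The `"yes"`/`"no"` value reflects whether the numeric score cleared the configured threshold. A minimal sketch of that mapping (illustrative only; the actual logic lives inside DeepEval):

```python
def feedback_value(score: float, threshold: float) -> str:
    """Map a numeric metric score to a pass/fail feedback value."""
    return "yes" if score >= threshold else "no"

print(feedback_value(0.85, 0.7))  # "yes": score meets the 0.7 threshold
print(feedback_value(0.55, 0.7))  # "no": score falls below the threshold
```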
To use DeepEval scorers with `mlflow.genai.evaluate()`:

```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source AI engineering platform for agents and LLMs.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini"),
        Faithfulness(threshold=0.8, model="databricks:/databricks-gpt-5-mini"),
    ],
)
```
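The evaluation dataset is a plain list of dictionaries, so it can be built programmatically from existing question/answer pairs. A small helper (hypothetical, not part of MLflow) that produces records in the shape shown above:

```python
def to_eval_records(qa_pairs):
    """Convert (question, answer) tuples into evaluate()-style records."""
    return [
        {"inputs": {"query": question}, "outputs": answer}
        for question, answer in qa_pairs
    ]

records = to_eval_records([
    ("What is MLflow?", "MLflow is an open-source AI engineering platform."),
])
print(records[0]["inputs"]["query"])  # "What is MLflow?"
```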
## Available DeepEval scorers
### RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? |
|---|---|
| `AnswerRelevancy` | Is the output relevant to the input query? |
| `Faithfulness` | Is the output factually consistent with the retrieval context? |
| `ContextualRecall` | Does the retrieval context contain all the necessary information? |
| `ContextualPrecision` | Are relevant nodes ranked higher than irrelevant ones? |
| `ContextualRelevancy` | Is the retrieval context relevant to the query? |
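To build intuition for what a recall-style RAG metric measures, here is a toy, non-LLM approximation: the fraction of expected statements that appear somewhere in the retrieved context. This is purely illustrative; the real DeepEval metrics use an LLM judge, not substring matching.

```python
def toy_contextual_recall(expected_statements, retrieval_context):
    """Fraction of expected statements supported by at least one context chunk."""
    supported = sum(
        any(stmt.lower() in chunk.lower() for chunk in retrieval_context)
        for stmt in expected_statements
    )
    return supported / len(expected_statements)

context = ["MLflow is an open-source platform.", "It supports experiment tracking."]
score = toy_contextual_recall(
    ["MLflow is an open-source platform", "It supports experiment tracking"],
    context,
)
print(score)  # 1.0: every expected statement is found in the context
```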
### Agentic metrics

These scorers evaluate AI agent behavior, including task completion and tool usage.

| Scorer | What does it evaluate? |
|---|---|
| `TaskCompletion` | Does the agent successfully complete its assigned task? |
| `ToolCorrectness` | Does the agent use the correct tools? |
| `ArgumentCorrectness` | Are tool arguments correct? |
| | Does the agent take an optimal path? |
| | Does the agent follow its plan? |
| | Is the agent's plan well-structured? |
### Conversational metrics

These scorers evaluate multi-turn conversational AI quality.

| Scorer | What does it evaluate? |
|---|---|
| `TurnRelevancy` | Is each turn relevant to the conversation? |
| `RoleAdherence` | Does the assistant maintain its assigned role? |
| `KnowledgeRetention` | Does the agent retain information across turns? |
| `ConversationCompleteness` | Are all user questions addressed? |
| | Does the conversation achieve its goal? |
| | Does the agent use tools appropriately in conversation? |
| | Does the conversation stay on topic? |
### Safety metrics

These scorers evaluate the safety and responsibility of model outputs.

| Scorer | What does it evaluate? |
|---|---|
| `Bias` | Does the output contain biased content? |
| `Toxicity` | Does the output contain toxic language? |
| `NonAdvice` | Does the model inappropriately provide advice in restricted domains? |
| `Misuse` | Could the output be used for harmful purposes? |
| `PIILeakage` | Does the output leak personally identifiable information? |
| `RoleViolation` | Does the assistant break out of its assigned role? |
### Other metrics

| Scorer | What does it evaluate? |
|---|---|
| `Hallucination` | Does the LLM fabricate information not in the context? |
| `Summarization` | Is the summary accurate and complete? |
| `JsonCorrectness` | Does the JSON output match the expected schema? |
| `PromptAlignment` | Does the output align with prompt instructions? |
### Non-LLM metrics

| Scorer | What does it evaluate? |
|---|---|
| `ExactMatch` | Does the output exactly match the expected output? |
| `PatternMatch` | Does the output match a regex pattern? |
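Unlike the LLM-judged metrics above, these scorers are deterministic string checks. Conceptually they reduce to something like the following sketch (an illustration of the idea, not the actual implementation):

```python
import re

def exact_match(output: str, expected: str) -> bool:
    """True when output and expected are identical after trimming whitespace."""
    return output.strip() == expected.strip()

def pattern_match(output: str, pattern: str) -> bool:
    """True when the regex pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None

print(exact_match("MLflow", "MLflow"))             # True
print(pattern_match("Run ID: abc123", r"abc\d+"))  # True
```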
## Create a scorer by name

You can create a scorer dynamically with `get_scorer` by passing the metric name as a string:

```python
from mlflow.genai.scorers.deepeval import get_scorer

scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="databricks:/databricks-gpt-5-mini",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)
```
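Name-based lookup is useful when the metric name comes from configuration rather than code. The dispatch can be pictured as a simple registry lookup; this is a sketch of the pattern, not MLflow's actual implementation, and the class here is a stand-in:

```python
class AnswerRelevancySketch:
    """Stand-in for a real scorer class; just records its constructor kwargs."""
    def __init__(self, **kwargs):
        self.params = kwargs

_REGISTRY = {"AnswerRelevancy": AnswerRelevancySketch}

def get_scorer_sketch(metric_name: str, **kwargs):
    """Look up a scorer class by name and instantiate it with kwargs."""
    try:
        cls = _REGISTRY[metric_name]
    except KeyError:
        raise ValueError(f"Unknown metric: {metric_name}")
    return cls(**kwargs)

scorer = get_scorer_sketch("AnswerRelevancy", threshold=0.7)
print(scorer.params["threshold"])  # 0.7
```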
## Configuration

DeepEval scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a `model` parameter.

```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# LLM-based metric with common parameters
scorer = AnswerRelevancy(
    model="databricks:/databricks-gpt-5-mini",
    threshold=0.7,
    include_reason=True,
)

# Metric-specific parameters
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,
    strict_mode=True,
)
```
For metric-specific parameters and advanced usage options, see the DeepEval documentation.
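Because every option is a plain constructor keyword argument, scorer configuration can live in data (for example, a config file) and be expanded with `**` at construction time. A sketch of that pattern, with illustrative names and a stand-in builder in place of a real scorer constructor:

```python
scorer_configs = [
    {"name": "AnswerRelevancy", "params": {"threshold": 0.7, "include_reason": True}},
    {"name": "TurnRelevancy", "params": {"threshold": 0.8, "window_size": 3}},
]

def build(name, **params):
    """Stand-in for a scorer constructor; just echoes what it was given."""
    return (name, params)

# Expand each config dict into keyword arguments
scorers = [build(cfg["name"], **cfg["params"]) for cfg in scorer_configs]
print(scorers[0][0])  # "AnswerRelevancy"
```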