
RAGAS scorers

RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework for LLM applications. MLflow integrates with RAGAS so that you can use RAGAS metrics as scorers for evaluating retrieval quality, answer generation, agent behavior, and text similarity.

Requirements

Install the ragas package:

```python
%pip install ragas
```

Quick start

To call a RAGAS scorer directly:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")

# `trace` is an existing MLflow trace, e.g. one returned by mlflow.search_traces()
feedback = scorer(trace=trace)

print(feedback.value)  # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```

To call RAGAS scorers using mlflow.genai.evaluate():

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="databricks:/databricks-gpt-5-mini"),
        ContextPrecision(model="databricks:/databricks-gpt-5-mini"),
    ],
)
```

Available RAGAS scorers

RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| ContextUtilization | How effectively is the retrieved context used in the answer? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM variant of context precision using reference answers. | Link |
| ContextRecall | Does the retrieved context contain all the information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM variant of context recall using reference answers. | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| AnswerRelevancy | How relevant is the generated answer to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieved context? | Link |
| AnswerAccuracy | How accurate is the answer compared to the ground truth? | Link |
| ContextRelevance | How relevant is the retrieved context to the input query? | Link |
| ResponseGroundedness | Is the response grounded in the provided context? | Link |

Agent and tool use metrics

These scorers evaluate AI agent behavior, including tool invocation accuracy and goal achievement.

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| TopicAdherence | Does the agent stay on topic during the conversation? | Link |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? | Link |
| ToolCallF1 | F1 score for tool-call prediction. | Link |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal, compared against a reference answer? | Link |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal, evaluated without a reference answer? | Link |
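ToolCallF1 is deterministic: it treats the agent's tool calls as a set-prediction problem and scores them with standard F1. The sketch below illustrates that computation in plain Python; the tool names are hypothetical, and the actual RAGAS implementation may also compare call parameters, not just tool names.

```python
# Conceptual illustration of the F1 computation behind ToolCallF1
# (standard F1 over sets; hypothetical tool-call names).

def tool_call_f1(expected: set, predicted: set) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if not expected or not predicted:
        return 0.0
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

expected = {"search_flights", "book_flight"}
predicted = {"search_flights", "get_weather"}
print(tool_call_f1(expected, predicted))  # 0.5: one of two expected calls made
```

Because the score is computed from set overlap alone, it needs no model parameter, unlike the LLM-judged agent metrics above.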

Natural language comparison

These scorers compare generated text against expected output using both semantic and deterministic methods.

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| FactualCorrectness | Is the output factually correct compared to the expected answer? | Link |
| SemanticSimilarity | Semantic similarity between the output and the expected answer. | Link |
| NonLLMStringSimilarity | String similarity between the output and the expected answer. | Link |
| BleuScore | BLEU score for text comparison. | Link |
| ChrfScore | chrF score for text comparison. | Link |
| RougeScore | ROUGE score for text comparison. | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does the output exactly match the expected output? | Link |
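The non-LLM scorers in this group are deterministic string comparisons. As a rough conceptual stand-in for what NonLLMStringSimilarity measures, a character-level similarity ratio can be computed with the standard library; `difflib` here is only an illustration, not necessarily the distance measure RAGAS uses internally.

```python
# Conceptual stand-in for a deterministic string-similarity scorer.
# Uses difflib's character-level match ratio (0.0 to 1.0).
from difflib import SequenceMatcher

def string_similarity(output: str, expected: str) -> float:
    """Return a similarity ratio between the output and expected text."""
    return SequenceMatcher(None, output, expected).ratio()

print(string_similarity("Paris is the capital of France",
                        "Paris is the capital of France"))  # 1.0 (identical)
```

Scorers like these run without any LLM calls, which makes them cheap and reproducible; the trade-off is that they cannot credit semantically equivalent but differently worded answers.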

General purpose

These scorers provide flexible, customizable evaluation logic.

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| AspectCritic | Evaluates specific aspects of the output using an LLM. | Link |
| DiscreteMetric | Custom discrete metric with flexible scoring logic. | Link |
| RubricsScore | Scores output against predefined rubrics. | Link |
| InstanceSpecificRubrics | Scores output against instance-specific rubrics. | Link |

Other tasks

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| SummarizationScore | Quality of text summarization. | Link |

Create a scorer by name

You can dynamically create a scorer using get_scorer by passing the metric name as a string:

```python
from mlflow.genai.scorers.ragas import get_scorer

scorer = get_scorer(
    metric_name="Faithfulness",
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(trace=trace)
```

Configuration

RAGAS scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a model parameter; non-LLM (deterministic) metrics do not.

```python
from mlflow.genai.scorers.ragas import Faithfulness, ExactMatch

# LLM-based metric: requires a model
scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")

# Non-LLM (deterministic) metric: no model required
deterministic_scorer = ExactMatch()
```

For metric-specific parameters and advanced usage options, see the RAGAS documentation.