# RAGAS scorers
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework for LLM applications. MLflow integrates with RAGAS so that you can use RAGAS metrics as scorers for evaluating retrieval quality, answer generation, agent behavior, and text similarity.
## Requirements

Install the ragas package:

```shell
%pip install ragas
```
## Quick start

To call a RAGAS scorer directly on an existing trace:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")
feedback = scorer(trace=trace)  # trace: a previously fetched MLflow trace
print(feedback.value)      # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```
To run RAGAS scorers over a set of traces with mlflow.genai.evaluate():

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="databricks:/databricks-gpt-5-mini"),
        ContextPrecision(model="databricks:/databricks-gpt-5-mini"),
    ],
)
```
## Available RAGAS scorers

### RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? |
|---|---|
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? |
| ContextUtilization | How effectively is the retrieved context being used in the answer? |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision using reference answers. |
| ContextRecall | Does the retrieval context contain all information needed to answer the query? |
| NonLLMContextRecall | Non-LLM variant of context recall using reference answers. |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? |
| ResponseRelevancy | How relevant is the generated answer to the input query? |
| Faithfulness | Is the output factually consistent with the retrieval context? |
| AnswerAccuracy | How accurate is the answer compared to the ground truth? |
| ContextRelevance | How relevant is the retrieved context to the input query? |
| ResponseGroundedness | Is the response grounded in the provided context? |
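To build intuition for grounding-oriented metrics such as Faithfulness, the underlying question is how much of the answer is supported by the retrieved context. The token-overlap sketch below only illustrates that idea; the actual RAGAS metric asks an LLM judge to verify individual claims in the answer against the context.

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context.

    Toy illustration of grounding only; RAGAS's Faithfulness instead uses
    an LLM judge to check each claim in the answer against the context.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "mlflow integrates with ragas for llm evaluation"
print(support_score("mlflow integrates with ragas", context))  # 1.0
print(support_score("mlflow uses pytest", context))  # one of three tokens matches
```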
### Agent and tool use metrics

These scorers evaluate AI agent behavior, including tool invocation accuracy and goal achievement.

| Scorer | What does it evaluate? |
|---|---|
| TopicAdherence | Does the agent stay on topic during the conversation? |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? |
| ToolCallF1 | F1 score for tool call prediction. |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal, compared against a reference answer? |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal, evaluated without a reference answer? |
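The tool-call F1 metric above combines precision and recall over the agent's tool invocations. A minimal sketch of that computation on (tool name, arguments) pairs, shown as an illustration of the idea rather than the RAGAS implementation:

```python
def tool_call_f1(predicted: list, expected: list) -> float:
    """F1 over (tool name, arguments) pairs.

    Toy sketch of the idea behind tool-call F1, not the RAGAS code.
    """
    pred, exp = set(predicted), set(expected)
    if not pred or not exp:
        return 0.0
    tp = len(pred & exp)  # tool calls that were both predicted and expected
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)

predicted = [("search_docs", "query=ragas"), ("send_email", "to=ops")]
expected = [("search_docs", "query=ragas")]
print(tool_call_f1(predicted, expected))  # precision 0.5, recall 1.0
```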
### Natural language comparison

These scorers compare generated text against expected output using both semantic and deterministic methods.

| Scorer | What does it evaluate? |
|---|---|
| FactualCorrectness | Is the output factually correct compared to the expected answer? |
| SemanticSimilarity | Semantic similarity between the output and the expected answer. |
| NonLLMStringSimilarity | String similarity between the output and the expected answer. |
| BleuScore | BLEU score for text comparison. |
| ChrfScore | CHRF score for text comparison. |
| RougeScore | ROUGE score for text comparison. |
| StringPresence | Is a specific string present in the output? |
| ExactMatch | Does the output exactly match the expected output? |
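The deterministic comparisons in this table need no LLM at all. As a rough illustration of what exact match, string presence, and a non-LLM string similarity compute, here is a sketch using the stdlib difflib ratio as a stand-in similarity (RAGAS's own string similarity is based on edit-distance measures):

```python
from difflib import SequenceMatcher

# Toy sketches of the deterministic checks, not the RAGAS implementations.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def string_presence(output: str, needle: str) -> float:
    return 1.0 if needle in output else 0.0

def string_similarity(output: str, expected: str) -> float:
    # difflib's ratio is one simple stand-in for an edit-distance similarity.
    return SequenceMatcher(None, output, expected).ratio()

print(exact_match("Paris", "Paris"))              # 1.0
print(string_presence("Paris, France", "Paris"))  # 1.0
print(string_similarity("Paris", "Pariss"))       # close to, but below, 1.0
```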
### General purpose

These scorers provide flexible, customizable evaluation logic.

| Scorer | What does it evaluate? |
|---|---|
| AspectCritic | Evaluates specific aspects of the output using an LLM. |
| DiscreteMetric | Custom discrete metric with flexible scoring logic. |
| RubricsScore | Scores output based on predefined rubrics. |
| InstanceRubrics | Scores output based on instance-specific rubrics. |
### Other tasks

| Scorer | What does it evaluate? |
|---|---|
| SummarizationScore | Quality of text summarization. |
## Create a scorer by name

You can create a scorer dynamically by passing the metric name as a string to get_scorer:

```python
from mlflow.genai.scorers.ragas import get_scorer

scorer = get_scorer(
    metric_name="Faithfulness",
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(trace=trace)
```
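A lookup like get_scorer typically maps the metric name to a scorer class and forwards the remaining keyword arguments to that class's constructor. A minimal sketch of this registry pattern, with a stand-in scorer class rather than MLflow's actual implementation:

```python
class Faithfulness:
    """Stand-in scorer class for illustration only."""

    def __init__(self, model=None):
        self.model = model

# Hypothetical name-to-class registry; the real lookup table lives in MLflow.
_REGISTRY = {"Faithfulness": Faithfulness}

def get_scorer_sketch(metric_name: str, **kwargs):
    """Resolve a metric name to a scorer class and construct it."""
    try:
        cls = _REGISTRY[metric_name]
    except KeyError:
        raise ValueError(f"Unknown metric: {metric_name!r}") from None
    return cls(**kwargs)

scorer = get_scorer_sketch("Faithfulness", model="databricks:/databricks-gpt-5-mini")
print(type(scorer).__name__)  # Faithfulness
```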
## Configuration

RAGAS scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a model parameter; non-LLM metrics do not.

```python
from mlflow.genai.scorers.ragas import Faithfulness, ExactMatch

# LLM-based metric: requires a model
scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")

# Non-LLM metric: no model required
deterministic_scorer = ExactMatch()
```

For metric-specific parameters and advanced usage options, see the RAGAS documentation.
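Since LLM-based scorers return a numeric value between 0.0 and 1.0 (see the quick start), a common follow-up is turning that score into a pass/fail gate, for example in CI. A sketch using a stand-in feedback object; the threshold and the stub class are illustrative, only the value field mirrors the feedback shown earlier:

```python
from dataclasses import dataclass

@dataclass
class FeedbackStub:
    # Stand-in for the feedback object a scorer returns; the real object
    # also carries a rationale string.
    value: float

def passes(feedback: FeedbackStub, threshold: float = 0.8) -> bool:
    """Gate a 0.0-1.0 score against a quality threshold."""
    return feedback.value >= threshold

print(passes(FeedbackStub(value=0.93)))  # True
print(passes(FeedbackStub(value=0.41)))  # False
```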