# RAGAS scorers
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework for LLM applications. MLflow integrates with RAGAS so that you can use RAGAS metrics as scorers for evaluating retrieval quality, answer generation, agent behavior, and text similarity.
## Requirements

Install the ragas package:

```shell
%pip install ragas
```
## Quick start

To call a RAGAS scorer directly on an existing trace:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")
feedback = scorer(trace=trace)  # trace: a previously fetched MLflow trace
print(feedback.value)      # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```
To run RAGAS scorers over a set of traces with mlflow.genai.evaluate():

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="databricks:/databricks-gpt-5-mini"),
        ContextPrecision(model="databricks:/databricks-gpt-5-mini"),
    ],
)
```
## Available RAGAS scorers

### RAG metrics

These scorers evaluate retrieval quality and answer generation in retrieval-augmented generation (RAG) applications.

| Scorer | What does it evaluate? |
|---|---|
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? |
| ContextUtilization | How effectively is the retrieved context being used in the answer? |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision using reference answers. |
| ContextRecall | Does the retrieval context contain all information needed to answer the query? |
| NonLLMContextRecall | Non-LLM variant of context recall using reference answers. |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? |
| ResponseRelevancy | How relevant is the generated answer to the input query? |
| Faithfulness | Is the output factually consistent with the retrieval context? |
| AnswerAccuracy | How accurate is the answer compared to the ground truth? |
| ContextRelevance | How relevant is the retrieved context to the input query? |
| ResponseGroundedness | Is the response grounded in the provided context? |
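To build intuition for grounding-oriented metrics such as Faithfulness, the underlying question is how much of the answer is supported by the retrieved context. The token-overlap sketch below only illustrates that idea; the actual RAGAS metric asks an LLM judge to verify individual claims in the answer against the context.

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context.

    Toy illustration of grounding only; RAGAS's Faithfulness instead uses
    an LLM judge to check each claim in the answer against the context.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "mlflow integrates with ragas for llm evaluation"
print(support_score("mlflow integrates with ragas", context))  # 1.0
print(support_score("mlflow uses pytest", context))  # one of three tokens matches
```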
### Agent and tool use metrics

These scorers evaluate AI agent behavior, including tool invocation accuracy and goal achievement.

| Scorer | What does it evaluate? |
|---|---|
| TopicAdherence | Does the agent stay on topic during the conversation? |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? |
| ToolCallF1 | F1 score for tool call prediction. |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal, compared against a reference answer? |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal, evaluated without a reference answer? |
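The tool-call F1 metric above combines precision and recall over the agent's tool invocations. A minimal sketch of that computation on (tool name, arguments) pairs, shown as an illustration of the idea rather than the RAGAS implementation:

```python
def tool_call_f1(predicted: list, expected: list) -> float:
    """F1 over (tool name, arguments) pairs.

    Toy sketch of the idea behind tool-call F1, not the RAGAS code.
    """
    pred, exp = set(predicted), set(expected)
    if not pred or not exp:
        return 0.0
    tp = len(pred & exp)  # tool calls that were both predicted and expected
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)

predicted = [("search_docs", "query=ragas"), ("send_email", "to=ops")]
expected = [("search_docs", "query=ragas")]
print(tool_call_f1(predicted, expected))  # precision 0.5, recall 1.0
```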
### Natural language comparison

These scorers compare generated text against expected output using both semantic and deterministic methods.

| Scorer | What does it evaluate? |
|---|---|
| FactualCorrectness | Is the output factually correct compared to the expected answer? |
| SemanticSimilarity | Semantic similarity between the output and the expected answer. |
| NonLLMStringSimilarity | String similarity between the output and the expected answer. |
| BleuScore | BLEU score for text comparison. |
| ChrfScore | CHRF score for text comparison. |
| RougeScore | ROUGE score for text comparison. |
| StringPresence | Is a specific string present in the output? |
| ExactMatch | Does the output exactly match the expected output? |
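The deterministic comparisons in this table need no LLM at all. As a rough illustration of what exact match, string presence, and a non-LLM string similarity compute, here is a sketch using the stdlib difflib ratio as a stand-in similarity (RAGAS's own string similarity is based on edit-distance measures):

```python
from difflib import SequenceMatcher

# Toy sketches of the deterministic checks, not the RAGAS implementations.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def string_presence(output: str, needle: str) -> float:
    return 1.0 if needle in output else 0.0

def string_similarity(output: str, expected: str) -> float:
    # difflib's ratio is one simple stand-in for an edit-distance similarity.
    return SequenceMatcher(None, output, expected).ratio()

print(exact_match("Paris", "Paris"))              # 1.0
print(string_presence("Paris, France", "Paris"))  # 1.0
print(string_similarity("Paris", "Pariss"))       # close to, but below, 1.0
```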
### General purpose

These scorers provide flexible, customizable evaluation logic.

| Scorer | What does it evaluate? |
|---|---|
| AspectCritic | Evaluates specific aspects of the output using an LLM. |
| DiscreteMetric | Custom discrete metric with flexible scoring logic. |
| RubricsScore | Scores output based on predefined rubrics. |
| InstanceRubrics | Scores output based on instance-specific rubrics. |
### Other tasks

| Scorer | What does it evaluate? |
|---|---|
| SummarizationScore | Quality of text summarization. |
## Create a scorer by name

You can create a scorer dynamically by passing the metric name as a string to get_scorer:

```python
from mlflow.genai.scorers.ragas import get_scorer

scorer = get_scorer(
    metric_name="Faithfulness",
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(trace=trace)
```
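A lookup like get_scorer typically maps the metric name to a scorer class and forwards the remaining keyword arguments to that class's constructor. A minimal sketch of this registry pattern, with a stand-in scorer class rather than MLflow's actual implementation:

```python
class Faithfulness:
    """Stand-in scorer class for illustration only."""

    def __init__(self, model=None):
        self.model = model

# Hypothetical name-to-class registry; the real lookup table lives in MLflow.
_REGISTRY = {"Faithfulness": Faithfulness}

def get_scorer_sketch(metric_name: str, **kwargs):
    """Resolve a metric name to a scorer class and construct it."""
    try:
        cls = _REGISTRY[metric_name]
    except KeyError:
        raise ValueError(f"Unknown metric: {metric_name!r}") from None
    return cls(**kwargs)

scorer = get_scorer_sketch("Faithfulness", model="databricks:/databricks-gpt-5-mini")
print(type(scorer).__name__)  # Faithfulness
```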
## Configuration

RAGAS scorers accept metric-specific parameters as keyword arguments to the constructor. LLM-based metrics require a model parameter; non-LLM metrics do not.

```python
from mlflow.genai.scorers.ragas import Faithfulness, ExactMatch

# LLM-based metric: requires a model
scorer = Faithfulness(model="databricks:/databricks-gpt-5-mini")

# Non-LLM metric: no model required
deterministic_scorer = ExactMatch()
```

For metric-specific parameters and advanced usage options, see the RAGAS documentation.
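Since LLM-based scorers return a numeric value between 0.0 and 1.0 (see the quick start), a common follow-up is turning that score into a pass/fail gate, for example in CI. A sketch using a stand-in feedback object; the threshold and the stub class are illustrative, only the value field mirrors the feedback shown earlier:

```python
from dataclasses import dataclass

@dataclass
class FeedbackStub:
    # Stand-in for the feedback object a scorer returns; the real object
    # also carries a rationale string.
    value: float

def passes(feedback: FeedbackStub, threshold: float = 0.8) -> bool:
    """Gate a 0.0-1.0 score against a quality threshold."""
    return feedback.value >= threshold

print(passes(FeedbackStub(value=0.93)))  # True
print(passes(FeedbackStub(value=0.41)))  # False
```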