Answer & Context Relevance judge & scorers
The judges.is_context_relevant() predefined judge assesses whether context, either retrieved by your RAG system or generated by a tool call, is relevant to the user's request. This is crucial for diagnosing quality issues: if the context isn't relevant, the generation step cannot produce a helpful response.
This judge is available through two predefined scorers:
- RelevanceToQuery: Evaluates if your app's response directly addresses the user's input
- RetrievalRelevance: Evaluates if each document returned by your app's retriever(s) is relevant
API Signature
For details, see mlflow.genai.judges.is_context_relevant().
from mlflow.genai.judges import is_context_relevant
def is_context_relevant(
    *,
    request: str,                 # User's question or query
    context: Any,                 # Context to evaluate for relevance; can be any Python primitive or a JSON-serializable dict
    name: Optional[str] = None,   # Optional custom name for display in the MLflow UIs
    model: Optional[str] = None,  # Optional LiteLLM-compatible custom judge model
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""
By default, this judge uses a specially tuned, Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by passing the model argument in the scorer definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
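For example, both of the following are valid model strings (the specific model names are illustrative, not recommendations):
# Databricks serving endpoint: the model name matches the endpoint name
model = "databricks:/databricks-gpt-oss-120b"
# Any other LiteLLM-compatible provider, e.g. OpenAI (requires that provider's credentials be configured)
model = "openai:/gpt-4o-mini"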
Prerequisites for running the examples
- Install MLflow and required packages:
pip install --upgrade "mlflow[databricks]>=3.4.0" openai "databricks-connect>=16.1"
- Create an MLflow experiment by following the set up your environment quickstart (a minimal programmatic sketch follows this list).
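If you are working outside a Databricks notebook, the experiment setup can also be done programmatically. A minimal sketch, assuming your Databricks credentials are already configured and using a placeholder experiment path:
import mlflow
# Point MLflow at your Databricks workspace (credentials come from your environment or config profile)
mlflow.set_tracking_uri("databricks")
# Creates the experiment if it does not exist, then sets it as the active experiment
# (the path below is a placeholder; use a path in your own workspace)
mlflow.set_experiment("/Shared/context-relevance-examples")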
Direct SDK Usage
from mlflow.genai.judges import is_context_relevant
# Example 1: Relevant context
feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is the capital of France."
)
print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of relevance

# Example 2: Irrelevant context
feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower."
)
print(feedback.value)      # "no"
print(feedback.rationale)  # Explanation of why it's not relevant

# Example 3: Custom judge model
feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower.",
    model="databricks:/databricks-gpt-oss-120b",
)
Using the prebuilt scorers
The is_context_relevant judge is available through two prebuilt scorers:
1. RelevanceToQuery scorer
This scorer evaluates if your app's response directly addresses the user's input without deviating into unrelated topics.
Requirements:
- Trace requirements: inputs and outputs must be on the Trace's root span
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the capital of France. It's known for the Eiffel Tower and is a major European city."
        },
    },
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "France is a beautiful country with great wine and cuisine."
        },
    },
]

# Run evaluation with the RelevanceToQuery scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        RelevanceToQuery(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-hosted judge model.
        )
    ],
)
2. RetrievalRelevance scorer
This scorer evaluates if each document returned by your app's retriever(s) is relevant to the input request.
Requirements:
- Trace requirements: The MLflow Trace must contain at least one span with span_type set to RETRIEVER
import mlflow
from mlflow.genai.scorers import RetrievalRelevance
from mlflow.entities import Document
from typing import List
# Define a retriever function with the proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - in practice, this would query a vector database
    if "capital" in query.lower() and "france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="The Eiffel Tower is located in Paris.",
                metadata={"source": "landmarks.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Python is a programming language.",
                metadata={"source": "tech.txt"}
            )
        ]

# Define your app that uses the retriever
@mlflow.trace
def rag_app(query: str):
    docs = retrieve_docs(query)
    # In practice, you would pass these docs to an LLM
    return {"response": f"Found {len(docs)} relevant documents."}

# Create the evaluation dataset
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"}
    },
    {
        "inputs": {"query": "How do I use Python?"}
    }
]

# Run evaluation with the RetrievalRelevance scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalRelevance(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-hosted judge model.
        )
    ]
)
Using in a custom scorer
When evaluating applications whose data structure differs from what the predefined scorers require, wrap the judge in a custom scorer:
import mlflow
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Dict, Any
eval_dataset = [
    {
        "inputs": {"query": "What are MLflow's main components?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow has four main components: Tracking, Projects, Models, and Registry."}
            ]
        }
    },
    {
        "inputs": {"query": "What are MLflow's main components?"},
        "outputs": {
            "retrieved_context": [
                {"content": "Python is a popular programming language."}
            ]
        }
    }
]
@scorer
def context_relevance_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # Pass the retrieved context chunks to the judge; it accepts any JSON-serializable value
    context = outputs["retrieved_context"]
    return is_context_relevant(
        request=inputs["query"],
        context=context
    )
# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[context_relevance_scorer]
)
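If you want a separate relevance verdict for each retrieved chunk rather than a single verdict for the whole list, you can call the judge once per chunk. The sketch below assumes that a custom scorer may return a list of Feedback objects and uses a hypothetical per-chunk naming scheme; if your MLflow version expects a single Feedback, aggregate the per-chunk verdicts instead.
from typing import Any, Dict
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer

@scorer
def per_chunk_relevance(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # Judge each retrieved chunk independently and return one Feedback per chunk
    return [
        is_context_relevant(
            request=inputs["query"],
            context=chunk["content"],
            name=f"chunk_{i}_relevance",  # hypothetical per-chunk display name
        )
        for i, chunk in enumerate(outputs["retrieved_context"])
    ]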
Interpreting Results
The judge returns a Feedback object with:
- value: "yes" if the context is relevant, "no" if not
- rationale: Explanation of why the context was deemed relevant or irrelevant
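Because the value is a plain "yes"/"no" string, you can branch on it directly, for example to surface the rationale only for failures (a minimal sketch reusing the direct SDK call from above):
feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower.",
)

# Surface the judge's reasoning only when the context is judged irrelevant
if feedback.value == "no":
    print(f"Irrelevant context: {feedback.rationale}")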
Next Steps
- Explore other predefined judges - Learn about groundedness, safety, and correctness judges
- Create custom judges - Build specialized judges for your use case
- Evaluate RAG applications - Apply relevance judges in comprehensive RAG evaluation