Context Sufficiency judge & scorer
The judges.is_context_sufficient() predefined judge evaluates whether the context (either retrieved by your RAG system or generated by a tool call) contains enough information to adequately answer the user's request, based on the ground-truth label provided as expected_facts or an expected_response.
This judge is available through the predefined RetrievalSufficiency scorer for evaluating RAG systems where you need to ensure that your retrieval process provides all necessary information.
API Signature
from mlflow.genai.judges import is_context_sufficient

def is_context_sufficient(
    *,
    request: str,                                  # User's question or query
    context: Any,                                  # Context to evaluate for sufficiency; can be any Python primitive or a JSON-serializable dict
    expected_facts: Optional[list[str]] = None,    # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,       # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None                     # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""
Prerequisites for running the examples
- Install MLflow and required packages:
  pip install --upgrade "mlflow[databricks]>=3.1.0"
- Create an MLflow experiment by following the setup your environment quickstart. (A minimal setup sketch follows this list.)
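If you are pointing MLflow at a Databricks workspace, the environment setup typically boils down to exporting credentials and selecting an experiment. The snippet below is a minimal sketch; the host, token, and experiment path are placeholder values, not values defined elsewhere in this guide.

import os
import mlflow

# Placeholder Databricks credentials (hypothetical values - replace with your own)
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"

# Point MLflow at Databricks and select (or create) an experiment
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/context-sufficiency-demo")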
Direct SDK Usage
from mlflow.genai.judges import is_context_sufficient

# Example 1: Context contains sufficient information
feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ],
    expected_facts=["Paris is the capital of France."]
)
print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of sufficiency

# Example 2: Context lacks necessary information
feedback = is_context_sufficient(
    request="What are MLflow's components?",
    context=[
        {"content": "MLflow is an open-source platform."}
    ],
    expected_facts=[
        "MLflow has four main components",
        "Components include Tracking",
        "Components include Projects"
    ]
)
print(feedback.value)      # "no"
print(feedback.rationale)  # Explanation of what's missing
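The examples above use expected_facts, but the judge also accepts a full reference answer through expected_response instead. A minimal sketch (the wording of the reference answer is illustrative):

# Example 3: Ground truth provided as a reference answer instead of facts
feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[{"content": "Paris is the capital of France."}],
    expected_response="The capital of France is Paris."
)
print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of sufficiency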
Using the prebuilt scorer
The is_context_sufficient judge is available through the RetrievalSufficiency prebuilt scorer.
Requirements:
- Trace requirements:
  - The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  - inputs and outputs must be on the Trace's root span
- Ground-truth labels: Required - must provide either expected_facts or expected_response in the expectations dictionary
import os
import mlflow
from openai import OpenAI
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]
    response = client.chat.completions.create(
        # This example uses a Databricks-hosted Claude model. If you provide your own
        # OpenAI credentials, replace it with a valid OpenAI model, e.g., gpt-4o.
        model="databricks-claude-3-7-sonnet",
        messages=messages
    )
    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[RetrievalSufficiency()]
)
Understanding the results
The RetrievalSufficiency scorer evaluates each retriever span separately. It will:
- Return "yes" if the retrieved documents contain all the information needed to generate the expected facts
- Return "no" if the retrieved documents are missing critical information, along with a rationale explaining what's missing
This helps you identify when your retrieval system is failing to fetch all necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.
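Beyond the MLflow UI, you can pull the evaluation traces programmatically to see the per-span feedback. The sketch below assumes the object returned by mlflow.genai.evaluate() exposes a run_id attribute and that mlflow.search_traces() accepts a run_id filter; adjust for your MLflow version if needed.

import mlflow

# Fetch the traces logged during the evaluation run
# (assumes eval_results.run_id is available)
traces = mlflow.search_traces(run_id=eval_results.run_id)

# Each row corresponds to one evaluated trace; the judge's feedback is
# attached to the trace as an assessment
print(traces.columns)
print(traces.head())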
Using in a custom scorer
When your application's data structures do not meet the predefined scorer's requirements, wrap the judge in a custom scorer:
import mlflow
from mlflow.genai.judges import is_context_sufficient
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What are the benefits of MLflow?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow simplifies ML lifecycle management."},
                {"content": "MLflow provides experiment tracking and model versioning."},
                {"content": "MLflow enables easy model deployment."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow simplifies ML lifecycle management",
                "MLflow provides experiment tracking",
                "MLflow enables model deployment"
            ]
        }
    },
    {
        "inputs": {"query": "How does MLflow handle model versioning?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow is an open-source platform."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow Model Registry handles versioning",
                "Models can have multiple versions",
                "Versions can be promoted through stages"
            ]
        }
    }
]

@scorer
def context_sufficiency_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_context_sufficient(
        request=inputs["query"],
        context=outputs["retrieved_context"],
        expected_facts=expectations["expected_facts"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[context_sufficiency_scorer]
)
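If some rows in your dataset provide expected_response rather than expected_facts, a variant of the custom scorer can forward whichever label is present. A minimal sketch, assuming each row carries at least one of the two labels:

@scorer
def flexible_sufficiency_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    # Forward whichever ground-truth label this row provides
    return is_context_sufficient(
        request=inputs["query"],
        context=outputs["retrieved_context"],
        expected_facts=expectations.get("expected_facts"),
        expected_response=expectations.get("expected_response")
    )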
Interpreting Results
The judge returns a Feedback object with:
- value: "yes" if the context is sufficient, "no" if it is insufficient
- rationale: Explanation of which expected facts are covered or missing in the context
Next Steps
- Evaluate context relevance - Ensure retrieved documents are relevant before checking sufficiency
- Evaluate groundedness - Verify that responses use only the provided context
- Build evaluation datasets - Create ground truth datasets with expected facts for testing