RetrievalGroundedness judge

The RetrievalGroundedness judge assesses whether your application's response is factually supported by the provided context (retrieved by a RAG system or returned by a tool call), helping you detect hallucinations and statements not backed by that context.

This built-in LLM judge is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

Prerequisites for running the examples

  1. Install MLflow and required packages

    Python
    %pip install --upgrade "mlflow[databricks]>=3.4.0"
    dbutils.library.restartPython()
  2. Create an MLflow experiment by following the Set up your environment quickstart.

Usage examples

The RetrievalGroundedness judge can be invoked directly for single trace assessment or used with MLflow's evaluation framework for batch evaluation.

Requirements:

  • Trace requirements:
    • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be on the Trace's root span
Python
from mlflow.genai.scorers import retrieval_groundedness
import mlflow

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess if the response is grounded in the retrieved context
feedback = retrieval_groundedness(trace=trace)
print(feedback)
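
If you don't yet have a trace to score, any traced function that calls a retriever produces a trace satisfying the requirements above. The following is a minimal, hypothetical sketch (the retriever, document, and query are placeholders, and it assumes mlflow.get_last_active_trace_id() is available in your MLflow version to look up the trace just produced):

Python
from typing import List

import mlflow
from mlflow.entities import Document
from mlflow.genai.scorers import retrieval_groundedness


# Hypothetical retriever; the RETRIEVER span type is what the judge inspects
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [Document(id="doc_1", page_content="MLflow is an open-source platform for the ML lifecycle.")]


# The root span records this function's inputs and outputs automatically
@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    return f"Based on the docs: {docs[0].page_content}"


answer("What is MLflow?")

# Look up the trace produced by the call above and assess it
trace_id = mlflow.get_last_active_trace_id()
feedback = retrieval_groundedness(trace=mlflow.get_trace(trace_id))
print(feedback.value, feedback.rationale)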

RAG example

Here's a complete example showing how to create a RAG application and evaluate whether its responses are grounded in the retrieved context:

  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Use the Databricks SDK to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    Python
    import mlflow
    from databricks.sdk import WorkspaceClient

    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()

    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")

    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()

    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
  2. Define and evaluate your RAG application:

    Python
    from mlflow.genai.scorers import RetrievalGroundedness
    from mlflow.entities import Document
    from typing import List


    # Define a retriever function with proper span type
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        # Simulated retrieval based on query
        if "mlflow" in query.lower():
            return [
                Document(
                    id="doc_1",
                    page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                    metadata={"source": "mlflow_docs.txt"}
                ),
                Document(
                    id="doc_2",
                    page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                    metadata={"source": "mlflow_features.txt"}
                )
            ]
        else:
            return [
                Document(
                    id="doc_3",
                    page_content="Machine learning involves training models on data.",
                    metadata={"source": "ml_basics.txt"}
                )
            ]

    # Define your RAG app
    @mlflow.trace
    def rag_app(query: str):
        # Retrieve relevant documents
        docs = retrieve_docs(query)
        context = "\n".join([doc.page_content for doc in docs])

        # Generate a response using the LLM
        messages = [
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": query}
        ]

        response = client.chat.completions.create(
            # This example uses Databricks-hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
            model=model_name,
            messages=messages
        )

        return {"response": response.choices[0].message.content}

    # Create an evaluation dataset
    eval_dataset = [
        {
            "inputs": {"query": "What is MLflow used for?"}
        },
        {
            "inputs": {"query": "What are the main features of MLflow?"}
        }
    ]

    # Run evaluation with the RetrievalGroundedness scorer
    eval_results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=rag_app,
        scorers=[
            RetrievalGroundedness(
                model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to a Databricks-hosted judge model.
            )
        ]
    )

Select the LLM that powers the judge

By default, this judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.

You can customize the judge by providing a different judge model:

Python
from mlflow.genai.scorers import RetrievalGroundedness

# Use a different judge model
groundedness_judge = RetrievalGroundedness(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[groundedness_judge]
)

For a list of supported models, see the MLflow documentation.

Interpret results

The judge returns a Feedback object with:

  • value: "yes" if the response is grounded, "no" if it contains unsupported statements (hallucinations)
  • rationale: Detailed explanation identifying:
    • Which statements are supported by context
    • Which statements lack support (hallucinations)
    • Specific quotes from context that support or contradict claims
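
For example, when the judge is invoked directly, these fields are available as attributes on the returned Feedback object (a minimal sketch; the trace ID is a placeholder):

Python
import mlflow
from mlflow.genai.scorers import retrieval_groundedness

feedback = retrieval_groundedness(trace=mlflow.get_trace("<your-trace-id>"))

print(f"Grounded: {feedback.value}")       # "yes" or "no"
print(f"Rationale: {feedback.rationale}")  # the judge's explanation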

Next steps