Built-in AI judges

Preview

This feature is in Public Preview.

This article covers the details of each of the AI judges that are built into Mosaic AI Agent Evaluation, including required inputs and output metrics.

See also:

AI judges overview

Note

Not all judges require ground-truth labels. Judges that do not require labels are useful when you have only a set of requests to evaluate your agent.
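
For example, a minimal sketch of a label-free evaluation run, assuming a hypothetical serving endpoint named my-agent-endpoint (any Python callable or logged model accepted by mlflow.evaluate() works the same way):

import mlflow

# Only requests, no ground-truth labels. Judges that do not require labels
# (for example relevance_to_query, groundedness, safety, chunk_relevance)
# can still score the responses that the agent generates.
eval_set = [
  {"request": "What is the capital of France?"},
  {"request": "How does reduceByKey differ from groupByKey in Spark?"},
]

mlflow.evaluate(
  data=eval_set,
  model="endpoints:/my-agent-endpoint",  # hypothetical endpoint name
  model_type="databricks-agent",
)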

| Name of the judge | Quality aspect that the judge assesses | Required inputs | Requires ground truth |
|---|---|---|---|
| relevance_to_query | Does the response address (is it relevant to) the user’s request? | response, request | No |
| groundedness | Is the generated response grounded in the retrieved context (not hallucinating)? | response, trace[retrieved_context] | No |
| safety | Is there harmful or toxic content in the response? | response | No |
| correctness | Is the generated response accurate (as compared to the ground truth)? | response, expected_response | Yes |
| guideline_adherence | Does the generated response adhere to the provided per-question guidelines? | request, response, guidelines | Yes |
| global_guideline_adherence | Does the generated response adhere to the global guidelines? | request, response, global_guidelines (from the evaluator_config) | No (but requires global_guidelines) |
| chunk_relevance | Did the retriever find chunks that are useful (relevant) in answering the user’s request? Note: This judge is applied separately to each retrieved chunk, producing a score and rationale for each chunk. These scores are aggregated into a chunk_relevance/precision score for each row that represents the percentage of chunks that are relevant. | retrieved_context, request | No |
| document_recall | How many of the known relevant documents did the retriever find? | retrieved_context, expected_retrieved_context[].doc_uri | Yes |
| context_sufficiency | Did the retriever find documents with sufficient information to produce the expected response? | retrieved_context, expected_response | Yes |

Note

For multi-turn conversations, AI judges evaluate only the last entry in the conversation.

AI judge outputs

Each judge used in the evaluation outputs the following columns:

| Data field | Type | Description |
|---|---|---|
| response/llm_judged/{judge_name}/rating | string | yes if the judge passes, no if the judge fails. |
| response/llm_judged/{judge_name}/rationale | string | LLM’s written reasoning for yes or no. |
| response/llm_judged/{judge_name}/error_message | string | If there was an error computing this assessment, details of the error are here. If no error, this is NULL. |

Each judge also produces an aggregate metric for the entire run, for example:

| Metric name | Type | Description |
|---|---|---|
| response/llm_judged/safety/rating/average | float, [0, 1] | Percentage of all evaluations that were judged to be yes. |
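
A minimal sketch of reading both kinds of output from the object returned by mlflow.evaluate() (the "eval_results" table name is an assumption and may differ by version):

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris."
}]

results = mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={"databricks-agent": {"metrics": ["safety"]}}
)

# Aggregate metric for the whole run.
print(results.metrics["response/llm_judged/safety/rating/average"])

# Per-row judge outputs (rating and rationale) as a pandas DataFrame.
per_row = results.tables["eval_results"]
print(per_row[["request",
               "response/llm_judged/safety/rating",
               "response/llm_judged/safety/rationale"]])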

Correctness

Definition: Did the agent respond with a factually accurate answer?

Requires ground-truth: Yes, expected_facts or expected_response.

Correctness compares the agent’s actual response to a ground-truth label and is a good way to detect factual errors.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

  • expected_facts or expected_response

Important

Databricks recommends using expected_facts instead of expected_response. expected_facts represent the minimal set of facts required in a correct response and are easier for subject matter experts to curate.

If you must use expected_response, it should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, edit the response to remove any text that is not required for an answer to be considered correct.

Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

Examples

Use correctness from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the difference between reduceByKey and groupByKey in Spark?",
  "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
  "expected_facts": [
    "reduceByKey aggregates data before shuffling",
    "groupByKey shuffles all data",
  ]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["correctness"]
      }
  }
)

Use correctness with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.correctness(
  request="What is the difference between reduceByKey and groupByKey in Spark?",
  response="reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
  expected_facts=[
    "reduceByKey aggregates data before shuffling",
    "groupByKey shuffles all data",
  ]
)
print(assessment)

What to do when a response is incorrect?

When an agent responds with a factually inaccurate answer, you should:

  • Determine whether any context retrieved by the agent is irrelevant or inaccurate. For RAG applications, you can use the Context sufficiency judge to determine if the context is sufficient to generate the expected_facts or expected_response.

  • If there is sufficient context, adjust the prompt so that the model uses the relevant information when generating the response.

Relevance to query

Definition: Is the response relevant to the input request?

Requires ground-truth: No.

Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

Examples

Use relevance from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris."
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["relevance_to_query"]
      }
  }
)

Use relevance with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.relevance_to_query(
  request="What is the capital of France?",
  response="The capital of France is Paris."
)
print(assessment)

What to do when a response is not relevant?

When the agent provides an irrelevant response, consider the following steps:

  • Evaluate the model’s understanding of the request and adjust its retriever, training data, or prompt instructions accordingly.

Groundedness

Definition: Is the response factually consistent with the retrieved context?

Requires ground-truth: No.

Groundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

  • retrieved_context[].content if you do not use the model argument in the call to mlflow.evaluate().

Examples

Use groundedness from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "retrieved_context": [
    {"content": "Paris is the capital city of France."}
  ]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["groundedness"]
      }
  }
)

Use groundedness with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.groundedness(
  request="What is the capital of France?",
  response="The capital of France is Paris.",
  retrieved_context=[
    {"content": "Paris is the capital city of France."}
  ]
)
print(assessment)

What to do when the response lacks groundedness?

When the response is not grounded:

  • Review the retrieved context to ensure it includes the necessary information to generate the expected response.

  • If the context is insufficient, improve the retrieval mechanism or dataset to include relevant documents.

  • Modify the prompt to instruct the model to prioritize using the retrieved context when generating responses.

Guideline adherence

Definition: Does the response adhere to the provided guidelines?

Requires ground-truth: No when using global_guidelines. Yes when using per-row guidelines.

Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.

Guidelines can be defined:

  • per-row: The response of a specific request must adhere to guidelines defined on that evaluation row.

  • globally: All responses for any request must adhere to global guidelines.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

  • per-row guidelines or global_guidelines defined in the config.

Examples

Use per-row guideline adherence from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "guidelines": ["The response must be in English", "The response must be concise"]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["guideline_adherence"]
      }
  }
)

Use global guideline adherence from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris.",
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["guideline_adherence"],
          "global_guidelines": ["The response must be in English", "The response must be concise"]
      }
  }
)

Use guideline adherence with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.guideline_adherence(
  request="What is the capital of France?",
  response="The capital of France is Paris.",
  guidelines=["The response must be in English", "The response must be concise"]
)
print(assessment)

What to do when the response does not adhere to guidelines?

When the response violates the guidelines:

  • Identify which guideline was violated and analyze why the agent failed to adhere to it.

  • Adjust the prompt to emphasize adherence to specific guidelines or retrain the model with additional examples that align with the desired behavior.

  • For global guidelines, ensure they are specified correctly in the evaluator configuration.

Safety

Definition: Does the response avoid harmful or toxic content?

Requires ground-truth: No.

Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

Examples

Use safety from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris."
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["safety"]
      }
  }
)

Use safety with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.safety(
  request="What is the capital of France?",
  response="The capital of France is Paris."
)
print(assessment)

What to do when the response is unsafe?

When the response includes harmful content:

  • Analyze the request to identify if it might inadvertently lead to unsafe responses. Modify the input if necessary.

  • Refine the model or prompt to explicitly avoid generating harmful or toxic content.

  • Employ additional safety mechanisms, such as content filters, to intercept unsafe responses before they reach the user.

Context sufficiency

Definition: Are the retrieved documents sufficient to produce the expected response?

Requires ground-truth: Yes, expected_facts or expected_response.

Context sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.

Required inputs

The input evaluation set must have the following columns:

  • request

  • response if you have not specified the model parameter to mlflow.evaluate().

  • retrieved_context[].content if you have not specified the model parameter to mlflow.evaluate().

  • expected_facts or expected_response

Examples

Use context sufficiency from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "retrieved_context": [
    {"content": "Paris is the capital city of France."}
  ],
  "expected_facts": [
    "Paris"
  ]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["context_sufficiency"]
      }
  }
)

Use context sufficiency with the callable judge SDK:

from databricks.agents.evals import judges

assessment = judges.context_sufficiency(
  request="What is the capital of France?",
  retrieved_context=[
    {"content": "Paris is the capital city of France."}
  ],
  expected_facts=[
    "Paris"
  ]
)
print(assessment)

What to do when the context is insufficient?

When the context is insufficient:

  • Enhance the retrieval mechanism to ensure that all necessary documents are included.

  • Modify the model prompt to explicitly reference missing information or prioritize relevant context.

Chunk relevance

Definition: Are the retrieved chunks relevant to the input request?

Requires ground-truth: No.

Chunk relevance measures whether each chunk is relevant to the input request.

Required inputs

The input evaluation set must have the following columns:

  • request

  • retrieved_context[].content or trace if you have not specified the model parameter to mlflow.evaluate().

Examples

Use chunk relevance precision from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "retrieved_context": [
    {"content": "Paris is the capital of France."},
    {"content": "France is a country in Europe."}
  ]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["chunk_relevance_precision"]
      }
  }
)
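
If your version of the callable judge SDK also exposes a chunk-relevance judge (an assumption; such a function is not shown in this article), the call would mirror the other judges and return one assessment per retrieved chunk:

from databricks.agents.evals import judges

# Assumption: judges.chunk_relevance exists and returns one rating and
# rationale per retrieved chunk.
assessments = judges.chunk_relevance(
  request="What is the capital of France?",
  retrieved_context=[
    {"content": "Paris is the capital of France."},
    {"content": "France is a country in Europe."}
  ]
)
print(assessments)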

What to do when retrieved chunks are irrelevant?

When irrelevant chunks are retrieved:

  • Assess the retriever’s configuration and adjust parameters to prioritize relevance.

  • Refine the retriever’s training data to include more diverse or accurate examples.

Document recall

Definition: How many of the known relevant documents did the retriever find?

Requires ground-truth: Yes, expected_retrieved_context[].doc_uri.

Document recall measures the proportion of the ground-truth relevant documents that the retriever found.

Required inputs

The input evaluation set must have the following column:

  • expected_retrieved_context[].doc_uri

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].doc_uri or trace.

Examples

Use document recall from an evaluation set:

import mlflow

eval_set = [{
  "request": "What is the capital of France?",
  "expected_retrieved_context": [
    {"doc_uri": "doc_123"},
    {"doc_uri": "doc_456"}
  ],
  "retrieved_context": [
    {"doc_uri": "doc_123"}
  ]
}]

mlflow.evaluate(
  data=eval_set,
  model_type="databricks-agent",
  evaluator_config={
      "databricks-agent": {
          "metrics": ["document_recall"]
      }
  }
)
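
To make the arithmetic concrete: the example above lists two expected documents (doc_123 and doc_456) and the retriever returned only doc_123, so document recall is 1/2 = 0.5. A minimal sketch of that computation (illustrative only, not Agent Evaluation’s implementation):

# Illustrative recall computation, not the library's internal code.
expected = {"doc_123", "doc_456"}   # ground-truth relevant documents
retrieved = {"doc_123"}             # documents the retriever returned

recall = len(expected & retrieved) / len(expected)
print(recall)  # 0.5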

There is no callable judge SDK for this metric as it does not use an AI judge.

What to do when document recall is low?

When recall is low:

  • Verify that the ground truth data accurately reflects relevant documents.

  • Improve the retriever or adjust search parameters to increase recall.

Custom judges

You can create a custom judge to perform assessments specific to your use case. For details, see Create custom LLM judges.

The output produced by a custom judge depends on its assessment_type, ANSWER or RETRIEVAL.

Custom LLM judge for ANSWER assessment

A custom LLM judge for ANSWER assessment evaluates the response for each question.

Outputs provided for each assessment:

| Data field | Type | Description |
|---|---|---|
| response/llm_judged/{assessment_name}/rating | string | yes or no. |
| response/llm_judged/{assessment_name}/rationale | string | LLM’s written reasoning for yes or no. |
| response/llm_judged/{assessment_name}/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |

The following metric is calculated for the entire evaluation set:

| Metric name | Type | Description |
|---|---|---|
| response/llm_judged/{assessment_name}/rating/percentage | float, [0, 1] | Across all questions, percentage where {assessment_name} is judged as yes. |

Custom LLM judge for RETRIEVAL assessment

A custom LLM judge for RETRIEVAL assessment evaluates each retrieved chunk across all questions.

Outputs provided for each assessment:

| Data field | Type | Description |
|---|---|---|
| retrieval/llm_judged/{assessment_name}/ratings | array[string] | Evaluation of the custom judge for each chunk, yes or no. |
| retrieval/llm_judged/{assessment_name}/rationales | array[string] | For each chunk, LLM’s written reasoning for yes or no. |
| retrieval/llm_judged/{assessment_name}/error_messages | array[string] | For each chunk, if there was an error computing this metric, details of the error are here and other values are NULL. If no error, this is NULL. |
| retrieval/llm_judged/{assessment_name}/precision | float, [0, 1] | Percentage of all retrieved chunks that the custom judge evaluated as yes. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
|---|---|---|
| retrieval/llm_judged/{assessment_name}/precision/average | float, [0, 1] | Average value of {assessment_name}_precision across all questions. |
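
To illustrate how the per-row precision and the run-level average relate (illustrative arithmetic only, not the library’s implementation):

# One list of per-chunk ratings for each question in the evaluation set.
per_row_ratings = [["yes", "no", "yes"], ["yes", "yes"]]

# Per-row precision: fraction of chunks judged yes for that question.
per_row_precision = [r.count("yes") / len(r) for r in per_row_ratings]   # [0.667, 1.0]

# Run-level metric: average of the per-row precision values.
average_precision = sum(per_row_precision) / len(per_row_precision)      # about 0.83
print(per_row_precision, average_precision)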