Built-in AI judges
Preview
This feature is in Public Preview.
This article covers the details of each of the AI judges that are built into Mosaic AI Agent Evaluation, including required inputs and output metrics.
See also:
AI judges overview
Note
Not all judges require ground-truth labels. Judges that do not require labels are useful when you have only a set of requests to evaluate your agent.
| Name of the judge | Quality aspect that the judge assesses | Required inputs | Requires ground truth |
|---|---|---|---|
| `relevance_to_query` | Does the response address (is it relevant to) the user's request? | `request`, `response` | No |
| `groundedness` | Is the generated response grounded in the retrieved context (not hallucinating)? | `request`, `response`, `retrieved_context[].content` | No |
| `safety` | Does the generated response avoid harmful or toxic content? | `request`, `response` | No |
| `correctness` | Is the generated response accurate (as compared to the ground truth)? | `request`, `response`, `expected_facts` or `expected_response` | Yes |
| `guideline_adherence` (per-row guidelines) | Does the generated response adhere to the provided per-question guidelines? | `request`, `response`, `guidelines` | Yes |
| `guideline_adherence` (global guidelines) | Does the generated response adhere to the global guidelines? | `request`, `response` | No (but requires `global_guidelines` in the evaluator config) |
| `chunk_relevance` | Did the retriever find chunks that are useful (relevant) in answering the user's request? Note: This judge is applied separately to each retrieved chunk, producing a score and rationale for each chunk. These scores are aggregated into a chunk relevance precision score for each row that represents the percentage of retrieved chunks that are relevant. | `request`, `retrieved_context[].content` | No |
| `document_recall` | How many of the known relevant documents did the retriever find? | `retrieved_context[].doc_uri`, `expected_retrieved_context[].doc_uri` | Yes |
| `context_sufficiency` | Did the retriever find documents with sufficient information to produce the expected response? | `request`, `retrieved_context[].content`, `expected_facts` or `expected_response` | Yes |
Note
For multi-turn conversations, AI judges evaluate only the last entry in the conversation.
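For example, assuming your evaluation set uses the chat completion schema for `request` (an illustrative assumption; a plain string request is also supported), only the final user message in the row below is assessed together with the agent's response:

# Hypothetical multi-turn row: the judges consider only the last user message
# ("And what about Germany?") and the agent's final response.
eval_set = [{
    "request": {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
            {"role": "user", "content": "And what about Germany?"},
        ]
    },
    "response": "The capital of Germany is Berlin.",
}]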
AI judge outputs
Each judge used in an evaluation run outputs the following columns:
| Data field | Type | Description |
|---|---|---|
| `response/llm_judged/{judge_name}/rating` | `string` | `yes` or `no`. |
| `response/llm_judged/{judge_name}/rationale` | `string` | LLM's written reasoning for `yes` or `no`. |
| `response/llm_judged/{judge_name}/error_message` | `string` | If there was an error computing this assessment, details of the error are here. If no error, this is NULL. |
Each judge also produces an aggregate metric for the entire run:
| Metric name | Type | Description |
|---|---|---|
| `response/llm_judged/{judge_name}/rating/percentage` | `float, [0, 1]` | Percentage of all evaluations that were judged to be `yes`. |
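To inspect these outputs after a run, read them from the object returned by `mlflow.evaluate()`. The following sketch assumes the per-row results are exposed in the result object's `eval_results` table, which is the behavior of Agent Evaluation at the time of writing:

import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}]

results = mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
)

# Aggregate metrics for the run, including the per-judge rating percentages.
print(results.metrics)

# Per-row judge outputs: ratings, rationales, and error messages.
per_row_results = results.tables["eval_results"]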
Correctness
Definition: Did the agent respond with a factually accurate answer?
Requires ground-truth: Yes, `expected_facts` or `expected_response`.
Correctness compares the agent’s actual response to a ground-truth label and is a good way to detect factual errors.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
Important
Databricks recommends using `expected_facts` instead of `expected_response`. `expected_facts` represent the minimal set of facts required in a correct response and are easier for subject matter experts to curate.
If you must use `expected_response`, it should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, edit the response to remove any text that is not required for an answer to be considered correct.
Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.
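For example, for the Spark question used in the examples below, either of the following ground-truth encodings is acceptably minimal (the values are illustrative):

# Minimal ground truth expressed as expected_facts (recommended).
ground_truth_facts = {
    "expected_facts": [
        "reduceByKey aggregates data before shuffling",
        "groupByKey shuffles all data",
    ],
}

# The same ground truth expressed as expected_response, trimmed to only the
# facts required for the answer to be considered correct.
ground_truth_response = {
    "expected_response": "reduceByKey aggregates data before shuffling; groupByKey shuffles all data.",
}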
Examples
Use correctness from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    "expected_facts": [
        "reduceByKey aggregates data before shuffling",
        "groupByKey shuffles all data",
    ]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["correctness"]
        }
    }
)
Use correctness with the callable judge SDK:
from databricks.agents.evals import judges

assessment = judges.correctness(
    request="What is the difference between reduceByKey and groupByKey in Spark?",
    response="reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    expected_facts=[
        "reduceByKey aggregates data before shuffling",
        "groupByKey shuffles all data",
    ]
)
print(assessment)
What to do when a response is incorrect?
When an agent responds with a factually inaccurate answer, you should:
Check whether any of the context retrieved by the agent is irrelevant or inaccurate. For RAG applications, you can use the Context sufficiency judge to determine whether the retrieved context is sufficient to generate the `expected_facts` or `expected_response`.
If there is sufficient context, adjust the prompt to include the relevant information.
Relevance to query
Definition: Is the response relevant to the input request?
Requires ground-truth: No.
Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
Examples
Use relevance from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris."
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["relevance_to_query"]
        }
    }
)
Use relevance with the callable judge SDK:
from databricks.agents.evals import judges

assessment = judges.relevance_to_query(
    request="What is the capital of France?",
    response="The capital of France is Paris."
)
print(assessment)
Groundedness
Definition: Is the response factually consistent with the retrieved context?
Requires ground-truth: No.
Groundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
`retrieved_context[].content` if you do not use the `model` argument in the call to `mlflow.evaluate()`.
Examples
Use groundedness from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "retrieved_context": [
        {"content": "Paris is the capital city of France."}
    ]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["groundedness"]
        }
    }
)
Use groundedness with the callable judge SDK:
from databricks.agents.evals import judges

assessment = judges.groundedness(
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    retrieved_context=[
        {"content": "Paris is the capital city of France."}
    ]
)
print(assessment)
What to do when the response lacks groundedness?
When the response is not grounded:
Review the retrieved context to ensure it includes the necessary information to generate the expected response.
If the context is insufficient, improve the retrieval mechanism or dataset to include relevant documents.
Modify the prompt to instruct the model to prioritize using the retrieved context when generating responses.
Guideline adherence
Definition: Does the response adhere to the provided guidelines?
Requires ground-truth: No when using `global_guidelines`. Yes when using per-row `guidelines`.
Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.
Guidelines can be defined:
per-row: The response of a specific request must adhere to guidelines defined on that evaluation row.
globally: All responses for any request must adhere to global guidelines.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
per-row `guidelines` or `global_guidelines` defined in the config.
Examples
Use per-row guideline adherence from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "guidelines": ["The response must be in English", "The response must be concise"]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["guideline_adherence"]
        }
    }
)
Use global guideline adherence from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["guideline_adherence"],
            "global_guidelines": ["The response must be in English", "The response must be concise"]
        }
    }
)
Use guideline adherence with the callable judge SDK:
from databricks.agents.evals import judges

assessment = judges.guideline_adherence(
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    guidelines=["The response must be in English", "The response must be concise"]
)
print(assessment)
What to do when the response does not adhere to guidelines?
When the response violates the guidelines:
Identify which guideline was violated and analyze why the agent failed to adhere to it.
Adjust the prompt to emphasize adherence to specific guidelines or retrain the model with additional examples that align with the desired behavior.
For global guidelines, ensure they are specified correctly in the evaluator configuration.
Safety
Definition: Does the response avoid harmful or toxic content?
Requires ground-truth: No.
Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
Examples
Use safety from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris."
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["safety"]
        }
    }
)
Use safety with the callable judge SDK:
from databricks.agents.evals import judges

assessment = judges.safety(
    request="What is the capital of France?",
    response="The capital of France is Paris."
)
print(assessment)
What to do when the response is unsafe?
When the response includes harmful content:
Analyze the request to identify if it might inadvertently lead to unsafe responses. Modify the input if necessary.
Refine the model or prompt to explicitly avoid generating harmful or toxic content.
Employ additional safety mechanisms, such as content filters, to intercept unsafe responses before they reach the user.
Context sufficiency
Definition: Are the retrieved documents sufficient to produce the expected response?
Requires ground-truth: Yes, `expected_facts` or `expected_response`.
Context sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.
Required inputs
The input evaluation set must have the following columns:
`request`
`response` if you have not specified the `model` parameter to `mlflow.evaluate()`.
`retrieved_context[].content` if you have not specified the `model` parameter to `mlflow.evaluate()`.
Examples
Use context sufficiency from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "retrieved_context": [
        {"content": "Paris is the capital city of France."}
    ],
    "expected_facts": [
        "Paris"
    ]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["context_sufficiency"]
        }
    }
)
Use context sufficiency with the callable judge SDK:
from databricks.agents.evals import judges

# This judge requires ground truth (expected_facts or expected_response).
assessment = judges.context_sufficiency(
    request="What is the capital of France?",
    expected_facts=["Paris"],
    retrieved_context=[
        {"content": "Paris is the capital city of France."}
    ]
)
print(assessment)
Chunk relevance
Definition: Are the retrieved chunks relevant to the input request?
Requires ground-truth: No.
Chunk relevance measures whether each chunk is relevant to the input request.
Required inputs
The input evaluation set must have the following columns:
`request`
`retrieved_context[].content` if you have not specified the `model` parameter to `mlflow.evaluate()`.

If you do not use the `model` argument in the call to `mlflow.evaluate()`, you must also provide either `retrieved_context[].content` or `trace`.
Examples
Use chunk relevance precision from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "retrieved_context": [
        {"content": "Paris is the capital of France."},
        {"content": "France is a country in Europe."}
    ]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["chunk_relevance_precision"]
        }
    }
)
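If your installed version of the `databricks-agents` SDK exposes a callable judge for chunk relevance (this mirrors the pattern of the other judges and should be verified against your SDK version), the call would look like the following sketch:

from databricks.agents.evals import judges

# Assumes judges.chunk_relevance is available; it returns one assessment per retrieved chunk.
assessments = judges.chunk_relevance(
    request="What is the capital of France?",
    retrieved_context=[
        {"content": "Paris is the capital of France."},
        {"content": "France is a country in Europe."}
    ]
)
print(assessments)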
Document recall
Definition: How many of the known relevant documents did the retriever find?
Requires ground-truth: Yes, `expected_retrieved_context[].doc_uri`.
Document recall measures the proportion of ground truth relevant documents that were retrieved compared to the total number of relevant documents in ground truth.
Required inputs
The input evaluation set must have the following column:
`expected_retrieved_context[].doc_uri`

In addition, if you do not use the `model` argument in the call to `mlflow.evaluate()`, you must also provide either `retrieved_context[].doc_uri` or `trace`.
Examples
Use document recall from an evaluation set:
import mlflow

eval_set = [{
    "request": "What is the capital of France?",
    "expected_retrieved_context": [
        {"doc_uri": "doc_123"},
        {"doc_uri": "doc_456"}
    ],
    "retrieved_context": [
        {"doc_uri": "doc_123"}
    ]
}]

mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
    evaluator_config={
        "databricks-agent": {
            "metrics": ["document_recall"]
        }
    }
)
There is no callable judge SDK for this metric as it does not use an AI judge.
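Because document recall is simple set arithmetic rather than an LLM assessment, you can reproduce it directly. The sketch below is not part of the Agent Evaluation API; it computes recall as the fraction of expected `doc_uri` values that appear in the retrieved context. For the example above, one of the two expected documents was retrieved, so recall is 0.5.

def document_recall(expected_doc_uris, retrieved_doc_uris):
    """Fraction of ground-truth relevant documents that the retriever returned."""
    expected = set(expected_doc_uris)
    if not expected:
        return None  # Undefined when there is no ground truth.
    return len(expected & set(retrieved_doc_uris)) / len(expected)

# Matches the example above: doc_123 was retrieved, doc_456 was not.
print(document_recall({"doc_123", "doc_456"}, {"doc_123"}))  # 0.5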
Custom judges
You can create a custom judge to perform assessments specific to your use case. For details, see Create custom LLM judges.
The output produced by a custom judge depends on its `assessment_type`, `ANSWER` or `RETRIEVAL`.
Custom LLM judge for ANSWER assessment
A custom LLM judge for ANSWER assessment evaluates the response for each question.
Outputs provided for each assessment:
| Data field | Type | Description |
|---|---|---|
| `response/llm_judged/{assessment_name}/rating` | `string` | `yes` or `no`. |
| `response/llm_judged/{assessment_name}/rationale` | `string` | LLM's written reasoning for `yes` or `no`. |
| `response/llm_judged/{assessment_name}/error_message` | `string` | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
| Metric name | Type | Description |
|---|---|---|
| `response/llm_judged/{assessment_name}/rating/percentage` | `float, [0, 1]` | Across all questions, percentage where `{assessment_name}` is judged as `yes`. |
Custom LLM judge for RETRIEVAL assessment
A custom LLM judge for RETRIEVAL assessment evaluates each retrieved chunk across all questions.
Outputs provided for each assessment:
| Data field | Type | Description |
|---|---|---|
| `retrieval/llm_judged/{assessment_name}/ratings` | `array[string]` | Evaluation of the custom judge for each chunk, `yes` or `no`. |
| `retrieval/llm_judged/{assessment_name}/rationales` | `array[string]` | For each chunk, the LLM's written reasoning for `yes` or `no`. |
| `retrieval/llm_judged/{assessment_name}/error_messages` | `array[string]` | For each chunk, if there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |
| `retrieval/llm_judged/{assessment_name}/precision` | `float, [0, 1]` | Percentage of all retrieved chunks that the custom judge evaluated as `yes`. |
Metrics reported for the entire evaluation set:
| Metric name | Type | Description |
|---|---|---|
| `retrieval/llm_judged/{assessment_name}/precision/average` | `float, [0, 1]` | Average value of `{assessment_name}/precision` across all questions. |