Configure advanced evaluation for agents

Preview

This feature is in Public Preview.

This article describes and demonstrates how to configure the following advanced evaluation techniques for agentic applications:

  • Configure customer-defined LLM judges.

  • Provide few-shot examples to LLM judges.

  • Evaluate applications using only a subset of LLM judges.

Customer-defined LLM judges

The following are common use cases where customer-defined judges might be useful:

  • Evaluate your application against criteria that are specific to your business use case. For example:

    • Assess whether your application produces responses that align with your corporate tone of voice.

    • Determine whether your application's responses always follow a specific format.

  • Test and iterate on guardrails. You can use the guardrail's prompt in a customer-defined judge and iterate toward a prompt that works well. You can then implement the guardrail and use the LLM judge to evaluate how often the guardrail is or isn't working.

Databricks refers to these use cases as assessments. There are two types of customer-defined LLM assessments:

  • Answer assessment

    • What does it assess? The LLM judge is called once for each generated answer. For example, if you had 5 questions with corresponding answers, the judge would be called 5 times (once per answer).

    • How is the score reported? For each answer, a yes or no is reported based on your criteria. The yes outputs are aggregated to a percentage for the entire evaluation set.

  • Retrieval assessment

    • What does it assess? The assessment is performed for each retrieved chunk (if the application performs retrieval). For each question, the LLM judge is called once for each chunk that was retrieved for that question. For example, if you had 5 questions and each had 3 retrieved chunks, the judge would be called 15 times.

    • How is the score reported? For each chunk, a yes or no is reported based on your criteria. For each question, the percentage of yes chunks is reported as a precision. The per-question precision is aggregated to an average precision for the entire evaluation set.
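To make the aggregation concrete, the following illustrative snippet (not part of the evaluation API) shows how per-chunk yes/no ratings roll up into a per-question precision and an overall average for a retrieval assessment:

# Illustration only: how per-chunk "yes"/"no" ratings become per-question
# precision and an average precision across the evaluation set.
chunk_ratings_per_question = [
    ["yes", "no", "yes"],   # question 1: 3 retrieved chunks, 2 judged "yes"
    ["yes", "yes", "yes"],  # question 2: all chunks judged "yes"
]

per_question_precision = [
    ratings.count("yes") / len(ratings) for ratings in chunk_ratings_per_question
]
average_precision = sum(per_question_precision) / len(per_question_precision)

print(per_question_precision)  # [0.666..., 1.0]
print(average_precision)       # 0.833...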

You can configure a customer-defined LLM judge using the following parameters:

  • model

    • Description: The name of the Foundation Model API endpoint that receives requests for this custom judge.

    • Requirements: The endpoint must support the /llm/v1/chat signature.

  • name

    • Description: The name of the assessment, which is also used for the output metrics.

  • judge_prompt

    • Description: The prompt that implements the assessment, with variables enclosed in curly braces. For example, "Here is a definition that uses {request} and {response}".

  • metric_metadata

    • Description: A dictionary that provides additional parameters for the judge. Notably, the dictionary must include an "assessment_type" key with the value "RETRIEVAL" or "ANSWER" to specify the assessment type.

The prompt contains variables that are substituted with the contents of the evaluation set before it is sent to the specified model endpoint to retrieve the response. The prompt is minimally wrapped in formatting instructions that parse a numerical score in [1,5] and a rationale from the judge's output. The parsed score is transformed into yes if it is higher than 3 and no otherwise (see the sample code below for how to use metric_metadata to change the default threshold of 3). The prompt should contain instructions on how to interpret these scores, but it should avoid instructions that specify an output format.
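For example, a judge prompt for a hypothetical tone assessment might explain how to interpret the scale without dictating an output format. The wording below is only a sketch, not a required template:

# Hypothetical judge prompt: it tells the judge how to interpret the 1-5 scale
# but does not specify an output format, since formatting instructions are
# added automatically.
tone_prompt = (
    "Assess whether the response matches our corporate tone of voice. "
    "The response is: '{response}'. "
    "Give a score of 5 if the tone fully matches our guidelines, 3 if it is "
    "neutral, and 1 if it clearly violates the guidelines."
)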

The following variables are supported:

  • request: The Request column of the evaluation data set (same for ANSWER and RETRIEVAL assessments).

  • response: The Response column of the evaluation data set (same for ANSWER and RETRIEVAL assessments).

  • expected_response: The expected_response column of the evaluation data set (same for ANSWER and RETRIEVAL assessments).

  • retrieved_context: For ANSWER assessments, the concatenated contents of the retrieved_context column. For RETRIEVAL assessments, the individual content of each chunk in the retrieved_context column.

The following example uses MLflow's `make_genai_metric_from_prompt` API to define the has_pii and professional assessments. The resulting objects are passed as a list to the extra_metrics argument of mlflow.evaluate during evaluation.


from mlflow.metrics.genai import make_genai_metric_from_prompt

# Define a custom assessment to detect PII in the retrieved chunks. The default threshold of 3 will be used to convert the output numerical
# score to "yes" or "no".

has_pii_prompt = "Your task is to determine whether the retrieved content has any PII information. This was the content: '{retrieved_context}'"

has_pii = make_genai_metric_from_prompt(
    name="has_pii",
    judge_prompt=has_pii_prompt,
    model="endpoints:/ep-gpt-4-turbo-2024-04-09",
    metric_metadata={"assessment_type": "RETRIEVAL"},
)

# Define a custom assessment to determine if the tone of the answer is professional. The numerical threshold for conversion to "yes"/"no"
# is set to 2.

professional_prompt = "Your task is to determine if the response has a professional tone. The response is: '{response}'"

professional = make_genai_metric_from_prompt(
    name="professional",
    judge_prompt=professional_prompt,
    model="endpoints:/ep-gpt-4-turbo-2024-04-09",
    metric_metadata={"assessment_type": "ANSWER", "score_threshold": "2"},
)
# Use the custom judges in evaluation
results = mlflow.evaluate(..., model_type="databricks-agent", extra_metrics=[has_pii, professional])

# Process results from the custom judges
per_question_results_df = results.tables['eval_results']

# Show information about responses that are not professional
per_question_results_df[per_question_results_df["response/llm_judged/professional/rating"] == "no"].display()
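The aggregated scores described earlier (the percentage of yes answers for an answer assessment and the average precision for a retrieval assessment) are also recorded as run metrics. The exact metric key names are derived from the judge names, so a simple way to find them is to print the metrics dictionary:

# List the aggregated metrics computed across the evaluation set, including
# those produced by the custom "has_pii" and "professional" judges.
for metric_name, metric_value in results.metrics.items():
    print(f"{metric_name}: {metric_value}")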

Provide examples to the built-in LLM judges

You can pass domain-specific examples to the built-in judges by providing a few "yes" or "no" examples for each type of assessment. These examples are referred to as few-shot examples and can help the built-in judges align better with domain-specific rating criteria. See Create few-shot examples.

Databricks recommends providing at least one "yes" and one "no" example. The best examples are the following:

  • Examples that the judges previously got wrong, where you provide a correct response as the example.

  • Challenging examples, such as examples that are nuanced or difficult to determine as true or false.

Databricks also recommends that you provide a rationale for the response. This helps improve the judge’s ability to explain its reasoning.

To pass the few-shot examples, create a DataFrame that mirrors the output of mlflow.evaluate() for the corresponding judges. Here is an example for the answer-correctness, groundedness, and chunk-relevance judges:


%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

examples = {
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
        "What is Apache Spark?"
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
        "Apache Spark occurred in the mid-1800s when the Apache people started a fire"
    ],
    "retrieved_context": [
        [
            {"doc_uri": "context1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}
        ],
        [
            {"doc_uri": "context2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use the toPandas() method."}
        ],
        [
            {"doc_uri": "context3.txt", "content": "Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing."}
        ]
    ],
    "expected_response": [
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
        "Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing."
    ],
    "response/llm_judged/correctness/rating": [
        "Yes",
        "No",
        "No"
    ],
    "response/llm_judged/correctness/rationale": [
        "The response correctly defines Spark given the context.",
        "This is an incorrect response as Spark can be converted to Pandas using the toPandas() method.",
        "The response is incorrect and irrelevant."
    ],
    "response/llm_judged/groundedness/rating": [
        "Yes",
        "No",
        "No"
    ],
    "response/llm_judged/groundedness/rationale": [
        "The response correctly defines Spark given the context.",
        "The response is not grounded in the given context.",
        "The response is not grounded in the given context."
    ],
    "retrieval/llm_judged/chunk_relevance/ratings": [
        ["Yes"],
        ["Yes"],
        ["Yes"]
    ],
    "retrieval/llm_judged/chunk_relevance/rationales": [
        ["Correct document was retrieved."],
        ["Correct document was retrieved."],
        ["Correct document was retrieved."]
    ]
}

examples_df = pd.DataFrame(examples)

"""

Include the few-shot examples in the evaluator_config parameter of mlflow.evaluate.


evaluation_results = mlflow.evaluate(
    ...,
    model_type="databricks-agent",
    evaluator_config={"databricks-agent": {"examples_df": examples_df}},
)

Create few-shot examples

The following steps are guidelines for creating a set of effective few-shot examples.

  1. Try to find groups of similar examples that the judge gets wrong (see the sketch after this list for one way to surface candidates).

  2. For each group, pick a single example and adjust the label or justification to reflect the desired behavior. Databricks recommends providing a rationale that explains the rating.

  3. Re-run the evaluation with the new example.

  4. Repeat as needed to target different categories of errors.
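For step 1, one way to surface candidates is to pull the rows from a previous run's results table where a built-in judge rated the response no, and review them for ratings you disagree with. The variable evaluation_results and the exact rating value are assumptions here; the column name follows the output schema mirrored by examples_df above.

# Review answers the built-in correctness judge rated "no" in a previous run;
# rows where you disagree with the judge's rating are good few-shot candidates.
prev_results_df = evaluation_results.tables["eval_results"]
disagreements = prev_results_df[
    prev_results_df["response/llm_judged/correctness/rating"].str.lower() == "no"
]
disagreements.display()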

Note

Too many few-shot examples can negatively impact judge performance. During evaluation, Databricks limits the number of few-shot examples to five, but recommends using fewer, targeted examples for best performance.

Evaluate agents using a subset of LLM judges

By default, evaluation runs all available LLM judges. To run only a subset of the LLM judges, create a custom configuration.

Note

You cannot disable the non-LLM judge metrics for chunk retrieval, chain token counts, or latency.

The following shows configuration options for running either only the LLM judges that don't require ground truth, or no LLM judges at all.

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

# Run only the LLM judges that don't require ground truth
config = {
    "metrics": ["groundedness", "relevance_to_query", "chunk_relevance"]
}

# Run no LLM judges
config = {
    "metrics": []
}

After you define your configuration, you can specify it in the evaluator_config parameter of mlflow.evaluate.


evaluation_results = mlflow.evaluate(
    ...,
    model_type="databricks-agent",
    evaluator_config={"databricks-agent": config},
)