Get feedback about the quality of an agentic application

Preview

This feature is in Public Preview.

This article shows you how to use the Databricks review app to gather feedback about the quality of your agentic application from human reviewers.

Mosaic AI Agent Evaluation enables developers to quickly and reliably evaluate the quality, cost, and latency of their generative AI applications. Agent Evaluation capabilities are unified across the development, staging, and production phases of the LLMOps life cycle.

Agent Evaluation is part of the Mosaic AI Agent Framework offering, which is designed to help developers deploy high-quality generative AI applications. High-quality apps are those whose output is evaluated to be accurate, safe, and governed.

What happens in a human evaluation?

The review app enables you to collect feedback on your application from your expert stakeholders. This feedback helps ensure the quality and safety of the answers the application provides.

There are three ways to collect feedback using the review app. Expert stakeholders can:

  • Chat with the application bot and provide feedback on those conversations.

  • Provide feedback on historical logs from other users.

  • Provide feedback on any curated traces and agent outputs.

In the Databricks review app, the LLM is staged in an environment where expert stakeholders can interact with it: they can hold a conversation with it, ask questions, and so on.

Requirements

To use the review app for human evaluation of an agentic application, you need to have the following set up:

  • Inference tables must be enabled on the endpoint that is serving the agent. This allows the review app to collect and record data about the agentic application.

  • Each human reviewer must have access to the workspace that hosts the review app. See the next section, Set up permissions to the review app workspace.

Set up permissions to the review app workspace

If your reviewers already have access to the workspace containing the review app, you don’t need to do anything.

If reviewers don’t have access already, account admins can use account-level SCIM provisioning to sync users and groups automatically from your identity provider to your Databricks account. You can also manually register these users and groups as you set up identities in Databricks. This allows them to be included as eligible reviewers. See Sync users and groups from your identity provider.
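
Then use `set_permissions` to grant each reviewer the `CAN_QUERY` permission level on the agent so they can use the review app: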


  from databricks.agents import set_permissions
  from databricks.agents.entities import PermissionLevel

  # Grant the listed users the CAN_QUERY permission level on the agent's
  # Unity Catalog model so they can chat with it in the review app.
  set_permissions(model_fqn, ["user.name@databricks.com"], PermissionLevel.CAN_QUERY)

If you are a new Public Preview customer and have trouble giving reviewers access to the review app, reach out to your Databricks account team to enable this feature.

Provide instructions to reviewers

Write custom instruction text to display to reviewers, and submit it as shown in the following code example:

  from databricks.agents import set_review_instructions, get_review_instructions

  # Set the instruction text shown to reviewers in the review app, then read it
  # back to confirm the update.
  set_review_instructions(uc_model_name, "Thank you for testing the bot. Use your domain expertise to evaluate and give feedback on the bot's responses, ensuring it aligns with the needs and expectations of users like yourself.")
  get_review_instructions(uc_model_name)

Screenshot of the review app showing the instructions specified in the Python example.

Overview of the review app UI

The basic workflow for an expert evaluation in the review app:

  1. Open the review app URL provided.

  2. Review prepopulated chats.

    Screenshot showing the number and status of prepopulated chats in the review app.
  3. Chat with the bot and submit evaluations of its answers.

    Screenshot of the chat interface where reviewers submit evaluations of the bot's answers.

Options for running an evaluation with stakeholders

Experts chat with the review app

To use this option, call `deploy_model(…)` and set the correct permissions. The following diagram shows how this option works; a minimal setup sketch follows the diagram.

Diagram: experts chat with the agentic application in the review app and provide feedback. Callouts:

  1. The expert stakeholder chats with the agentic application.

  2. The stakeholder provides feedback on the response.

  3. Application request/response.

  4. Application request/response + trace + feedback.
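
A minimal sketch of that setup, assuming the agent is already registered in Unity Catalog and that `deploy_model` accepts the Unity Catalog model name and a version number (the import path mirrors the other examples in this article; check the exact signature in your version of the `databricks.agents` package):

  from databricks.agents import deploy_model, set_permissions
  from databricks.agents.entities import PermissionLevel

  # Deploy the registered agent so the review app and the inference tables on
  # its serving endpoint become available. The (model name, version) signature
  # and the version number 1 are assumptions; adjust for your model version.
  deploy_model(model_fqn, 1)

  # Give each reviewer the CAN_QUERY permission level so they can chat with the
  # agent in the review app.
  set_permissions(model_fqn, ["user.name@databricks.com"], PermissionLevel.CAN_QUERY)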

Experts review logs

To use this option, first deploy your agentic application using `deploy_model(…)`. After users interact with it through either the REST API or the review app, you can load the resulting traces back into the review app using the following code.


  from databricks.agents import enable_trace_reviews

  # Load the traces for the specified request IDs into the review app so that
  # expert stakeholders can review them and give feedback.
  enable_trace_reviews(
    model_name=model_fqn,
    request_ids=[
        "52ee973e-0689-4db1-bd05-90d60f94e79f",
        "1b203587-7333-4721-b0d5-bba161e4643a",
        "e68451f4-8e7b-4bfc-998e-4bda66992809",
    ],
  )

Use values from the `request_id` column of the request logs table.
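
For example, a hypothetical sketch of collecting request IDs to curate, assuming the request logs table is at the placeholder location `catalog.schema.agent_request_logs` (substitute the inference table created for your serving endpoint):

  # Placeholder table name; replace with your agent's request logs table.
  rows = spark.sql(
      "SELECT request_id FROM catalog.schema.agent_request_logs LIMIT 10"
  ).collect()
  request_ids = [row["request_id"] for row in rows]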

Diagram: run a trace review in which reviewers interact with either the review app or the REST API to provide feedback. Callouts:

  1. `enable_trace_reviews([request_id])` is called.

  2. Chats are loaded into the review app.

  3. The expert stakeholder chats with the application.

  4. Feedback on the response.

  5. Requests from front-end app usage or review app usage.

  6. Application request/response.

  7. Application request/response + trace + feedback.

Run evaluation on the request logs table

The following notebook illustrates how to use the logs from the review app as input to an evaluation run with `mlflow.evaluate()`.

Run evaluation on request logs notebook

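As a minimal sketch of such a run, assuming the review app logs have already been prepared as a pandas DataFrame named `eval_df` with `request` and `response` columns (the notebook shows the full preparation steps):

  import mlflow

  # `eval_df` is a placeholder DataFrame prepared from the request logs table
  # (see the notebook above). Agent Evaluation's built-in LLM judges score each
  # request/response pair.
  results = mlflow.evaluate(
      data=eval_df,
      model_type="databricks-agent",
  )

  # Per-row judge results are returned as tables on the result object.
  display(results.tables["eval_results"])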