Get feedback about the quality of an agentic application

Preview

This feature is in Public Preview.

This article shows you how to use the Databricks review app to gather feedback from human reviewers about the quality of your agentic application. It covers the following:

  • How to deploy the review app.

  • How reviewers use the app to provide feedback on the agentic application’s responses.

  • How experts can review logged chats to provide suggestions for improvement and other feedback using the app.

What happens in a human evaluation?

The Databricks review app stages the LLM in an environment where expert stakeholders can interact with it: hold a conversation, ask questions, and so on. In this way, the review app lets you collect feedback on your application, helping to ensure the quality and safety of the answers it provides.

Stakeholders can chat with the application bot and provide feedback on those conversations, or provide feedback on historical logs, curated traces, or agent outputs.

Requirements

  • Inference tables must be enabled on the endpoint that is serving the agent.

  • Each human reviewer must have access to the review app workspace. See the next section, Set up permissions to the review app workspace.

  • Developers must install the databricks-agents SDK to set up permissions and configure the review app.

    %pip install databricks-agents
    dbutils.library.restartPython()
    

Set up permissions to the review app workspace

If your reviewers already have access to the workspace that contains the review app, you don’t need to do anything.

If reviewers don’t already have access, account admins can use account-level SCIM provisioning to sync users and groups automatically from your identity provider to your Databricks account. You can also manually register these users and groups as you set up identities in Databricks, which allows them to be included as eligible reviewers. See Sync users and groups from your identity provider.

After reviewers have workspace access, use agents.set_permissions to let them chat with the agent:

from databricks import agents

# Note that <user_list> can specify individual users or groups.
agents.set_permissions(model_name=<model_name>, users=[<user_list>], permission_level=agents.PermissionLevel.CAN_QUERY)

Experts who review chat logs must have CAN_REVIEW permissions.
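
A minimal sketch of granting that permission, reusing the placeholder convention from the previous example:

from databricks import agents

# Grant expert reviewers permission to review logged chats.
# "<model_name>" and "<expert_list>" are placeholders for your model and reviewers.
agents.set_permissions(
    model_name="<model_name>",
    users=["<expert_list>"],
    permission_level=agents.PermissionLevel.CAN_REVIEW,
)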

Deploy the review app

When you deploy an agent using agents.deploy(), the review app is automatically enabled and deployed. The output from the command shows the URL for the review app. For information about deploying an agent, see Deploy an agent for generative AI application.
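
For reference, a minimal deployment call looks like the following; "<uc_model_name>" and the version number are placeholders for your Unity Catalog model:

from databricks import agents

# Deploying the agent automatically enables and deploys the review app.
deployment = agents.deploy(model_name="<uc_model_name>", model_version=1)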

link to review app from notebook command output

If you lose the link to the deployment, you can find it using list_deployments().

from databricks import agents

# List all agent deployments in the workspace.
deployments = agents.list_deployments()
deployments
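
If many agents are deployed, a sketch like the following narrows the list to a single model; "<uc_model_name>" is a placeholder, and it assumes the deployment objects expose a model_name attribute:

# Filter the deployments list for a specific model (attribute name assumed).
my_deployments = [d for d in deployments if d.model_name == "<uc_model_name>"]
my_deployments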

Review app UI

To open the review app, click the provided URL. The review app UI has three tabs in the left sidebar: Instructions, Test the bot, and Chats to review.

When you open the review app, the instructions page appears.

review app opening screen

Provide instructions to reviewers

To provide custom text for the instructions displayed for reviewers, use the following code:

from databricks import agents

agents.set_review_instructions(uc_model_name, "Thank you for testing the bot. Use your domain expertise to evaluate and give feedback on the bot's responses, ensuring it aligns with the needs and expectations of users like yourself.")
agents.get_review_instructions(uc_model_name)

A screenshot of the review app showing the instructions specified in the Python example.

Chat with the app and submit reviews

To chat with the app and submit reviews:

  1. Click Test the bot in the left sidebar.

  2. Type your question in the box and press Return or Enter on your keyboard, or click the arrow in the box. The app displays its answer to your question, and the sources that it used to find the answer.

  3. Review the app’s answer, and select Yes, No, or I don’t know.

  4. The app asks for additional information. Check the appropriate boxes or type your comments into the field provided.

  5. You can also edit the response directly to provide a better answer. To edit the response, click Edit response, make your changes in the dialog, and click Save, as shown in the following video.

    how to edit a response
  6. Click Done to save your feedback.

  7. Continue asking questions to provide additional feedback.

The following diagram illustrates this workflow.

  1. Using the review app, the reviewer chats with the agentic application.

  2. Using the review app, the reviewer provides feedback on the application’s responses.

  3. All requests, responses, and feedback are logged to inference tables.

Run the review app in which experts chat with the agentic application and provide feedback.

Make chat logs available for evaluation by expert reviewers

When a user interacts with the app using the REST API or the review app, all requests, responses, and additional feedback are saved to inference tables. The inference tables are located in the same Unity Catalog catalog and schema where the model was registered and are named <model_name>_payload, <model_name>_payload_assessment_logs, and <model_name>_payload_request_logs. For more information about these tables, including schemas, see Agent-enhanced inference tables.
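
As a quick orientation, the following sketch reads these tables with Spark; "catalog", "schema", and "<model_name>" are placeholders for your own values:

# A minimal sketch, assuming the default table names described above.
payload = spark.table("catalog.schema.<model_name>_payload")
assessment_logs = spark.table("catalog.schema.<model_name>_payload_assessment_logs")
request_logs = spark.table("catalog.schema.<model_name>_payload_request_logs")

display(assessment_logs)  # reviewer feedback collected by the review app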

To load these logs into the review app for evaluation by expert reviewers, you must first find the request_ids of interest and then enable reviews for them, as follows:

  1. Locate the request_ids to be reviewed in the <model_name>_payload_request_logs inference table, for example by querying it as in the following sketch.
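
    A minimal sketch: "catalog" and "schema" are placeholders, and the exact name of the request ID column can vary, so check the table schema in your workspace.

    # Query the request log table and copy the IDs of the rows to review.
    request_log = spark.table("catalog.schema.<model_name>_payload_request_logs")
    display(request_log)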

  2. Use code similar to the following to load the review logs into the review app:

    from databricks import agents
    
    # model_fqn is the fully qualified Unity Catalog model name: catalog.schema.model_name
    agents.enable_trace_reviews(
      model_name=model_fqn,
      request_ids=[
          "52ee973e-0689-4db1-bd05-90d60f94e79f",
          "1b203587-7333-4721-b0d5-bba161e4643a",
          "e68451f4-8e7b-4bfc-998e-4bda66992809",
      ],
    )
    
  3. The result cell includes a link to the review app with the selected logs loaded for review.

Review app with chat logs loaded for expert review

Expert review of logs from other users’ interactions with the app

To review logs from previous chats, the logs must have been enabled for review. See Make chat logs available for evaluation by expert reviewers.

  1. In the left sidebar of the review app, select Chats to review. The enabled requests are displayed.

    chats enabled for review
  2. Click on a request to display it for review.

  3. Review the request and response. The app also shows the sources it used for reference. You can click a source to review it and provide feedback on its relevance.

  4. To provide feedback on the quality of the response, select Yes, No, or I don’t know.

  5. The app asks for additional information. Check the appropriate boxes or type your comments into the field provided.

  6. You can also edit the response directly to provide a better answer. To edit the response, click Edit response, make your changes in the dialog, and click Save. See Chat with the app and submit reviews for a video that shows the process.

  7. Click Done to save your feedback.

The following diagram illustrates this workflow.

  1. Using the review app or a custom app, reviewers chat with the agentic application.

  2. All requests and responses are logged to inference tables.

  3. The application developer uses enable_trace_reviews([request_id]) (where request_id comes from the <model_name>_payload_request_logs inference table) to post chat logs to the review app.

  4. Using the review app, experts review the logs and provide feedback. The expert feedback is logged to inference tables.

Run a trace review in which reviewers interact with either the review app or the REST API to provide feedback.

Use mlflow.evaluate() on the request logs table

The following notebook illustrates how to use the logs from the review app as input to an evaluation run using mlflow.evaluate(). For more information about mlflow.evaluate(), see Evaluate large language models using open-source MLflow.
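
As a rough sketch of that pattern, the following assumes the request logs have already been converted into an evaluation DataFrame with the columns that Mosaic AI Agent Evaluation expects (for example, request and response); see the notebook for the complete workflow:

import mlflow

# A minimal sketch, assuming the request logs have been transformed into a
# pandas DataFrame with the columns Agent Evaluation expects.
eval_df = spark.table("catalog.schema.<model_name>_payload_request_logs").toPandas()

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_df,
        model_type="databricks-agent",  # run the built-in Agent Evaluation judges
    )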

Run evaluation on request logs notebook
