Step 3. Curate an Evaluation Set from stakeholder feedback

[Figure: workflow with the evaluation set step highlighted]

See the GitHub repository for the sample code in this section.

Expected time: 10 to 60 minutes. The time depends on the quality of the responses provided by your stakeholders. If the responses are messy or contain many irrelevant queries, you will need to spend more time filtering and cleaning the data.

Overview and expected outcome

This step bootstraps an evaluation set from the feedback that stakeholders provided through the Review App. You can bootstrap an evaluation set from questions alone, so you can follow this step even if your stakeholders only chatted with the app and did not provide feedback.

For the schema of the Agent Evaluation evaluation set, see Evaluation set schema. The fields in this schema are referenced in the rest of this section.

At the end of this step, you will have an Evaluation Set that contains the following:

  • Requests with a 👍:

    • request: as entered by the user.

    • expected_response: the response as edited by the user. If the user did not edit the response, the response generated by the model is used.

  • Requests with a 👎:

    • request: as entered by the user.

    • expected_response: the response as edited by the user. If the user did not edit the response, expected_response is null.

  • Requests with no feedback (no 👍 or 👎):

    • request: as entered by the user.

For all requests, if the user selects 👍 for a chunk in retrieved_context, the doc_uri of that chunk is included in expected_retrieved_context for that request.
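
The mapping rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code in the 04_create_evaluation_set notebook: the function name build_eval_row, its parameters, the "up"/"down" rating values, and the shape of chunk_feedback are all assumptions made for this example. The output field names follow the evaluation set schema fields named in this section, and the list-of-structs shape for expected_retrieved_context is likewise assumed.

```python
from typing import Any, Optional


def build_eval_row(
    request: str,
    model_response: str,
    rating: Optional[str] = None,  # hypothetical values: "up", "down", or None
    edited_response: Optional[str] = None,
    chunk_feedback: Optional[list] = None,  # hypothetical: [{"doc_uri": ..., "rating": ...}]
) -> dict:
    """Turn one piece of stakeholder feedback into one evaluation set row."""
    row: dict = {"request": request}

    if rating == "up":
        # Thumbs up: prefer the user's edit, else fall back to the model's response.
        row["expected_response"] = (
            edited_response if edited_response is not None else model_response
        )
    elif rating == "down":
        # Thumbs down: only a user edit counts; otherwise the value stays null.
        row["expected_response"] = edited_response
    # No rating: the row carries only the request.

    # For every request, keep the doc_uri of each chunk the user marked thumbs up.
    liked_uris = [
        chunk["doc_uri"]
        for chunk in (chunk_feedback or [])
        if chunk.get("rating") == "up"
    ]
    if liked_uris:
        row["expected_retrieved_context"] = [{"doc_uri": uri} for uri in liked_uris]

    return row
```

Calling build_eval_row("What is RAG?", "model answer", rating="down") under these assumptions yields a row whose expected_response is None, which matches the 👎 rule for unedited responses.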

Important

Databricks recommends that your evaluation set contain at least 30 questions to get started. Read the evaluation set deep dive to learn more about what a “good” evaluation set is.

Requirements

  • Stakeholders have used your POC and provided feedback.

  • All requirements from previous steps.

Instructions

  1. Open the 04_create_evaluation_set notebook and click Run all.

  2. Inspect the evaluation set to understand the data it contains. Validate that it contains a representative and challenging set of questions, and adjust it as required.

  3. By default, your evaluation set is saved to the Delta table configured in EVALUATION_SET_FQN in the 00_global_config notebook.
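
When inspecting the evaluation set, a few quick checks can help. The sketch below is not part of the sample notebooks; summarize_eval_set and MIN_QUESTIONS are hypothetical names, and the rows are assumed to be plain dictionaries with the request and expected_response fields described in this section. It counts rows, flags duplicate requests, and compares the number of unique requests against the recommended minimum of 30.

```python
from collections import Counter

# Recommended starting size from this section; the constant name is illustrative.
MIN_QUESTIONS = 30


def summarize_eval_set(rows: list) -> dict:
    """Return simple sanity-check statistics for a list of evaluation set rows."""
    # Normalize surrounding whitespace so trivially different copies count as duplicates.
    requests = [row["request"].strip() for row in rows]
    counts = Counter(requests)

    duplicates = sorted(text for text, n in counts.items() if n > 1)
    # Rows whose expected_response is missing, null, or empty are not counted here.
    with_expected = sum(1 for row in rows if row.get("expected_response"))

    return {
        "num_rows": len(rows),
        "num_unique_requests": len(counts),
        "duplicate_requests": duplicates,
        "num_with_expected_response": with_expected,
        "meets_minimum": len(counts) >= MIN_QUESTIONS,
    }
```

In a Databricks notebook, the rows could presumably be obtained with something like [r.asDict() for r in spark.table(EVALUATION_SET_FQN).collect()], assuming EVALUATION_SET_FQN has been loaded from the 00_global_config notebook.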

Next step

Now that you have an evaluation set, use it to evaluate the POC app’s quality, cost, and latency. See Step 4. Evaluate the POC’s quality.