Managed evaluations — subject matter expert (SME) user guide

Preview

This feature is in Public Preview.

This page describes how subject matter experts (SMEs) use the managed evaluations UI, which is designed to help SMEs do the following:

  • Create a set of input questions that test different aspects of the AI agent’s functionality.

  • Provide information to help the AI judge evaluate the AI agent’s responses to those questions.

For more information about Mosaic AI Agent Evaluation and the AI judges it provides, see What is Mosaic AI Agent Evaluation? and Use agent metrics & LLM judges to evaluate app performance.

Create questions

The first step is to create a set of questions to test the AI agent. These questions form the basis of an evaluation set; they are saved so that the developer can use them for ongoing testing of the AI agent.
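Behind the scenes, each question you create becomes a row in the saved evaluation set. The following is a minimal sketch of what such a row might contain; the field names (`request_id`, `request`, `expected_response`, `grading_notes`, `tags`) are illustrative assumptions for this guide, not the product’s documented schema.

```python
# A minimal sketch of a saved evaluation-set row. Field names are
# illustrative assumptions, not the documented Agent Evaluation schema.
evaluation_row = {
    "request_id": "q-001",        # unique identifier for the question
    "request": "How do I grant a user access to a table?",  # the SME's question
    "expected_response": None,    # filled in later, in reference answer mode
    "grading_notes": None,        # filled in later, in grading notes mode
    "tags": ["permissions"],      # optional tags used to organize questions
}
```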

When you click the link to the app, a screen similar to the following appears:

[Screenshot: opening screen with a field to enter a question]

From this screen, you can directly enter questions, or have the app generate questions automatically.

Directly enter a question

  1. Type your question in the box and press Enter.

    [Screenshot: question entered in the box]
  2. A new page opens, showing the question, the AI agent’s response, and a field for you to provide feedback about the response or additional information. The fields that appear on the right side of the screen depend on the mode specified by the developer. For details about the possible modes, see Assess AI responses.

    The screenshot shows an example response page in reference answer mode.

    [Screenshot: the AI app’s response to the question, with a field for feedback]
  3. Enter your feedback on the right side of the screen. For more details, see Assess AI responses.

  4. When you are done, do one of the following:

    • To return to the home page, click the home icon.

    • To continue to the next question if there is one, click the right-pointing arrow at the top of the page.

      [Screenshot: navigation arrows on the assessment page]

Automatically generate questions

  1. On the app’s home page, click the blue Generate questions button. The app selects a page randomly from the information that was used to train the AI agent. A new page opens, showing the selected page and several suggested questions based on the information presented on that page.

    [Screenshot: automatically generated questions]
  2. To save a proposed question, click Save to the right of the question. You can also directly edit a proposed question, or click Add question to add your own.

  3. When you are done, click Next document to have the app select another page and generate more questions, or click the home icon to return to the home page.

Tag questions

You can use tags to organize questions.

  1. On the app’s home page, click the Tags tab.

  2. Click the plus sign to create a new tag.

  3. In the dialog, enter a name for the tag and click Create. The new tag appears in the list.

  4. To rename or delete an existing tag, click the kebab menu to the right of the tag.

    [Screenshot: menu with options to rename or delete a tag]
  5. To apply or remove a tag, go to the individual question page and click Add tags. In the drop-down list that appears, click the tag’s name to toggle its status.

    [Screenshot: drop-down list of tags]

Assess AI responses

After creating a set of questions, the next step is to evaluate the AI agent’s responses to those questions. Evaluating the responses is an iterative process. The steps you follow depend on which mode the developer specified. The available modes are the following (a sketch of how each mode captures your input appears after this list):

  • Feedback mode. Mark each response from the AI as “thumbs-up” or “thumbs-down”.

  • Reference answer mode. Provide a reference answer to each question. The AI judge uses this answer as a basis on which to evaluate the answer generated by the AI.

  • Grading notes mode. Provide a set of guidelines that identifies a correct answer. The AI judge checks the generated response to make sure it meets the guidelines you specify.
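As a rough illustration of how the three modes differ in the input they collect, the sketch below shows one hypothetical evaluation row per mode. The field names are assumptions made for this guide, not a documented schema.

```python
# Hypothetical sketches of the SME input each mode collects.
# Field names are assumptions, not a documented schema.

# Feedback mode: a binary thumbs-up / thumbs-down signal on the agent's response.
feedback_row = {
    "request": "How do I grant a user access to a table?",
    "response": "<the AI agent's answer>",
    "sme_feedback": "thumbs_up",
}

# Reference answer mode: the SME writes the correct answer; the judge
# compares the agent's response against it.
reference_row = {
    "request": "How do I grant a user access to a table?",
    "expected_response": "Use a GRANT statement to give the user SELECT on the table.",
}

# Grading notes mode: the SME writes guidelines; the judge checks the
# agent's response against them.
grading_notes_row = {
    "request": "How do I grant a user access to a table?",
    "grading_notes": "The answer must mention the GRANT statement. "
                     "The answer must not suggest sharing admin credentials.",
}
```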

Feedback mode

In feedback mode, your task is to respond either Yes or No to indicate whether the AI agent’s response is correct. No additional interaction is possible.

[Screenshot: feedback mode UI]

Grading notes mode

In grading notes mode, after you review the AI agent’s response, you provide information that the AI judge uses to evaluate the agent’s performance.

  1. Type your input into the Grading notes box on the right side of the screen. For important guidelines on how to provide information to the judge, see Tips for providing grading notes.

  2. Click Ask AI Judge or press Enter.

The judge uses the information you entered to assess the response. It labels the response Correct or Incorrect and displays its rationale. A response labeled Incorrect still provides important information to the developer. If you and the AI judge agree that the response is incorrect, your only task is to enter the best possible grading notes. If you do not agree with the label the AI judge assigns, see If you disagree with the AI judge’s evaluation.

[Screenshot: the judge’s rationale in grading notes mode]
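To make the judge’s output concrete, here is a hedged sketch of the kind of record an assessment might produce for the developer. The structure and field names are assumptions for illustration only, not a documented format.

```python
# Hypothetical sketch of a judge assessment record in grading notes mode.
# The structure and field names are assumptions, not a documented format.
judge_assessment = {
    "request": "How do I grant a user access to a table?",
    "grading_notes": "The answer must mention the GRANT statement.",
    "judge_label": "Incorrect",
    "judge_rationale": "The response describes UI steps but never mentions "
                       "the GRANT statement required by the grading notes.",
    "sme_agrees": True,  # recorded as False if you give the rationale a thumbs-down
}
```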

Tips for providing grading notes

In grading notes mode, your task is to write guidelines that the judge will use to evaluate the AI agent’s responses. Write these notes in direct, unambiguous language; a complete example appears after the patterns below.

To specify facts that must be included for a response to be correct, use “must”, as follows:

  • “The answer must mention Unity Catalog.”

To indicate that a fact should never be included in a correct answer, use “must not”, as follows:

  • “The answer must not mention Unity Catalog.”

To indicate that a fact is correct but not required for an answer to be considered correct, use “optionally”, as follows:

  • “The answer can optionally mention Unity Catalog.”
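Putting these patterns together, a complete grading note for a single question might look like the following. The specific facts are invented for illustration.

```
The answer must mention the GRANT statement.
The answer must state that SELECT is the privilege that allows reading a table.
The answer can optionally mention that privileges can also be granted in the UI.
The answer must not suggest sharing admin credentials.
```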

Reference answer mode

In reference answer mode, after you review the AI agent’s response, you provide information that the AI judge uses to evaluate the agent’s performance.

  1. Type your input into the Reference answer box on the right side of the screen. For important guidelines on how to provide information to the judge, see Tips for providing reference answers.

  2. Click Ask AI Judge or press Enter.

The judge uses the information you entered to assess the response. It labels the response Correct or Incorrect and displays its rationale. A response labeled Incorrect still provides important information to the developer. If you and the AI judge agree that the response is incorrect, your only task is to enter the best possible reference answer. If you do not agree with the label the AI judge assigns, see If you disagree with the AI judge’s evaluation.

[Screenshot: the judge’s rationale in reference answer mode]

Tips for providing reference answers

In reference answer mode, your task is to write the correct answer to the question. The judge compares the AI agent’s response to the reference answer that you provide.

Important

A good reference answer should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, be sure to edit the response to remove any text that is not required for an answer to be considered correct.

Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.
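For example (the question and answers below are invented for illustration), compare a response copied verbatim from a source document with a pared-down reference answer:

```
Question: "Which privilege allows a user to read a table?"

Copied from a source document (includes unneeded detail):
  "The SELECT privilege allows a user to read a table. Privileges can be
   granted in the UI or with SQL GRANT statements, and they are inherited
   down the object hierarchy."

Minimal reference answer (edited down to the required fact):
  "The SELECT privilege allows a user to read a table."
```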

If you disagree with the AI judge’s evaluation

If the AI judge marks a response as correct or incorrect and you do not agree, the first step is to edit your reference answer or grading notes to try to guide the judge to an accurate assessment.

If you cannot get the judge to agree with your assessment, provide the best reference answer or grading notes that you can, and then click No in the AI Judge Rationale field. This is helpful information for the developer.

[Screenshot: thumbs-down button]