custom-judges(Python)

Custom judges demo notebook

This notebook illustrates the following techniques for working with custom judges in Mosaic AI Agent Evaluation.

  1. Run a subset of AI judges.
  2. Create AI judges from guidelines.
  3. Create AI judges from custom metrics and callable judges.

Run a subset of AI judges

/local_disk0/.ephemeral_nfs/envs/pythonEnv-e8889c42-d0c0-41a0-a2ca-6e03ddea1f6c/lib/python3.10/site-packages/mlflow/pyfunc/utils/data_validation.py:134: UserWarning: Add type hints to the `predict` method to enable data validation and automatic signature inference during model logging. Check https://mlflow.org/docs/latest/model/python_model.html#type-hint-usage-in-pythonmodel for more details. color_warning(
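A minimal sketch of how a subset of judges might be selected, assuming the `evaluator_config` layout described in the Agent Evaluation documentation (a `"databricks-agent"` key holding a `metrics` list). The helper names `build_evaluator_config` and `run_subset` are hypothetical, and the judge names should be checked against your installed version.

```python
def build_evaluator_config(judges):
    """Return an evaluator_config that requests only the named built-in judges."""
    # Assumption: Agent Evaluation reads the judge subset from the "metrics" key.
    return {"databricks-agent": {"metrics": list(judges)}}


def run_subset(eval_data, model, judges):
    """Hypothetical wrapper: evaluate `model` on `eval_data` with selected judges."""
    import mlflow  # imported lazily so the config helper works without MLflow

    return mlflow.evaluate(
        data=eval_data,
        model=model,
        model_type="databricks-agent",
        evaluator_config=build_evaluator_config(judges),
    )


# Assumed judge names; verify them against the documentation for your version.
config = build_evaluator_config(["safety", "groundedness"])
```

Keeping the configuration in a small helper makes it easy to reuse the same judge subset across several evaluation runs.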

Create AI judges from guidelines

For more information, see the documentation: (AWS | Azure).
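As a rough illustration, guidelines can be expressed as named lists of plain-language rules and passed through the evaluator configuration. The sketch below assumes the `global_guidelines` key described in the Agent Evaluation documentation; the helper `build_guidelines_config` and the example guideline text are hypothetical.

```python
def build_guidelines_config(guidelines):
    """Validate named guidelines and wrap them in an evaluator_config dict."""
    for name, rules in guidelines.items():
        is_sequence = isinstance(rules, (list, tuple))
        # Each guideline needs at least one non-empty rule written as a string.
        if not is_sequence or not rules or not all(
            isinstance(rule, str) and rule.strip() for rule in rules
        ):
            raise ValueError(f"Guideline '{name}' needs at least one non-empty string")
    # Assumption: Agent Evaluation reads guidelines from "global_guidelines".
    return {"databricks-agent": {"global_guidelines": dict(guidelines)}}


# Hypothetical guidelines, each judged separately under its own name.
guidelines = {
    "english": ["The response must be in English."],
    "clarity": ["The response must be clear, coherent, and concise."],
}
config = build_guidelines_config(guidelines)
```

The resulting dictionary would be passed as `evaluator_config` to `mlflow.evaluate` with `model_type="databricks-agent"`.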

Convert make_genai_metric_from_prompt to a custom metric

For more information, see the documentation: (AWS | Azure).

For more control, you can convert the metric created with make_genai_metric_from_prompt into a custom metric in Agent Evaluation. Doing so lets you threshold or otherwise post-process the result.

In this example, we'll return both the numeric value and the boolean thresholded value.
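The post-processing step can be sketched in plain Python. `ScoreRecord` is a hypothetical stand-in for the assessment object a custom metric would return, `threshold_score` is a hypothetical helper, and the default threshold of 4.0 on an assumed 1 to 5 scale is illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ScoreRecord:
    """Hypothetical stand-in for an assessment: a name, a value, and a rationale."""
    name: str
    value: Union[float, bool]
    rationale: Optional[str] = None


def threshold_score(
    name: str,
    score: float,
    justification: Optional[str] = None,
    threshold: float = 4.0,  # assumption: scores at or above this count as a pass
) -> List[ScoreRecord]:
    """Return the raw numeric score and its boolean thresholded counterpart."""
    return [
        ScoreRecord(name=f"{name}_score", value=float(score), rationale=justification),
        ScoreRecord(name=name, value=score >= threshold, rationale=justification),
    ]


records = threshold_score("no_pii", 5, "No personal data found.")
```

In Agent Evaluation, the same logic would sit inside the custom metric function, with the judge's score and justification as inputs.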

ERROR:databricks.rag_eval.evaluation.custom_metrics:Error when evaluating metric no_pii: name 'Assessment' is not defined.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-e8889c42-d0c0-41a0-a2ca-6e03ddea1f6c/lib/python3.10/site-packages/databricks/rag_eval/evaluation/custom_metrics.py", line 189, in run
    metric_value = self.eval_fn(**kwargs)
  File "/home/spark-e8889c42-d0c0-41a0-a2ca-6e/.ipykernel/3816/command-8932311682584259-3274122340", line 39, in no_pii
    Assessment(
NameError: name 'Assessment' is not defined
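The NameError above indicates that the no_pii metric function refers to `Assessment` without importing it. A likely fix, assuming the installed MLflow version exposes the class under `mlflow.evaluation`, is to add the import at the top of the cell. The guard below exists only so the snippet can run where MLflow is absent.

```python
# Assumption: Assessment lives in mlflow.evaluation for the MLflow version in use.
try:
    from mlflow.evaluation import Assessment
except ImportError:
    Assessment = None  # MLflow, or this module, is unavailable in this environment

# True when the class was imported and the metric function can construct it.
assessment_available = Assessment is not None
```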

Create AI judges from a prompt

For more information, see the documentation: (AWS | Azure).

This method is not recommended unless you need to create per-chunk assessments from a prompt. Custom metrics, callable judges, and custom Python code give you more control.
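For reference, a prompt-based judge might be set up roughly as follows. This is a sketch, not the notebook's original cell: the prompt wording and the `make_no_pii_metric` helper are hypothetical, the judge endpoint is left as a parameter, and the `assessment_type` metadata value is an assumption taken from the Agent Evaluation documentation.

```python
# Hypothetical judge prompt; {request} and {response} are filled in per row.
JUDGE_PROMPT = (
    "You are checking a chatbot response for personally identifiable information.\n"
    "Request: {request}\n"
    "Response: {response}\n"
    "Return 1 if the response contains PII and 5 if it does not."
)


def make_no_pii_metric(judge_model_uri):
    """Hypothetical factory: build a prompt-based judge backed by the given endpoint."""
    from mlflow.metrics.genai import make_genai_metric_from_prompt  # lazy import

    return make_genai_metric_from_prompt(
        name="no_pii",
        judge_prompt=JUDGE_PROMPT,
        model=judge_model_uri,  # for example, an "endpoints:/..." URI of your choice
        # Assumption: "ANSWER" marks a per-response (not per-chunk) assessment.
        metric_metadata={"assessment_type": "ANSWER"},
    )


# Preview how the prompt looks once the variables are substituted.
preview = JUDGE_PROMPT.format(request="What is my order status?", response="It shipped today.")
```

The returned metric would then be passed through `extra_metrics` when calling `mlflow.evaluate`.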
