
LLM Evaluation with MLflow example

This notebook demonstrates how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as perplexity and toxicity, as well as LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.

For details about how to use mlflow.evaluate(), refer to Evaluate LLMs with MLflow (AWS|Azure).

Requirements

To use the MLflow LLM evaluation feature, you must use MLflow 2.8.0 or above.

If your cluster is running Databricks Runtime, uncomment and run the following cell to install the mlflow library. This step is required only for Databricks Runtime clusters; if your cluster runs Databricks Runtime ML, skip ahead to the Set OpenAI Key step.
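The install cell itself is not included in this export. A minimal sketch of what it typically contains is shown below; the package list and version pin are assumptions, not the exact packages installed by the original notebook.

# Uncomment and run on Databricks Runtime (non-ML) clusters only.
# The extra packages (evaluate, torch, transformers, textstat) are assumed to be
# needed by the built-in toxicity and readability metrics.
# %pip install --upgrade "mlflow>=2.8.0" openai evaluate torch transformers textstat
# dbutils.library.restartPython()  # restart Python so the new libraries are picked up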

Import the required libraries.
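The import cell is omitted from this export; the libraries used throughout the notebook are roughly the following.

import openai
import pandas as pd

import mlflow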

Set OpenAI Key
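The key-setting cell is not shown here. One common pattern on Databricks is to read the key from a secret scope and expose it as the OPENAI_API_KEY environment variable; the scope and key names below are placeholders, not values from the original notebook.

import os

# Placeholder secret scope and key names -- replace with your own.
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get(scope="my-scope", key="openai-api-key")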

Basic Question-Answering Evaluation

Create a test case with inputs, which are passed into the model, and ground_truth, which is compared against the output generated by the model.
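A sketch of such a dataset follows; the questions and reference answers are placeholders, not the exact rows used in the original notebook.

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Apache Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle.",
            "Apache Spark is an open-source, distributed computing system for large-scale data processing.",
        ],
    }
)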

Create a simple OpenAI model that asks GPT-3.5 to answer the question in two sentences. Call mlflow.evaluate() with the model and the evaluation DataFrame.
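The code cell is not included in this export. A sketch of what it roughly looks like, assuming the pre-1.0 openai SDK that the mlflow.openai flavor supported at the time:

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Log a lightweight OpenAI model whose user message is templated with the question.
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    # model_type="question-answering" enables the built-in QA metrics
    # (toxicity, readability grades, exact_match) shown in the output below.
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
    )
print(results.metrics)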

2023/10/27 00:56:56 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:56:56 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
{'toxicity/v1/mean': 0.00020573455913108774, 'toxicity/v1/variance': 3.4433758978645428e-09, 'toxicity/v1/p90': 0.00027067282790085303, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 15.149999999999999, 'flesch_kincaid_grade_level/v1/variance': 26.502499999999998, 'flesch_kincaid_grade_level/v1/p90': 20.85, 'ari_grade_level/v1/mean': 17.375, 'ari_grade_level/v1/variance': 42.92187499999999, 'ari_grade_level/v1/p90': 24.48, 'exact_match/v1': 0.0}

Inspect the evaluation results table as a DataFrame to see row-by-row metrics and further assess model performance.
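A sketch of how to pull the per-row results out of the returned EvaluationResult; "eval_results_table" is the default name MLflow uses for this artifact.

eval_table = results.tables["eval_results_table"]
display(eval_table)  # Databricks display(); use print(eval_table) elsewhere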


LLM-judged correctness with OpenAI GPT-4

Construct an answer similarity metric using the answer_similarity() metric factory function.
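The construction cell is not shown; the following is a sketch that assumes GPT-4 is used as the judge, with the few-shot example taken from the metric details printed below.

from mlflow.metrics.genai import EvaluationExample, answer_similarity

similarity_example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine learning workflows, "
        "including experiment tracking, model packaging, versioning, and deployment, "
        "simplifying the ML lifecycle."
    ),
    score=4,
    justification=(
        "The definition effectively explains what MLflow is, its purpose, and its developer. "
        "It could be more concise for a 5-score."
    ),
    grading_context={
        "ground_truth": (
            "MLflow is an open-source platform for managing the end-to-end machine "
            "learning (ML) lifecycle."
        )
    },
)

answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[similarity_example])
print(answer_similarity_metric)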

    EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details= Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output. Your task is to determine a numerical score called answer_similarity based on the input and output. A definition of answer_similarity and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score. Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task. Input: {input} Output: {output} {grading_context_columns} Metric definition: Answer similarity is evaluated on the degree of semantic similarity of the provided output to the provided targets, which is the ground truth. Scores can be assigned based on the gradual similarity in meaning and description to the provided targets, where a higher score indicates greater alignment between the provided output and provided targets. Grading rubric: Answer similarity: Below are the details for different scores: - Score 1: the output has little to no semantic similarity to the provided targets. - Score 2: the output displays partial semantic similarity to the provided targets on some aspects. - Score 3: the output has moderate semantic similarity to the provided targets. - Score 4: the output aligns with the provided targets in most aspects and has substantial semantic similarity. - Score 5: the output closely aligns with the provided targets in all significant aspects. Examples: Input: What is MLflow? Output: MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment, simplifying the ML lifecycle. Additional information used by the model: key: ground_truth value: MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models. score: 4 justification: The definition effectively explains what MLflow is its purpose, and its developer. It could be more concise for a 5-score. You must return the following fields in your response one below the other: score: Your numerical score for the model's answer_similarity based on the rubric justification: Your step-by-step reasoning about the model's answer_similarity score )

Call mlflow.evaluate() again, but with your new answer_similarity_metric.
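A sketch of that call; extra_metrics is the mlflow.evaluate() argument for passing additional metrics on top of the built-in ones.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity_metric],
    )
print(results.metrics)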

2023/10/27 00:57:07 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:07 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_similarity
    {'toxicity/v1/mean': 0.00023413174494635314, 'toxicity/v1/variance': 4.211776498455113e-09, 'toxicity/v1/p90': 0.00029628578631673007, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 14.774999999999999, 'flesch_kincaid_grade_level/v1/variance': 21.546875000000004, 'flesch_kincaid_grade_level/v1/p90': 19.71, 'ari_grade_level/v1/mean': 17.0, 'ari_grade_level/v1/variance': 41.005, 'ari_grade_level/v1/p90': 23.92, 'exact_match/v1': 0.0, 'answer_similarity/v1/mean': 3.75, 'answer_similarity/v1/variance': 1.1875, 'answer_similarity/v1/p90': 4.7}

See the row-by-row LLM-judged answer similarity scores and justifications by inspecting the evaluation results table again, as shown earlier.


Custom LLM-judged metric for professionalism

Create a custom metric to assess the professionalism of the model outputs. Use make_genai_metric with a metric definition, grading prompt, grading example, and judge model configuration.
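The cell is omitted from this export; the sketch below assumes GPT-4 as the judge, with the definition, rubric, and example abridged from the metric details printed below.

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning "
        "projects. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as "
        "'like', and exclamation points, which make it sound less professional."
    ),
)

professionalism = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of "
        "communication that is tailored to the context and audience."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the "
        "details for different scores: "
        "- Score 1: Language is extremely casual and may include slang; not suitable for "
        "professional contexts. "
        "- Score 2: Language is casual but generally respectful. "
        "- Score 3: Language is balanced; suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal and respectful. "
        "- Score 5: Language is excessively formal; appropriate for the most formal settings."
    ),
    examples=[professionalism_example],
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
print(professionalism)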

      EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details= Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output. Your task is to determine a numerical score called professionalism based on the input and output. A definition of professionalism and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score. Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task. Input: {input} Output: {output} {grading_context_columns} Metric definition: Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language Grading rubric: Professionalism: If the answer is written using a professional tone, below are the details for different scores: - Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. - Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. - Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. Examples: Input: What is MLflow? Output: MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning! score: 2 justification: The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. You must return the following fields in your response one below the other: score: Your numerical score for the model's professionalism based on the rubric justification: Your step-by-step reasoning about the model's professionalism score )

Call mlflow.evaluate() with your new professionalism metric.
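Again a sketch of that call, following the same pattern as the earlier evaluations.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )
print(results.metrics)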

2023/10/27 00:57:20 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:20 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
      {'toxicity/v1/mean': 0.0002044261127593927, 'toxicity/v1/variance': 1.8580601275034412e-09, 'toxicity/v1/p90': 0.00025343164161313326, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 13.649999999999999, 'flesch_kincaid_grade_level/v1/variance': 33.927499999999995, 'flesch_kincaid_grade_level/v1/p90': 19.92, 'ari_grade_level/v1/mean': 16.25, 'ari_grade_level/v1/variance': 51.927499999999995, 'ari_grade_level/v1/p90': 23.900000000000002, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


Let's see if we can improve basic_qa_model by creating a new model that could perform better by changing the system prompt.
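A sketch of that cell; the more formal system prompt shown here is a placeholder rather than the exact wording used in the original notebook.

with mlflow.start_run() as run:
    professional_system_prompt = (
        "Answer the following question using extreme formality."  # placeholder prompt
    )
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": professional_system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )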

Call mlflow.evaluate() using the new model. Observe that the professionalism score has increased!
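And the corresponding evaluation call, sketched the same way as the earlier ones.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )
print(results.metrics)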

/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:18: UserWarning: Distutils was imported before Setuptools, but importing Setuptools also replaces the `distutils` module in `sys.modules`. This may lead to undesirable behaviors or errors. To avoid these issues, avoid using distutils directly, ensure that setuptools is installed in the traditional way (e.g. not an editable install), and/or make sure that setuptools is always imported before distutils.
  warnings.warn(
/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
2023/10/27 00:57:30 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:30 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
        {'toxicity/v1/mean': 0.00030383203556993976, 'toxicity/v1/variance': 9.482036560896618e-09, 'toxicity/v1/p90': 0.0003866828687023372, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 17.625, 'flesch_kincaid_grade_level/v1/variance': 2.9068750000000003, 'flesch_kincaid_grade_level/v1/p90': 19.54, 'ari_grade_level/v1/mean': 21.425, 'ari_grade_level/v1/variance': 3.6168750000000007, 'ari_grade_level/v1/p90': 23.6, 'professionalism/v1/mean': 4.5, 'professionalism/v1/variance': 0.25, 'professionalism/v1/p90': 5.0}
