
LLM Evaluation with MLflow example

This notebook demonstrates how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as perplexity and toxicity, as well as LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.

For details about how to use mlflow.evaluate(), refer to Evaluate LLMs with MLflow (AWS|Azure).

Requirements

To use the MLflow LLM evaluation feature, you must use MLflow 2.8.0 or above.

If your cluster is running Databricks Runtime, uncomment and run the following cell to install the mlflow library. This step is required only for Databricks Runtime clusters; if your cluster runs Databricks Runtime ML, skip ahead to the Set OpenAI Key step.
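The install cell itself is not included in this export. A minimal sketch of what it typically contains is shown below; the package list and version pin are assumptions, not the exact packages installed by the original notebook.

# Uncomment and run on Databricks Runtime (non-ML) clusters only.
# The extra packages (evaluate, torch, transformers, textstat) are assumed to be
# needed by the built-in toxicity and readability metrics.
# %pip install --upgrade "mlflow>=2.8.0" openai evaluate torch transformers textstat
# dbutils.library.restartPython()  # restart Python so the new libraries are picked up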

Import the required libraries.
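The import cell is omitted from this export; the libraries used throughout the notebook are roughly the following.

import openai
import pandas as pd

import mlflow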

Set OpenAI Key
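The key-setting cell is not shown here. One common pattern on Databricks is to read the key from a secret scope and expose it as the OPENAI_API_KEY environment variable; the scope and key names below are placeholders, not values from the original notebook.

import os

# Placeholder secret scope and key names -- replace with your own.
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get(scope="my-scope", key="openai-api-key")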

Basic Question-Answering Evaluation

Create a test case with inputs, which are passed into the model, and ground_truth, which is compared against the output generated by the model.
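A sketch of such a dataset follows; the questions and reference answers are placeholders, not the exact rows used in the original notebook.

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Apache Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle.",
            "Apache Spark is an open-source, distributed computing system for large-scale data processing.",
        ],
    }
)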

Create a simple OpenAI model that asks GPT-3.5 to answer the question in two sentences. Call mlflow.evaluate() with the model and the evaluation DataFrame.
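The code cell is not included in this export. A sketch of what it roughly looks like, assuming the pre-1.0 openai SDK that the mlflow.openai flavor supported at the time:

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Log a lightweight OpenAI model whose user message is templated with the question.
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    # model_type="question-answering" enables the built-in QA metrics
    # (toxicity, readability grades, exact_match) shown in the output below.
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
    )
print(results.metrics)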

2023/10/27 00:56:56 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:56:56 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
{'toxicity/v1/mean': 0.00020573455913108774, 'toxicity/v1/variance': 3.4433758978645428e-09, 'toxicity/v1/p90': 0.00027067282790085303, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 15.149999999999999, 'flesch_kincaid_grade_level/v1/variance': 26.502499999999998, 'flesch_kincaid_grade_level/v1/p90': 20.85, 'ari_grade_level/v1/mean': 17.375, 'ari_grade_level/v1/variance': 42.92187499999999, 'ari_grade_level/v1/p90': 24.48, 'exact_match/v1': 0.0}

Inspect the evaluation results table as a DataFrame to see row-by-row metrics and further assess model performance.
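A sketch of how to pull the per-row results out of the returned EvaluationResult; "eval_results_table" is the default name MLflow uses for this artifact.

eval_table = results.tables["eval_results_table"]
display(eval_table)  # Databricks display(); use print(eval_table) elsewhere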


LLM-judged correctness with OpenAI GPT-4

Construct an answer similarity metric using the answer_similarity() metric factory function.
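The construction cell is not shown; the following is a sketch that assumes GPT-4 is used as the judge, with the few-shot example taken from the metric details printed below.

from mlflow.metrics.genai import EvaluationExample, answer_similarity

similarity_example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine learning workflows, "
        "including experiment tracking, model packaging, versioning, and deployment, "
        "simplifying the ML lifecycle."
    ),
    score=4,
    justification=(
        "The definition effectively explains what MLflow is, its purpose, and its developer. "
        "It could be more concise for a 5-score."
    ),
    grading_context={
        "ground_truth": (
            "MLflow is an open-source platform for managing the end-to-end machine "
            "learning (ML) lifecycle."
        )
    },
)

answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[similarity_example])
print(answer_similarity_metric)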

    EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details= Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output. Your task is to determine a numerical score called answer_similarity based on the input and output. A definition of answer_similarity and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score. Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task. Input: {input} Output: {output} {grading_context_columns} Metric definition: Answer similarity is evaluated on the degree of semantic similarity of the provided output to the provided targets, which is the ground truth. Scores can be assigned based on the gradual similarity in meaning and description to the provided targets, where a higher score indicates greater alignment between the provided output and provided targets. Grading rubric: Answer similarity: Below are the details for different scores: - Score 1: the output has little to no semantic similarity to the provided targets. - Score 2: the output displays partial semantic similarity to the provided targets on some aspects. - Score 3: the output has moderate semantic similarity to the provided targets. - Score 4: the output aligns with the provided targets in most aspects and has substantial semantic similarity. - Score 5: the output closely aligns with the provided targets in all significant aspects. Examples: Input: What is MLflow? Output: MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment, simplifying the ML lifecycle. Additional information used by the model: key: ground_truth value: MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models. score: 4 justification: The definition effectively explains what MLflow is its purpose, and its developer. It could be more concise for a 5-score. You must return the following fields in your response one below the other: score: Your numerical score for the model's answer_similarity based on the rubric justification: Your step-by-step reasoning about the model's answer_similarity score )

Call mlflow.evaluate() again, but with your new answer_similarity_metric.
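A sketch of that call; extra_metrics is the mlflow.evaluate() argument for passing additional metrics on top of the built-in ones.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity_metric],
    )
print(results.metrics)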

2023/10/27 00:57:07 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:07 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_similarity
    {'toxicity/v1/mean': 0.00023413174494635314, 'toxicity/v1/variance': 4.211776498455113e-09, 'toxicity/v1/p90': 0.00029628578631673007, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 14.774999999999999, 'flesch_kincaid_grade_level/v1/variance': 21.546875000000004, 'flesch_kincaid_grade_level/v1/p90': 19.71, 'ari_grade_level/v1/mean': 17.0, 'ari_grade_level/v1/variance': 41.005, 'ari_grade_level/v1/p90': 23.92, 'exact_match/v1': 0.0, 'answer_similarity/v1/mean': 3.75, 'answer_similarity/v1/variance': 1.1875, 'answer_similarity/v1/p90': 4.7}

See the row-by-row LLM-judged answer similarity scores and justifications by inspecting the evaluation results table again, as shown earlier.


Custom LLM-judged metric for professionalism

Create a custom metric to assess the professionalism of the model outputs. Use make_genai_metric with a metric definition, grading prompt, grading example, and judge model configuration.
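The cell is omitted from this export; the sketch below assumes GPT-4 as the judge, with the definition, rubric, and example abridged from the metric details printed below.

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning "
        "projects. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as "
        "'like', and exclamation points, which make it sound less professional."
    ),
)

professionalism = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of "
        "communication that is tailored to the context and audience."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the "
        "details for different scores: "
        "- Score 1: Language is extremely casual and may include slang; not suitable for "
        "professional contexts. "
        "- Score 2: Language is casual but generally respectful. "
        "- Score 3: Language is balanced; suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal and respectful. "
        "- Score 5: Language is excessively formal; appropriate for the most formal settings."
    ),
    examples=[professionalism_example],
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
print(professionalism)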

      EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details= Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output. Your task is to determine a numerical score called professionalism based on the input and output. A definition of professionalism and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score. Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task. Input: {input} Output: {output} {grading_context_columns} Metric definition: Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language Grading rubric: Professionalism: If the answer is written using a professional tone, below are the details for different scores: - Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. - Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. - Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. Examples: Input: What is MLflow? Output: MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning! score: 2 justification: The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. You must return the following fields in your response one below the other: score: Your numerical score for the model's professionalism based on the rubric justification: Your step-by-step reasoning about the model's professionalism score )

Call mlflow.evaluate() with your new professionalism metric.
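Again a sketch of that call, following the same pattern as the earlier evaluations.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )
print(results.metrics)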

2023/10/27 00:57:20 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:20 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
      {'toxicity/v1/mean': 0.0002044261127593927, 'toxicity/v1/variance': 1.8580601275034412e-09, 'toxicity/v1/p90': 0.00025343164161313326, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 13.649999999999999, 'flesch_kincaid_grade_level/v1/variance': 33.927499999999995, 'flesch_kincaid_grade_level/v1/p90': 19.92, 'ari_grade_level/v1/mean': 16.25, 'ari_grade_level/v1/variance': 51.927499999999995, 'ari_grade_level/v1/p90': 23.900000000000002, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


Let's see if we can improve basic_qa_model by creating a new model that could perform better by changing the system prompt.
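A sketch of that cell; the more formal system prompt shown here is a placeholder rather than the exact wording used in the original notebook.

with mlflow.start_run() as run:
    professional_system_prompt = (
        "Answer the following question using extreme formality."  # placeholder prompt
    )
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": professional_system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )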

Call mlflow.evaluate() using the new model. Observe that the professionalism score has increased!
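And the corresponding evaluation call, sketched the same way as the earlier ones.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )
print(results.metrics)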

/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:18: UserWarning: Distutils was imported before Setuptools, but importing Setuptools also replaces the `distutils` module in `sys.modules`. This may lead to undesirable behaviors or errors. To avoid these issues, avoid using distutils directly, ensure that setuptools is installed in the traditional way (e.g. not an editable install), and/or make sure that setuptools is always imported before distutils.
  warnings.warn(
/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
2023/10/27 00:57:30 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:30 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
        {'toxicity/v1/mean': 0.00030383203556993976, 'toxicity/v1/variance': 9.482036560896618e-09, 'toxicity/v1/p90': 0.0003866828687023372, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 17.625, 'flesch_kincaid_grade_level/v1/variance': 2.9068750000000003, 'flesch_kincaid_grade_level/v1/p90': 19.54, 'ari_grade_level/v1/mean': 21.425, 'ari_grade_level/v1/variance': 3.6168750000000007, 'ari_grade_level/v1/p90': 23.6, 'professionalism/v1/mean': 4.5, 'professionalism/v1/variance': 0.25, 'professionalism/v1/p90': 5.0}
