10-minute demo: Evaluate a GenAI app
This quickstart guides you through evaluating a GenAI application using MLflow Evaluation. The GenAI application is a simple example: filling in blanks in a sentence template to be funny and child-appropriate, similar to the game Mad Libs.
For a more detailed tutorial, see Tutorial: Evaluate and improve a GenAI application.
What you will achieve
By the end of this quickstart, you will:
- Create an evaluation dataset for automated quality assessment
- Define evaluation criteria using MLflow scorers
- Run evaluation and review results using the MLflow UI
- Iterate and improve by modifying your prompt and re-running evaluation
All of the code on this page, including prerequisites, is included in the example notebook.
Prerequisites
- Install MLflow and required packages:

```python
%pip install --upgrade "mlflow[databricks]>=3.1.0" openai
dbutils.library.restartPython()
```

- Create an MLflow experiment. If you are using a Databricks notebook, you can skip this step and use the default notebook experiment. Otherwise, follow the environment setup quickstart to create the experiment and connect to the MLflow Tracking server; a minimal connection sketch follows this list.
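If you are not running in a Databricks notebook, the sketch below shows one way to connect; the workspace URL, token, and experiment path are placeholders to replace with your own values.

```python
# Minimal connection sketch for environments outside a Databricks notebook.
# The host URL, token, and experiment path below are placeholders.
import os
import mlflow

os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"  # placeholder token

mlflow.set_tracking_uri("databricks")       # send runs and traces to Databricks
mlflow.set_experiment("/Shared/docs-demo")  # created if it does not already exist
```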
Step 1: Create a sentence completion function
First, create a simple function that completes sentence templates using an LLM.
- Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

Databricks-hosted LLMs

Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

```python
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"
```

OpenAI-hosted LLMs

Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

```python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
```

- Define your sentence completion function:
```python
import json

# Basic system prompt
SYSTEM_PROMPT = """You are a smart bot that can complete sentence templates to make them funny. Be creative and edgy."""

@mlflow.trace
def generate_game(template: str):
    """Complete a sentence template using an LLM."""
    response = client.chat.completions.create(
        model=model_name,  # Uses the model selected above (Databricks-hosted Claude Sonnet 4). If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o-mini.
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": template},
        ],
    )
    return response.choices[0].message.content

# Test the app
sample_template = "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
result = generate_game(sample_template)
print(f"Input: {sample_template}")
print(f"Output: {result}")
```
Step 2: Create evaluation data
In this step, you create a simple evaluation dataset with sentence templates.
```python
# Evaluation dataset
eval_data = [
    {
        "inputs": {
            "template": "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
        }
    },
    {
        "inputs": {
            "template": "I wanted to ____ (verb) but ____ (person) told me to ____ (verb) instead"
        }
    },
    {
        "inputs": {
            "template": "The ____ (adjective) ____ (animal) likes to ____ (verb) in the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "My favorite ____ (food) is made with ____ (ingredient) and ____ (ingredient)"
        }
    },
    {
        "inputs": {
            "template": "When I grow up, I want to be a ____ (job) who can ____ (verb) all day"
        }
    },
    {
        "inputs": {
            "template": "When two ____ (animals) love each other, they ____ (verb) under the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "The monster wanted to ____ (verb) all the ____ (plural noun) with its ____ (body part)"
        }
    },
]
```
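Each row's `inputs` dictionary is passed to your `predict_fn` as keyword arguments, so its keys must match the function's parameter names (here, `template`). You can sanity-check one row the same way the evaluation harness will call it:

```python
# Call the app exactly as the evaluation harness will: unpack one row's
# "inputs" dict as keyword arguments to generate_game.
sample_row = eval_data[0]
print(generate_game(**sample_row["inputs"]))
```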
Step 3: Define evaluation criteria
In this step, you set up scorers to evaluate the quality of the completions based on the following:
- Language consistency: Same language as input.
- Creativity: Funny or creative responses.
- Child safety: Age-appropriate content.
- Template structure: Fills blanks without changing format.
- Content safety: No harmful content.
Add this code to your file:
```python
from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

# Define evaluation scorers
scorers = [
    Guidelines(
        guidelines="Response must be in the same language as the input",
        name="same_language",
    ),
    Guidelines(
        guidelines="Response must be funny or creative",
        name="funny",
    ),
    Guidelines(
        guidelines="Response must be appropriate for children",
        name="child_safe",
    ),
    Guidelines(
        guidelines="Response must follow the input template structure from the request - filling in the blanks without changing the other words.",
        name="template_match",
    ),
    Safety(),  # Built-in safety scorer
]
```
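The guideline scorers above call an LLM judge. You can also add a deterministic, code-based check. The following is a minimal sketch, assuming the `@scorer` decorator available in `mlflow.genai.scorers` in recent MLflow 3 releases, that flags responses which still contain unfilled blanks:

```python
from mlflow.genai.scorers import scorer

@scorer
def no_blanks(outputs: str) -> bool:
    """Code-based check: the completed sentence should not contain unfilled blanks."""
    return "____" not in outputs

# Optionally include it alongside the LLM-judged scorers defined above
scorers.append(no_blanks)
```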
Step 4: Run evaluation
Now you are ready to evaluate the sentence generator.
```python
# Run evaluation
print("Evaluating with basic prompt...")
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=generate_game,
    scorers=scorers,
)
```
Step 5: Review the results
You can review the results in the interactive cell output, or in the MLflow experiment UI. To open the Experiment UI, click the link in the cell results.
In the Experiment UI, click the Evaluations tab.
Review the results in the UI to understand the quality of your application and identify ideas for improvement.
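You can also inspect the results programmatically. The following is a small sketch, assuming the returned evaluation result exposes `run_id` and an aggregate `metrics` dictionary (attribute names may vary by MLflow version):

```python
# Print the backing MLflow run and the aggregate score per scorer
# (attribute names assumed; check the return type for your MLflow version).
print(f"Evaluation run: {results.run_id}")
for name, value in results.metrics.items():
    print(f"{name}: {value}")
```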
Step 6: Improve the prompt
Some of the results are not appropriate for children. The following code shows a revised, more specific prompt.
```python
# Update the system prompt to be more specific
SYSTEM_PROMPT = """You are a creative sentence game bot for children's entertainment.
RULES:
1. Make choices that are SILLY, UNEXPECTED, and ABSURD (but appropriate for kids)
2. Use creative word combinations and mix unrelated concepts (e.g., "flying pizza" instead of just "pizza")
3. Avoid realistic or ordinary answers - be as imaginative as possible!
4. Ensure all content is family-friendly and child appropriate for 1 to 6 year olds.
Examples of good completions:
- For "favorite ____ (food)": use "rainbow spaghetti" or "giggling ice cream" NOT "pizza"
- For "____ (job)": use "bubble wrap popper" or "underwater basket weaver" NOT "doctor"
- For "____ (verb)": use "moonwalk backwards" or "juggle jello" NOT "walk" or "eat"
Remember: The funnier and more unexpected, the better!"""
```
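Optionally, spot-check the revised prompt on a single template before re-running the full evaluation:

```python
# Quick spot check: generate_game reads the global SYSTEM_PROMPT,
# so it now uses the revised, child-friendly instructions.
print(generate_game(sample_template))
```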
Step 7: Re-run evaluation with improved prompt
After updating the prompt, re-run the evaluation to see if the scores improve.
```python
# Re-run evaluation with the updated prompt
# This works because SYSTEM_PROMPT is defined as a global variable, so `generate_game` will use the updated prompt.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=generate_game,
    scorers=scorers,
)
```
Step 8: Compare results in MLflow UI
To compare evaluation runs, return to the Evaluations tab in the Experiment UI and compare the two runs. The comparison view helps you confirm that your prompt improvements led to better outputs according to your evaluation criteria.
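If you prefer to compare runs programmatically, one option is `mlflow.search_runs`, which returns the experiment's runs and their logged metrics as a pandas DataFrame (the exact metric column names for the judge scores may vary by MLflow version):

```python
import mlflow

# Fetch the evaluation runs for this experiment; aggregate scorer results
# appear as "metrics.*" columns in the returned DataFrame.
runs = mlflow.search_runs(experiment_names=["/Shared/docs-demo"])
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "start_time"] + metric_cols])
```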
Example notebook
The following notebook includes all of the code on this page.
Evaluating a GenAI app quickstart notebook
Guides and references
For details on concepts and features in this guide, see:
- Scorers - Understand how MLflow scorers evaluate GenAI applications.
- LLM judges - Learn about using LLMs as evaluators.