
Tutorial: Evaluate and improve a GenAI application

This tutorial shows you how to use evaluation datasets to evaluate quality, identify issues, and iteratively improve a generative AI application.

This guide steps you through evaluating an email generation app that uses Retrieval-Augmented Generation (RAG). The app simulates retrieving customer information from a database and generates personalized follow-up emails based on the retrieved information.

For a shorter introduction to evaluation, see 10-minute demo: Evaluate a GenAI app.

This tutorial includes the following steps:

  • Create evaluation datasets from real usage data.
  • Evaluate quality with MLflow's LLM judges using the evaluation harness.
  • Interpret results to identify quality issues.
  • Improve your app based on evaluation results.
  • Compare versions to verify improvements worked and did not cause regressions.

The tutorial uses traces from a deployed app to create the evaluation dataset, but the same workflow applies no matter how you created your evaluation dataset. For other approaches to creating an evaluation dataset, see Building MLflow evaluation datasets. For information about tracing, see MLflow Tracing - GenAI observability.

Offline monitoring workflow diagram

Prerequisites

  1. Install required packages:

    Python
    %pip install -q --upgrade "mlflow[databricks]>=3.1.0" openai
    dbutils.library.restartPython()
  2. Create an MLflow experiment. If you are using a Databricks notebook, you can skip this step and use the default notebook experiment. Otherwise, follow the environment setup quickstart to create the experiment and connect to the MLflow Tracking server.

  3. To create an evaluation dataset, you must have CREATE TABLE permissions on a schema in Unity Catalog.

note

Running a complex agent can take a long time. To configure parallelization, see (Optional) Configure parallelization.

Step 1: Create your application

The first step is to create the email generation app. The retrieval component is marked with span_type="RETRIEVER" to enable MLflow's retrieval-specific LLM judges.

  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Use databricks-openai to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    Python
    import mlflow
    from databricks_openai import DatabricksOpenAI

    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()

    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")

    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    client = DatabricksOpenAI()

    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
  2. Create the email generation app:

    Python
    from mlflow.entities import Document
    from typing import List, Dict

    # Simulated customer relationship management database
    CRM_DATA = {
        "Acme Corp": {
            "contact_name": "Alice Chen",
            "recent_meeting": "Product demo on Monday, very interested in enterprise features. They asked about: advanced analytics, real-time dashboards, API integrations, custom reporting, multi-user support, SSO authentication, data export capabilities, and pricing for 500+ users",
            "support_tickets": ["Ticket #123: API latency issue (resolved last week)", "Ticket #124: Feature request for bulk import", "Ticket #125: Question about GDPR compliance"],
            "account_manager": "Sarah Johnson"
        },
        "TechStart": {
            "contact_name": "Bob Martinez",
            "recent_meeting": "Initial sales call last Thursday, requested pricing",
            "support_tickets": ["Ticket #456: Login issues (open - critical)", "Ticket #457: Performance degradation reported", "Ticket #458: Integration failing with their CRM"],
            "account_manager": "Mike Thompson"
        },
        "Global Retail": {
            "contact_name": "Carol Wang",
            "recent_meeting": "Quarterly review yesterday, happy with platform performance",
            "support_tickets": [],
            "account_manager": "Sarah Johnson"
        }
    }

    # Use a retriever span to enable MLflow's predefined RetrievalGroundedness judge
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_customer_info(customer_name: str) -> List[Document]:
        """Retrieve customer information from the CRM database."""
        if customer_name in CRM_DATA:
            data = CRM_DATA[customer_name]
            return [
                Document(
                    id=f"{customer_name}_meeting",
                    page_content=f"Recent meeting: {data['recent_meeting']}",
                    metadata={"type": "meeting_notes"}
                ),
                Document(
                    id=f"{customer_name}_tickets",
                    page_content=f"Support tickets: {', '.join(data['support_tickets']) if data['support_tickets'] else 'No open tickets'}",
                    metadata={"type": "support_status"}
                ),
                Document(
                    id=f"{customer_name}_contact",
                    page_content=f"Contact: {data['contact_name']}, Account Manager: {data['account_manager']}",
                    metadata={"type": "contact_info"}
                )
            ]
        return []

    @mlflow.trace
    def generate_sales_email(customer_name: str, user_instructions: str) -> Dict[str, str]:
        """Generate a personalized sales email based on customer data and a sales rep's instructions."""
        # Retrieve customer information
        customer_docs = retrieve_customer_info(customer_name)

        # Combine retrieved context
        context = "\n".join([doc.page_content for doc in customer_docs])

        # Generate email using retrieved context
        prompt = f"""You are a sales representative. Based on the customer information below,
    write a brief follow-up email that addresses their request.

    Customer Information:
    {context}

    User instructions: {user_instructions}

    Keep the email concise and personalized."""

        response = client.chat.completions.create(
            # This example uses a Databricks-hosted LLM. You can replace it with any
            # AI Gateway or Model Serving endpoint, or, if you provide your own OpenAI
            # credentials, with a valid OpenAI model such as gpt-4o.
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful sales assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=2000
        )

        return {"email": response.choices[0].message.content}

    # Test the application
    result = generate_sales_email("Acme Corp", "Follow up after product demo")
    print(result["email"])

Evaluation app trace

Step 2: Simulate production traffic

This step simulates traffic for demonstration purposes. In practice, you would use traces from actual usage to create your evaluation dataset.

Python
# Simulate beta testing traffic with scenarios designed to fail guidelines
test_requests = [
    {"customer_name": "Acme Corp", "user_instructions": "Follow up after product demo"},
    {"customer_name": "TechStart", "user_instructions": "Check on support ticket status"},
    {"customer_name": "Global Retail", "user_instructions": "Send quarterly review summary"},
    {"customer_name": "Acme Corp", "user_instructions": "Write a very detailed email explaining all our product features, pricing tiers, implementation timeline, and support options"},
    {"customer_name": "TechStart", "user_instructions": "Send an enthusiastic thank you for their business!"},
    {"customer_name": "Global Retail", "user_instructions": "Send a follow-up email"},
    {"customer_name": "Acme Corp", "user_instructions": "Just check in to see how things are going"},
]

# Run requests and capture traces
print("Simulating production traffic...")
for req in test_requests:
    try:
        result = generate_sales_email(**req)
        print(f"✓ Generated email for {req['customer_name']}")
    except Exception as e:
        print(f"✗ Error for {req['customer_name']}: {e}")

Step 3: Create evaluation dataset

In this step you save the traces to an evaluation dataset. Storing the traces in an evaluation dataset allows you to link evaluation results to the dataset so you can track changes to the dataset over time and see all evaluation results generated using this dataset.

  1. Click Experiments in the sidebar to display the Experiments page.

  2. Click on the name of your experiment to open it.

    Open experiment

  3. In the left sidebar, click Traces.

  4. Use the checkboxes on the left side of the trace list to select the traces you want to add. To select all traces on the current page, click the checkbox next to Trace ID in the column header.

    Select traces

  5. Click Actions. The button label shows the number of selected traces, for example Actions (3).

    Actions menu

  6. Under Use for evaluation, select Add to evaluation dataset. The Add traces to evaluation dataset dialog opens.

  7. If no evaluation datasets exist for this experiment, or if you want to add traces to a new dataset, follow these steps to create a new evaluation dataset:

    1. Click Create new dataset.
    2. Select the Unity Catalog schema to hold the new dataset.
    3. Enter a name for the dataset and click Create Dataset.
    4. Click Export and then click Done.

    Add traces dialog if no evaluation datasets exist

    If evaluation datasets already exist for the experiment, click Export to the right of the dataset you want to add the traces to. You can export to more than one dataset. When you've finished exporting, click Done.

    Add traces dialog with existing evaluation datasets

Step 4: Run evaluation with LLM judges

In this step, you use MLflow's built-in LLM judges to automatically evaluate different aspects of the GenAI app's quality. To learn more, see LLM judges and code-based scorers.

Python
from mlflow.genai.scorers import (
    RetrievalGroundedness,
    RelevanceToQuery,
    Safety,
    Guidelines,
)

# Load the evaluation dataset created in Step 3. Replace the placeholder with
# the Unity Catalog table name you chose in the dialog.
eval_dataset = mlflow.genai.datasets.get_dataset(
    uc_table_name="catalog.schema.email_eval_dataset"
)

# Save the LLM judges as a variable so you can re-use them in step 7
email_judges = [
    RetrievalGroundedness(),  # Checks if email content is grounded in retrieved data
    Guidelines(
        name="follows_instructions",
        guidelines="The generated email must follow the user_instructions in the request.",
    ),
    Guidelines(
        name="concise_communication",
        guidelines="The email MUST be concise and to the point. The email should communicate the key message efficiently without being overly brief or losing important context.",
    ),
    Guidelines(
        name="mentions_contact_name",
        guidelines="The email MUST explicitly mention the customer contact's first name (e.g., Alice, Bob, Carol) in the greeting. Generic greetings like 'Hello' or 'Dear Customer' are not acceptable.",
    ),
    Guidelines(
        name="professional_tone",
        guidelines="The email must be in a professional tone.",
    ),
    Guidelines(
        name="includes_next_steps",
        guidelines="The email MUST end with a specific, actionable next step that includes a concrete timeline.",
    ),
    RelevanceToQuery(),  # Checks if email addresses the user's request
    Safety(),  # Checks for harmful or inappropriate content
]

# Run evaluation with LLM judges
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=generate_sales_email,
    scorers=email_judges,
)

Step 5: View and interpret results

Running mlflow.genai.evaluate() creates an evaluation run. For details, see Evaluation runs.

An evaluation run is like a test report that captures everything about how your app performed on a specific dataset. The evaluation run contains a trace for each row in your evaluation dataset annotated with feedback from each judge.

Using the evaluation run, you can view aggregate metrics and investigate test cases where your app performed poorly.
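The aggregate metrics are essentially pass rates over the per-trace Pass/Fail feedback from each judge. As a rough, self-contained illustration of that aggregation (the records below are hypothetical and do not reflect MLflow's actual result schema):

```python
from collections import defaultdict

# Hypothetical per-trace judge feedback, mirroring the Pass/Fail labels in the UI
assessments = [
    {"trace_id": "t1", "judge": "follows_instructions", "value": "Pass"},
    {"trace_id": "t1", "judge": "concise_communication", "value": "Fail"},
    {"trace_id": "t2", "judge": "follows_instructions", "value": "Fail"},
    {"trace_id": "t2", "judge": "concise_communication", "value": "Fail"},
]

def pass_rates(records):
    """Compute per-judge pass rates from Pass/Fail feedback records."""
    totals, passes = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["judge"]] += 1
        passes[rec["judge"]] += rec["value"] == "Pass"
    return {judge: passes[judge] / totals[judge] for judge in totals}

print(pass_rates(assessments))
# {'follows_instructions': 0.5, 'concise_communication': 0.0}
```

A low pass rate on a judge such as `follows_instructions` points you at the traces to inspect first.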

This evaluation shows several issues:

  • Poor instruction following - The agent frequently provides responses that don't match user requests, such as sending detailed product information when asked for simple check-ins, or providing support ticket updates when asked for enthusiastic thank-you messages.
  • Lack of conciseness - Most emails are unnecessarily long and include excessive details that dilute the key message, failing to communicate efficiently despite instructions to keep emails "concise and personalized".
  • Missing concrete next steps - The majority of emails fail to end with specific, actionable next steps that include concrete timelines, which was identified as a required element.

Assessment summary

  1. Click Experiments in the sidebar to display the Experiments page.

  2. Click on the name of your experiment to open it.

  3. In the left sidebar, click Evaluation runs. The right pane shows a table of traces.

    Evaluation runs table

    If you do not see the Assessments with their Pass and Fail labels, scroll to the right or hover over the pane separator and click the left-pointing arrow.

    Expand table

  4. To see the rationale for the Pass or Fail label, hover over the label.

    Hover over label to show rationale

Details and add feedback

To see more details for each trace:

  1. Click on the request identifier in the Request column. A window appears showing the full trace, including inputs and outputs for each step.

    Request details window

  2. At the right, you can add Feedback or Expectations to the response for this request. If you do not see the Assessments pane, click the Assessments button. To add a new assessment, scroll down and click Add new assessment.

  3. You can use the arrows at either side of this window to step through the requests.

    Step through requests using arrows

Step 6: Create an improved version

Use the evaluation results to create an improved version that addresses the identified issues.

When creating an improved version, focus on targeted changes based on evaluation results. Common improvement strategies include:

  • Prompt engineering: Refine system prompts to address specific failure patterns, add explicit guidelines for edge cases, include examples demonstrating correct handling, or adjust tone or style.
  • Guardrails: Implement validation steps in application logic and add post-processing to check outputs before presenting to users.
  • Retrieval improvements (for RAG apps): Enhance retrieval mechanisms if relevant documents aren't being found by examining retrieval spans, improving embedding models, or refining chunking strategies.
  • Reasoning enhancements: Break complex tasks into multiple spans, implement chain-of-thought techniques, or add verification steps for critical outputs.
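As an illustration of the guardrails strategy, a lightweight post-processing check can catch some of the failures the judges flagged before an email ever reaches a user. A minimal sketch (the `check_email` helper and the word limit are illustrative choices, not part of MLflow):

```python
import re

MAX_WORDS = 150  # illustrative conciseness threshold

def check_email(email: str, contact_first_name: str) -> list:
    """Return a list of guardrail violations for a generated email."""
    violations = []
    # Conciseness: flag emails that run long
    if len(email.split()) > MAX_WORDS:
        violations.append("too_long")
    # Personalization: require the contact's first name somewhere in the email
    if not re.search(rf"\b{re.escape(contact_first_name)}\b", email):
        violations.append("missing_contact_name")
    return violations

email = "Hello Alice, thanks for the demo. I'll send pricing by Friday."
print(check_email(email, "Alice"))  # []
print(check_email(email, "Bob"))    # ['missing_contact_name']
```

An app could regenerate or flag any email that fails these checks instead of returning it directly.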

The code below shows prompt engineering improvements based on the evaluation results:

Python
@mlflow.trace
def generate_sales_email_v2(customer_name: str, user_instructions: str) -> Dict[str, str]:
    """Generate a personalized sales email based on customer data and a sales rep's instructions."""
    # Retrieve customer information (retrieve_customer_info is defined in Step 1)
    customer_docs = retrieve_customer_info(customer_name)

    if not customer_docs:
        return {"error": f"No customer data found for {customer_name}"}

    # Combine retrieved context
    context = "\n".join([doc.page_content for doc in customer_docs])

    # Generate email using retrieved context with better instruction following
    prompt = f"""You are a sales representative writing an email.

MOST IMPORTANT: Follow these specific user instructions exactly:
{user_instructions}

Customer context (only use what's relevant to the instructions):
{context}

Guidelines:
1. PRIORITIZE the user instructions above all else
2. Keep the email CONCISE - only include information directly relevant to the user's request
3. End with a specific, actionable next step that includes a concrete timeline (e.g., "I'll follow up with pricing by Friday" or "Let's schedule a 15-minute call this week")
4. Only reference customer information if it's directly relevant to the user's instructions

Write a brief, focused email that satisfies the user's exact request."""

    response = client.chat.completions.create(
        model=model_name,  # Use the same model as v1 so the comparison isolates the prompt change
        messages=[
            {"role": "system", "content": "You are a helpful sales assistant who writes concise, instruction-focused emails."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=2000
    )

    return {"email": response.choices[0].message.content}

# Test the new version
result = generate_sales_email_v2("Acme Corp", "Follow up after product demo")
print(result["email"])

Step 7: Evaluate the new version and compare

Run the evaluation on the improved version using the same judges and dataset to see if you've successfully addressed the issues.

Python
import mlflow

# Run evaluation of the new version with the same judges as before
# Use start_run to name the evaluation run in the UI
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset,  # same eval dataset
        predict_fn=generate_sales_email_v2,  # new app version
        scorers=email_judges,  # same judges as step 4
    )
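Before opening the UI, you can also diff aggregate metrics programmatically. The sketch below labels each shared metric as improved, regressed, or unchanged between two runs; the metric names and values are hypothetical, stand-ins for the aggregates MLflow reports:

```python
def compare_metrics(v1: dict, v2: dict) -> dict:
    """Label each metric present in both runs (assumes higher is better)."""
    labels = {}
    for name in v1.keys() & v2.keys():
        delta = v2[name] - v1[name]
        labels[name] = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    return labels

# Hypothetical pass rates from the two evaluation runs
v1_metrics = {"follows_instructions/mean": 0.43, "concise_communication/mean": 0.29, "safety/mean": 1.0}
v2_metrics = {"follows_instructions/mean": 0.86, "concise_communication/mean": 0.71, "safety/mean": 1.0}

print(compare_metrics(v1_metrics, v2_metrics))
```

A "regressed" label on any judge is a signal to inspect the corresponding traces before shipping the new version.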

Step 8: Compare results

Compare the results to understand if the changes improved quality.

  1. Click Experiments in the sidebar to display the Experiments page.

  2. Click on the name of your experiment to open it.

  3. In the left sidebar, click Evaluation runs. The left pane shows a list of evaluation runs for this experiment.

    Runs pane

  4. Check the boxes for the runs you want to compare.

  5. From the Actions drop-down menu, select Compare.

    Select runs to compare

  6. The right pane displays a comparison of each trace in the selected runs.

    Trace comparison screen

  7. For more details, click on the request identifier in the Request column. A window appears showing the full traces for the request from each run selected for comparison.

    Comparison details window

    To see the details of each assessment, click See details. To see the trace details, click See detailed trace view.

Step 9: Continue iterating

Based on the evaluation results, you can continue iterating to improve the application's quality and test each new fix.

Example notebook

The following notebook includes all of the code on this page.

Evaluating a GenAI app quickstart notebook


Next steps

Reference guides