Human feedback quickstart

Experience the complete human feedback lifecycle in 5 minutes. This quickstart shows you how to collect end-user feedback, add developer annotations, create expert review sessions, and use that feedback to evaluate your GenAI app's quality.

Prerequisites

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
  2. Create an MLflow experiment by following the Set up your environment quickstart.

Step 1: Create and trace a simple app

First, create a simple GenAI app using an LLM with MLflow tracing:

Python
import mlflow
from openai import OpenAI

# Enable automatic tracing for all OpenAI API calls
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Create a RAG app with tracing
@mlflow.trace
def my_chatbot(user_question: str) -> str:
    # Retrieve relevant context
    context = retrieve_context(user_question)

    # Generate response using LLM with retrieved context
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # If using OpenAI directly, use "gpt-4o" or "gpt-3.5-turbo"
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Use the provided context to answer questions."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {user_question}"}
        ],
        temperature=0.7,
        max_tokens=150
    )
    return response.choices[0].message.content

@mlflow.trace(span_type="RETRIEVER")
def retrieve_context(query: str) -> str:
    # Simulated retrieval - in production, this would search a vector database
    if "mlflow" in query.lower():
        return "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides tools for experiment tracking, model packaging, and deployment."
    return "General information about machine learning and data science."

# Run the app to generate a trace
response = my_chatbot("What is MLflow?")
print(f"Response: {response}")

# Get the trace ID for the next step
trace_id = mlflow.get_last_active_trace_id()
print(f"Trace ID: {trace_id}")

Step 2: Collect end-user feedback

When users interact with your app, they can provide feedback through UI elements like thumbs up/down buttons. For this quickstart, we'll simulate an end user giving negative feedback by using the SDK directly:

Python
import mlflow
from mlflow.entities.assessment import AssessmentSource, AssessmentSourceType

# Simulate end-user feedback from your app
# In production, this would be triggered when a user clicks thumbs down in your UI
mlflow.log_feedback(
    trace_id=trace_id,
    name="user_feedback",
    value=False,  # False for thumbs down - user is unsatisfied
    rationale="Missing details about MLflow's key features like Projects and Model Registry",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="enduser_123",  # Would be actual user ID in production
    ),
)

print("✅ End-user feedback recorded!")

# In a real app, you would:
# 1. Return the trace_id with your response to the frontend
# 2. When user clicks thumbs up/down, call your backend API
# 3. Your backend would then call mlflow.log_feedback() with the trace_id
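
For example, a backend handler for that thumbs up/down call might look like the following. This is a minimal sketch: `handle_user_feedback` and its arguments are hypothetical names for your own backend code; only `mlflow.log_feedback`, `AssessmentSource`, and `AssessmentSourceType` come from the MLflow API shown above.

Python
import mlflow
from mlflow.entities.assessment import AssessmentSource, AssessmentSourceType

def handle_user_feedback(trace_id: str, thumbs_up: bool, user_id: str, comment: str = ""):
    """Hypothetical backend handler that records a UI thumbs up/down as MLflow feedback."""
    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_feedback",
        value=thumbs_up,  # True for thumbs up, False for thumbs down
        rationale=comment or None,  # Optional free-text comment from the user
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id=user_id,  # The authenticated end user's ID
        ),
    )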

Step 3: View feedback in the UI

Launch the MLflow UI to see your traces with feedback:

  1. Navigate to your MLflow Experiment
  2. Navigate to the Traces tab
  3. You'll see your trace with the end-user feedback column:
    • user_feedback showing the thumbs down (False)

Click on an individual trace to see detailed feedback in the Assessments panel.
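
You can also read the same feedback programmatically. A minimal sketch, assuming the trace object returned by `mlflow.get_trace()` exposes its attached assessments on `trace.info.assessments` (this is an assumption about the MLflow 3.x API; check the API reference for your version):

Python
import mlflow

# Fetch the trace from Step 1 and inspect any attached assessments
trace = mlflow.get_trace(trace_id)
for assessment in trace.info.assessments:  # assumption: assessments are exposed on trace.info
    print(f"{assessment.name}: {getattr(assessment, 'value', assessment)}")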


Step 4: Add developer annotations via the UI

As a developer, you can also add your own feedback and notes directly in the UI:

  1. In the Traces tab, click on a trace to open it
  2. Click on any span (choose the root span for trace-level feedback)
  3. Expand the Assessments tab on the right
  4. Click Add Assessment and fill in:
    • Type: Choose "Feedback" or "Expectation"
    • Name: e.g., "accuracy_score"
    • Value: Your assessment
    • Rationale: Optional explanation
  5. Click Create

The new feedback appears immediately as a column in the Traces table.
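
If you prefer to annotate from code, you can log the same kind of developer assessment with the SDK. A minimal sketch follows; the name `accuracy_score`, the values, and the developer ID are illustrative, and `mlflow.log_expectation` is assumed to be available in your MLflow 3.x version:

Python
import mlflow
from mlflow.entities.assessment import AssessmentSource, AssessmentSourceType

# Developer feedback on the same trace (equivalent to adding an Assessment in the UI)
mlflow.log_feedback(
    trace_id=trace_id,
    name="accuracy_score",
    value=0.5,  # Illustrative numeric score
    rationale="Response is correct but omits Projects and the Model Registry",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="developer@example.com",  # Illustrative developer identity
    ),
)

# A ground-truth expectation can be logged the same way (assumed mlflow.log_expectation API)
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="MLflow is an open-source ML lifecycle platform covering tracking, Projects, Models, and the Model Registry.",
)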

Step 5: Send trace for expert review

The negative end-user feedback from Step 2 signals a potential quality issue, but only domain experts can confirm whether there is truly a problem and provide the correct answer. Create a labeling session to collect authoritative expert feedback:

Python
import mlflow
from mlflow.genai.label_schemas import create_label_schema, InputCategorical, InputText
from mlflow.genai.labeling import create_labeling_session

# Define what feedback to collect
accuracy_schema = create_label_schema(
    name="response_accuracy",
    type="feedback",
    title="Is the response factually accurate?",
    input=InputCategorical(options=["Accurate", "Partially Accurate", "Inaccurate"]),
    overwrite=True
)

ideal_response_schema = create_label_schema(
    name="expected_response",
    type="expectation",
    title="What would be the ideal response?",
    input=InputText(),
    overwrite=True
)

# Create a labeling session
labeling_session = create_labeling_session(
    name="quickstart_review",
    label_schemas=[accuracy_schema.name, ideal_response_schema.name],
)

# Add your trace to the session
# Get the most recent trace from the current experiment
traces = mlflow.search_traces(
    max_results=1  # Gets the most recent trace
)
labeling_session.add_traces(traces)

# Share with reviewers
print(f"✅ Trace sent for review!")
print(f"Share this link with reviewers: {labeling_session.url}")


Expert reviewers can now:

  1. Open the Review App URL
  2. See your trace with the question and response (including any end-user feedback)
  3. Assess whether the response is actually accurate
  4. Provide the correct answer in expected_response if needed
  5. Submit their expert assessments as ground truth

Step 6: Use feedback to evaluate your app

Once experts provide feedback, use their expected_response labels to evaluate your app with MLflow's Correctness scorer:

note

Here, we use the traces directly for evaluation. In your application, we recommend adding labeled traces to an MLflow Evaluation Dataset, which provides version tracking and lineage. Follow the create evaluation set guide to learn more.
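
For reference, here is a minimal sketch of that dataset workflow, assuming the `mlflow.genai.datasets` API as of MLflow 3.1 and a hypothetical Unity Catalog table name `catalog.schema.eval_dataset` (parameter names may differ in other versions):

Python
import mlflow
from mlflow.genai.datasets import create_dataset

# Pull the expert-labeled traces from the labeling session (same query as the evaluation step below)
labeled = mlflow.search_traces(run_id=labeling_session.mlflow_run_id)

# Create an evaluation dataset backed by a Unity Catalog table (hypothetical table name)
dataset = create_dataset(uc_table_name="catalog.schema.eval_dataset")

# Merge the labeled traces so their expectations are versioned with the dataset
dataset.merge_records(labeled)

For this quickstart, the evaluation below runs directly on the labeled traces: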

Python
import mlflow
from mlflow.genai.scorers import Correctness

# Get traces from the labeling session
labeled_traces = mlflow.search_traces(
    run_id=labeling_session.mlflow_run_id,  # Labeling Sessions are MLflow Runs
)

# Evaluate your app against expert expectations
eval_results = mlflow.genai.evaluate(
    data=labeled_traces,
    predict_fn=my_chatbot,  # The app we created in Step 1
    scorers=[Correctness()]  # Compares outputs to expected_response
)

The Correctness scorer compares your app's outputs against the expert-provided expected_response, giving you quantitative feedback on alignment with expert expectations.
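
To inspect the results programmatically rather than in the UI, something like the following should work. This is a sketch under the assumption that the object returned by mlflow.genai.evaluate exposes metrics and run_id attributes; consult the API reference for your MLflow version:

Python
# Aggregate scores computed by the Correctness scorer (assumed `metrics` attribute)
print(eval_results.metrics)

# The evaluation is logged as an MLflow run; open it in the Experiment UI (assumed `run_id` attribute)
print(f"Evaluation run ID: {eval_results.run_id}")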


What you've learned

You've experienced the complete human feedback lifecycle:

  • ✅ Instrumented a GenAI app with MLflow tracing
  • ✅ Collected end-user feedback (simulated via SDK)
  • ✅ Added developer feedback interactively through the UI
  • ✅ Viewed feedback alongside your traces
  • ✅ Created a labeling session for structured expert review
  • ✅ Used expert feedback to evaluate app quality

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.