Labeling Sessions
Labeling Sessions provide a structured way to gather feedback from domain experts on the behavior of your GenAI applications. A Labeling Session is a special type of MLflow Run that contains a specific set of traces that you want domain experts to review using the MLflow Review App.
The goal of a Labeling Session is to collect human-generated Assessments (labels) on existing MLflow Traces. You can capture either Feedback or Expectations, which can then be used to improve your GenAI app through systematic evaluation.
Since a Labeling Session is an MLflow Run, the collected data (traces and their associated Assessments) can be accessed programmatically using MLflow SDKs (e.g., mlflow.search_runs()) and visualized within the MLflow UI - each Labeling Session appears in the Evaluations tab.
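For example, here is a minimal sketch of accessing a session's run and its linked traces programmatically. The experiment and run IDs below are placeholders, and it assumes the session's traces are linked to its backing run:
import mlflow

# Placeholder IDs for illustration
experiment_id = "123456789"
labeling_session_run_id = "abcdef1234567890"

# Labeling Sessions appear as runs within the experiment
runs = mlflow.search_runs(experiment_ids=[experiment_id])
print(runs[["run_id", "tags.mlflow.runName"]])  # runName column exists for named runs

# Traces linked to a Labeling Session can be fetched via its run ID
traces = mlflow.search_traces(run_id=labeling_session_run_id)
print(f"The session contains {len(traces)} traces")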
How Labeling Sessions Work
When you create a Labeling Session, you define:
- Name: A descriptive identifier for the session
- Assigned Users: Domain experts who will provide labels
- Agent: (Optional) The GenAI app to generate responses if needed
- Label Schemas: The questions and format for feedback collection
- Multi-turn Chat: Whether to support conversation-style labeling
The session acts as a container for traces and their associated labels, enabling systematic feedback collection that can drive evaluation and improvement workflows.
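As a minimal sketch, these options map onto create_labeling_session() roughly as follows. The agent and enable_multi_turn_chat keyword names are assumptions here; check the API reference for your MLflow version:
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

session = labeling.create_labeling_session(
    name="support_bot_review",  # Name
    assigned_users=["expert@company.com"],  # Assigned Users
    agent="support_bot",  # Agent (optional; keyword and agent name assumed)
    label_schemas=[schemas.EXPECTED_FACTS],  # Label Schemas
    enable_multi_turn_chat=True,  # Multi-turn Chat (keyword assumed)
)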
Creating Labeling Sessions
Use mlflow.genai.labeling.create_labeling_session() to create new sessions with specific configurations.
Creating Sessions Through the UI
Navigate to the Labeling tab in the MLflow UI to create sessions visually. This provides an intuitive interface for defining session parameters, assigning users, and selecting label schemas without writing code.
Viewing Sessions Through the UI
Navigate to the Labeling tab in the MLflow UI to view sessions visually.
Creating Sessions Programmatically
Use the MLflow SDK to create sessions with full programmatic control over all configuration options.
Basic Session Creation
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

# Create a simple labeling session with built-in schemas
session = labeling.create_labeling_session(
    name="customer_service_review_jan_2024",
    assigned_users=["alice@company.com", "bob@company.com"],
    label_schemas=[schemas.EXPECTED_FACTS],  # Required: at least one schema needed
)

print(f"Created session: {session.name}")
print(f"Session ID: {session.labeling_session_id}")
Labeling session names are not guaranteed to be unique; multiple sessions can have the same name. For reliable programmatic access, store and reference sessions by their MLflow Run ID (session.mlflow_run_id) rather than by name.
Label schemas are required when creating a labeling session. You can use built-in schemas (EXPECTED_FACTS, EXPECTED_RESPONSE, GUIDELINES) or create custom ones. See the Labeling Schemas guide for detailed information on creating and using schemas.
Session with Custom Label Schemas
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

# Create custom schemas first (see Labeling Schemas guide)
quality_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="Rate the response quality",
    input=schemas.InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    overwrite=True,
)

# Create the session using the schemas
session = labeling.create_labeling_session(
    name="quality_assessment_session",
    assigned_users=["expert@company.com"],
    label_schemas=["response_quality", schemas.EXPECTED_FACTS],
)
Managing Labeling Sessions
Since labeling session names are not unique, finding sessions by name may return multiple matches or unexpected results. For production workflows, it's recommended to store and reference sessions by their MLflow Run ID.
Retrieving Sessions
import mlflow.genai.labeling as labeling

# Get all labeling sessions
all_sessions = labeling.get_labeling_sessions()

print(f"Found {len(all_sessions)} sessions")
for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")
Getting a Specific Session
import mlflow
import mlflow.genai.labeling as labeling

# Get all labeling sessions first
all_sessions = labeling.get_labeling_sessions()

# Find a session by name (note: names may not be unique)
target_session = None
for session in all_sessions:
    if session.name == "customer_service_review_jan_2024":
        target_session = session
        break

if target_session:
    print(f"Session name: {target_session.name}")
    print(f"Experiment ID: {target_session.experiment_id}")
    print(f"MLflow Run ID: {target_session.mlflow_run_id}")
    print(f"Label schemas: {target_session.label_schemas}")
else:
    print("Session not found")

# Alternative: fetch the session's backing run directly by MLflow Run ID (if you know it)
run_id = "your_labeling_session_run_id"
run = mlflow.get_run(run_id)

print(f"Found labeling session run: {run.info.run_id}")
print(f"Session name: {run.data.tags.get('mlflow.runName')}")
Since labeling session names are not guaranteed to be unique, it's recommended to store and reference sessions by their MLflow Run ID when you need to retrieve specific sessions programmatically.
Deleting Sessions
import mlflow.genai.labeling as labeling

# Find the session to delete by name
all_sessions = labeling.get_labeling_sessions()
session_to_delete = None
for session in all_sessions:
    if session.name == "customer_service_review_jan_2024":
        session_to_delete = session
        break

if session_to_delete:
    # Delete the session (removes it from the Review App)
    review_app = labeling.delete_labeling_session(session_to_delete)
    print(f"Deleted session: {session_to_delete.name}")
else:
    print("Session not found")
Adding Traces to Sessions
Once created, populate your session with traces for expert review using the add_traces() method.
For details on how traces are rendered and displayed to labelers in the Review App UI, including how different data types (dictionaries, OpenAI messages, tool calls) are presented, see the Review App guide.
Adding Traces Through the UI
You can also add traces directly through the MLflow UI by navigating to the Traces tab, selecting the traces you want to include, and using the export functionality to add them to a labeling session.
Adding Traces from Search Results
import mlflow
import mlflow.genai.labeling as labeling
from openai import OpenAI

# First, let's create some sample traces with a simple app
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)


@mlflow.trace
def support_app(question: str):
    """Simple support app that generates responses"""
    mlflow.update_current_trace(tags={"test_tag": "C001"})
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks-hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
    )
    return {"response": response.choices[0].message.content}


# Generate some sample traces
with mlflow.start_run():
    # Create a couple of sample traces for demonstration
    support_app("My order is delayed")
    support_app("I can't log into my account")

# Now search for traces to label
traces_df = mlflow.search_traces(
    filter_string="tags.test_tag = 'C001'", max_results=50
)

# Create the session and add traces
session = labeling.create_labeling_session(
    name="negative_feedback_review",
    assigned_users=["quality_expert@company.com"],
    label_schemas=["response_quality", "expected_facts"],
)

# Add traces from search results
session.add_traces(traces_df)
print(f"Added {len(traces_df)} traces to session")
Adding Individual Trace Objects
import mlflow
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas
from openai import OpenAI

# Set up the app to generate traces
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)


@mlflow.trace
def support_app(question: str):
    """Simple support app that generates responses"""
    mlflow.update_current_trace(tags={"test_tag": "C001"})
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks-hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
    )
    return {"response": response.choices[0].message.content}


# Generate specific traces for edge cases
with mlflow.start_run() as run:
    # Create traces for specific scenarios
    support_app("What's your refund policy?")
    trace_id_1 = mlflow.get_last_active_trace_id()

    support_app("How do I cancel my subscription?")
    trace_id_2 = mlflow.get_last_active_trace_id()

    support_app("The website is down")
    trace_id_3 = mlflow.get_last_active_trace_id()

# Get the trace objects
trace1 = mlflow.get_trace(trace_id_1)
trace2 = mlflow.get_trace(trace_id_2)
trace3 = mlflow.get_trace(trace_id_3)

# Create the session and add traces
session = labeling.create_labeling_session(
    name="edge_case_review",
    assigned_users=["domain_expert@company.com"],
    label_schemas=["response_quality", schemas.EXPECTED_FACTS],
)

# Add individual traces
session.add_traces([trace1, trace2, trace3])
Managing Assigned Users
User Access Requirements
Any user in the Databricks account can be assigned to a labeling session, regardless of whether they have workspace access. However, granting a user permission to a labeling session will give them access to the labeling session's MLflow experiment.
Setup permissions for users
- For users who do not have access to the workspace, an account admin uses account-level SCIM provisioning to sync users and groups automatically from your identity provider to your Databricks account. You can also manually register these users and groups to give them access when you set up identities in Databricks. See User and group management.
- For users who already have access to the workspace that contains the review app, no additional configuration is required.
When you assign users to a labeling session, the system automatically grants the necessary WRITE permissions on the MLflow Experiment containing the Labeling Session. This gives assigned users access to view and interact with the experiment data.
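After assigning users, you typically share the session's Review App link with them. A minimal sketch follows; the url attribute on the session object is an assumption here:
import mlflow.genai.labeling as labeling

# Create a session with one assigned expert
session = labeling.create_labeling_session(
    name="permissions_demo_session",
    assigned_users=["expert@company.com"],
    label_schemas=["response_quality"],
)

# Share this link with the assigned experts (url attribute assumed)
print(f"Review App link: {session.url}")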
Adding Users to Existing Sessions
import mlflow.genai.labeling as labeling

# Find an existing session by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "customer_review_session":
        session = s
        break

if session:
    # Add more users to the session
    new_users = ["expert2@company.com", "expert3@company.com"]
    session.set_assigned_users(session.assigned_users + new_users)
    print(f"Session now has users: {session.assigned_users}")
else:
    print("Session not found")
Replacing Assigned Users
import mlflow.genai.labeling as labeling

# Find the session by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "session_name":
        session = s
        break

if session:
    # Replace all assigned users
    session.set_assigned_users(["new_expert@company.com", "lead_reviewer@company.com"])
    print("Updated assigned users list")
else:
    print("Session not found")
Syncing to Evaluation Datasets
A powerful feature of Labeling Sessions is the ability to synchronize collected Expectations to Evaluation Datasets.
How Dataset Synchronization Works
The sync() method performs an intelligent upsert operation:
- Unique Key: Each trace's inputs serve as a unique key to identify records in the dataset
- Expectation Updates: For traces with matching inputs, expectations from the labeling session overwrite existing expectations in the dataset (if the expectation names are the same)
- New Records: Traces from the labeling session that don't exist in the dataset (based on input matching) are added as new records
- Preservation: Existing dataset records with different inputs remain unchanged
This approach allows you to iteratively improve your evaluation dataset by adding new examples and updating ground truth for existing ones without losing previous work.
Dataset Synchronization
import mlflow.genai.labeling as labeling

# Find a session with completed labels by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "completed_review_session":
        session = s
        break

if session:
    # Sync expectations to the evaluation dataset
    session.sync(dataset_name="customer_service_eval_dataset")
    print("Synced expectations to evaluation dataset")
else:
    print("Session not found")
Best Practices
Session Organization
- Descriptive names: Use clear, date-stamped names like customer_service_review_march_2024
- Focused scope: Keep sessions focused on specific evaluation goals or time periods
- Appropriate size: Aim for 25-100 traces per session to avoid reviewer fatigue
Session identification: Since session names are not unique, always store the session.mlflow_run_id when you create a session. Use the run ID for programmatic access instead of relying on session names.
import mlflow.genai.labeling as labeling
# Good: Store run ID for later reference
session = labeling.create_labeling_session(name="my_session", ...)
session_run_id = session.mlflow_run_id # Store this!
# Later: retrieve the session's run directly from the stored run ID
# rather than searching by name through all sessions
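For example, a minimal sketch of retrieving the session's backing run later from the stored ID:
import mlflow

# session_run_id was stored when the session was created (see above)
run = mlflow.get_run(session_run_id)
print(run.info.run_id, run.data.tags.get("mlflow.runName"))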
User Management
- Clear assignments: Assign users based on domain expertise and availability
- Balanced workload: Distribute labeling work evenly across multiple experts
- Permission awareness: Remember that assigned users are automatically granted access to the session's MLflow Experiment, and users without workspace access must be registered in the Databricks account (see User Access Requirements above)
Summary
MLflow Labeling Sessions provide a structured framework for collecting expert feedback on GenAI applications. By combining sessions with Label Schemas and the Review App, you can systematically gather high-quality human assessments that drive evaluation and improvement workflows.
Key capabilities include:
- Flexible session creation with custom configurations
- User assignment and permission management
- Dataset synchronization for evaluation workflows
Use Labeling Sessions to transform ad-hoc feedback collection into systematic, repeatable processes that continuously improve your GenAI applications.
Next Steps
- Label existing traces - Step-by-step guide using labeling sessions
- Create custom labeling schemas - Define structured feedback questions
- Build evaluation datasets - Convert labeled sessions into test datasets