Labeling Sessions
Labeling Sessions provide a structured way to gather feedback from domain experts on the behavior of your GenAI applications. A Labeling Session is a special type of MLflow Run that contains a specific set of traces that you want domain experts to review using the MLflow Review App.
The goal of a Labeling Session is to collect human-generated Assessments (labels) on existing MLflow Traces. You can capture either Feedback or Expectations, which can then be used to improve your GenAI app through systematic evaluation.
Since a Labeling Session is an MLflow Run, the collected data (traces and their associated Assessments) can be accessed programmatically using MLflow SDKs (e.g., mlflow.search_runs()) and visualized within the MLflow UI - each Labeling Session appears in the Evaluations tab.
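For example, here is a minimal sketch of accessing a session's run and its linked traces programmatically. The experiment and run IDs below are placeholders, and it assumes the session's traces are linked to its backing run:
import mlflow

# Placeholder IDs for illustration
experiment_id = "123456789"
labeling_session_run_id = "abcdef1234567890"

# Labeling Sessions appear as runs within the experiment
runs = mlflow.search_runs(experiment_ids=[experiment_id])
print(runs[["run_id", "tags.mlflow.runName"]])  # runName column exists for named runs

# Traces linked to a Labeling Session can be fetched via its run ID
traces = mlflow.search_traces(run_id=labeling_session_run_id)
print(f"The session contains {len(traces)} traces")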
How Labeling Sessions Work
When you create a Labeling Session, you define:
- Name: A descriptive identifier for the session
- Assigned Users: Domain experts who will provide labels
- Agent: (Optional) The GenAI app to generate responses if needed
- Label Schemas: The questions and format for feedback collection
- Multi-turn Chat: Whether to support conversation-style labeling
The session acts as a container for traces and their associated labels, enabling systematic feedback collection that can drive evaluation and improvement workflows.
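As a minimal sketch, these options map onto create_labeling_session() roughly as follows. The agent and enable_multi_turn_chat keyword names are assumptions here; check the API reference for your MLflow version:
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

session = labeling.create_labeling_session(
    name="support_bot_review",  # Name
    assigned_users=["expert@company.com"],  # Assigned Users
    agent="support_bot",  # Agent (optional; keyword and agent name assumed)
    label_schemas=[schemas.EXPECTED_FACTS],  # Label Schemas
    enable_multi_turn_chat=True,  # Multi-turn Chat (keyword assumed)
)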
Creating Labeling Sessions
Use mlflow.genai.labeling.create_labeling_session() to create new sessions with specific configurations.
Creating Sessions Through the UI
Navigate to the Labeling tab in the MLflow UI to create sessions visually. This provides an intuitive interface for defining session parameters, assigning users, and selecting label schemas without writing code.
Viewing Sessions Through the UI
Navigate to the Labeling tab in the MLflow UI to view sessions visually.
Creating Sessions Programmatically
Use the MLflow SDK to create sessions with full programmatic control over all configuration options.
Basic Session Creation
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

# Create a simple labeling session with built-in schemas
session = labeling.create_labeling_session(
    name="customer_service_review_jan_2024",
    assigned_users=["alice@company.com", "bob@company.com"],
    label_schemas=[schemas.EXPECTED_FACTS],  # Required: at least one schema needed
)

print(f"Created session: {session.name}")
print(f"Session ID: {session.labeling_session_id}")
Labeling session names are not guaranteed to be unique; multiple sessions can have the same name. For reliable programmatic access, store and reference sessions by their MLflow Run ID (session.mlflow_run_id) rather than by name.
Label schemas are required when creating a labeling session. You can use built-in schemas (EXPECTED_FACTS, EXPECTED_RESPONSE, GUIDELINES) or create custom ones. See the Labeling Schemas guide for detailed information on creating and using schemas.
Session with Custom Label Schemas
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

# Create custom schemas first (see Labeling Schemas guide)
quality_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="Rate the response quality",
    input=schemas.InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    overwrite=True,
)

# Create the session using the schemas
session = labeling.create_labeling_session(
    name="quality_assessment_session",
    assigned_users=["expert@company.com"],
    label_schemas=["response_quality", schemas.EXPECTED_FACTS],
)
Managing Labeling Sessions
Since labeling session names are not unique, finding sessions by name may return multiple matches or unexpected results. For production workflows, it's recommended to store and reference sessions by their MLflow Run ID.
Retrieving Sessions
import mlflow.genai.labeling as labeling

# Get all labeling sessions
all_sessions = labeling.get_labeling_sessions()

print(f"Found {len(all_sessions)} sessions")
for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")
Getting a Specific Session
import mlflow
import mlflow.genai.labeling as labeling

# Get all labeling sessions first
all_sessions = labeling.get_labeling_sessions()

# Find a session by name (note: names may not be unique)
target_session = None
for session in all_sessions:
    if session.name == "customer_service_review_jan_2024":
        target_session = session
        break

if target_session:
    print(f"Session name: {target_session.name}")
    print(f"Experiment ID: {target_session.experiment_id}")
    print(f"MLflow Run ID: {target_session.mlflow_run_id}")
    print(f"Label schemas: {target_session.label_schemas}")
else:
    print("Session not found")

# Alternative: fetch the session's backing run directly by MLflow Run ID (if you know it)
run_id = "your_labeling_session_run_id"
run = mlflow.get_run(run_id)

print(f"Found labeling session run: {run.info.run_id}")
print(f"Session name: {run.data.tags.get('mlflow.runName')}")
Since labeling session names are not guaranteed to be unique, it's recommended to store and reference sessions by their MLflow Run ID when you need to retrieve specific sessions programmatically.
Deleting Sessions
import mlflow.genai.labeling as labeling

# Find the session to delete by name
all_sessions = labeling.get_labeling_sessions()
session_to_delete = None
for session in all_sessions:
    if session.name == "customer_service_review_jan_2024":
        session_to_delete = session
        break

if session_to_delete:
    # Delete the session (removes it from the Review App)
    review_app = labeling.delete_labeling_session(session_to_delete)
    print(f"Deleted session: {session_to_delete.name}")
else:
    print("Session not found")
Adding Traces to Sessions
Once created, populate your session with traces for expert review using the add_traces() method.
For details on how traces are rendered and displayed to labelers in the Review App UI, including how different data types (dictionaries, OpenAI messages, tool calls) are presented, see the Review App guide.
Adding Traces Through the UI
You can also add traces directly through the MLflow UI by navigating to the Traces tab, selecting the traces you want to include, and using the export functionality to add them to a labeling session.
Adding Traces from Search Results
import mlflow
import mlflow.genai.labeling as labeling
from openai import OpenAI

# First, let's create some sample traces with a simple app
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)


@mlflow.trace
def support_app(question: str):
    """Simple support app that generates responses"""
    mlflow.update_current_trace(tags={"test_tag": "C001"})
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks-hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
    )
    return {"response": response.choices[0].message.content}


# Generate some sample traces
with mlflow.start_run():
    # Create a couple of sample traces for demonstration
    support_app("My order is delayed")
    support_app("I can't log into my account")

# Now search for traces to label
traces_df = mlflow.search_traces(
    filter_string="tags.test_tag = 'C001'", max_results=50
)

# Create the session and add traces
session = labeling.create_labeling_session(
    name="negative_feedback_review",
    assigned_users=["quality_expert@company.com"],
    label_schemas=["response_quality", "expected_facts"],
)

# Add traces from search results
session.add_traces(traces_df)
print(f"Added {len(traces_df)} traces to session")
Adding Individual Trace Objects
import mlflow
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas
from openai import OpenAI

# Set up the app to generate traces
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)


@mlflow.trace
def support_app(question: str):
    """Simple support app that generates responses"""
    mlflow.update_current_trace(tags={"test_tag": "C001"})
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks-hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
    )
    return {"response": response.choices[0].message.content}


# Generate specific traces for edge cases
with mlflow.start_run() as run:
    # Create traces for specific scenarios
    support_app("What's your refund policy?")
    trace_id_1 = mlflow.get_last_active_trace_id()

    support_app("How do I cancel my subscription?")
    trace_id_2 = mlflow.get_last_active_trace_id()

    support_app("The website is down")
    trace_id_3 = mlflow.get_last_active_trace_id()

# Get the trace objects
trace1 = mlflow.get_trace(trace_id_1)
trace2 = mlflow.get_trace(trace_id_2)
trace3 = mlflow.get_trace(trace_id_3)

# Create the session and add traces
session = labeling.create_labeling_session(
    name="edge_case_review",
    assigned_users=["domain_expert@company.com"],
    label_schemas=["response_quality", schemas.EXPECTED_FACTS],
)

# Add individual traces
session.add_traces([trace1, trace2, trace3])
Managing Assigned Users
User Access Requirements
Any user in the Databricks account can be assigned to a labeling session, regardless of whether they have workspace access. However, granting a user permission to a labeling session will give them access to the labeling session's MLflow experiment.
Setup permissions for users
- For users who do not have access to the workspace, an account admin uses account-level SCIM provisioning to sync users and groups automatically from your identity provider to your Databricks account. You can also manually register these users and groups to give them access when you set up identities in Databricks. See User and group management.
- For users who already have access to the workspace that contains the review app, no additional configuration is required.
When you assign users to a labeling session, the system automatically grants the necessary WRITE permissions on the MLflow Experiment containing the Labeling Session. This gives assigned users access to view and interact with the experiment data.
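After assigning users, you typically share the session's Review App link with them. A minimal sketch follows; the url attribute on the session object is an assumption here:
import mlflow.genai.labeling as labeling

# Create a session with one assigned expert
session = labeling.create_labeling_session(
    name="permissions_demo_session",
    assigned_users=["expert@company.com"],
    label_schemas=["response_quality"],
)

# Share this link with the assigned experts (url attribute assumed)
print(f"Review App link: {session.url}")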
Adding Users to Existing Sessions
import mlflow.genai.labeling as labeling

# Find an existing session by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "customer_review_session":
        session = s
        break

if session:
    # Add more users to the session
    new_users = ["expert2@company.com", "expert3@company.com"]
    session.set_assigned_users(session.assigned_users + new_users)
    print(f"Session now has users: {session.assigned_users}")
else:
    print("Session not found")
Replacing Assigned Users
import mlflow.genai.labeling as labeling

# Find the session by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "session_name":
        session = s
        break

if session:
    # Replace all assigned users
    session.set_assigned_users(["new_expert@company.com", "lead_reviewer@company.com"])
    print("Updated assigned users list")
else:
    print("Session not found")
Syncing to Evaluation Datasets
A powerful feature of Labeling Sessions is the ability to synchronize collected Expectations to Evaluation Datasets.
How Dataset Synchronization Works
The sync() method performs an intelligent upsert operation:
- Unique Key: Each trace's inputs serve as a unique key to identify records in the dataset
- Expectation Updates: For traces with matching inputs, expectations from the labeling session overwrite existing expectations in the dataset (if the expectation names are the same)
- New Records: Traces from the labeling session that don't exist in the dataset (based on input matching) are added as new records
- Preservation: Existing dataset records with different inputs remain unchanged
This approach allows you to iteratively improve your evaluation dataset by adding new examples and updating ground truth for existing ones without losing previous work.
Dataset Synchronization
import mlflow.genai.labeling as labeling

# Find a session with completed labels by name
all_sessions = labeling.get_labeling_sessions()
session = None
for s in all_sessions:
    if s.name == "completed_review_session":
        session = s
        break

if session:
    # Sync expectations to the evaluation dataset
    session.sync(dataset_name="customer_service_eval_dataset")
    print("Synced expectations to evaluation dataset")
else:
    print("Session not found")
Best Practices
Session Organization
- Descriptive names: Use clear, date-stamped names like customer_service_review_march_2024
- Focused scope: Keep sessions focused on specific evaluation goals or time periods
- Appropriate size: Aim for 25-100 traces per session to avoid reviewer fatigue
Session identification: Since session names are not unique, always store the session.mlflow_run_id when you create a session. Use the run ID for programmatic access instead of relying on session names.
import mlflow.genai.labeling as labeling
# Good: Store run ID for later reference
session = labeling.create_labeling_session(name="my_session", ...)
session_run_id = session.mlflow_run_id # Store this!
# Later: retrieve the session's run directly from the stored run ID
# rather than searching by name through all sessions
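For example, a minimal sketch of retrieving the session's backing run later from the stored ID:
import mlflow

# session_run_id was stored when the session was created (see above)
run = mlflow.get_run(session_run_id)
print(run.info.run_id, run.data.tags.get("mlflow.runName"))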
User Management
- Clear assignments: Assign users based on domain expertise and availability
- Balanced workload: Distribute labeling work evenly across multiple experts
- Permission awareness: Remember that assigned users are automatically granted access to the session's MLflow Experiment, and users without workspace access must be registered in the Databricks account (see User Access Requirements above)
Summary
MLflow Labeling Sessions provide a structured framework for collecting expert feedback on GenAI applications. By combining sessions with Label Schemas and the Review App, you can systematically gather high-quality human assessments that drive evaluation and improvement workflows.
Key capabilities include:
- Flexible session creation with custom configurations
- User assignment and permission management
- Dataset synchronization for evaluation workflows
Use Labeling Sessions to transform ad-hoc feedback collection into systematic, repeatable processes that continuously improve your GenAI applications.
Next Steps
- Label existing traces - Step-by-step guide using labeling sessions
- Create custom labeling schemas - Define structured feedback questions
- Build evaluation datasets - Convert labeled sessions into test datasets