
Labeling Schemas

Labeling Schemas define the specific questions that domain experts answer when labeling existing traces in the Review App. They structure the feedback collection process so that you gather consistent, relevant information for evaluating your GenAI app.

note

Labeling Schemas apply only when you use the Review App to label existing traces, not when you use it to test new app versions in the chat UI.

How Labeling Schemas Work

When you create a Labeling Session, you associate it with one or more Labeling Schemas. Each schema represents either a Feedback or Expectation Assessment that gets attached to an MLflow Trace.

The schemas control:

  • The question shown to reviewers
  • The input method (dropdown, text box, etc.)
  • Validation rules and constraints
  • Optional instructions and comments
important

Labeling Schema names must be unique within each MLflow Experiment. You cannot have two schemas with the same name in the same experiment, but you can reuse schema names across different experiments.
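For example, each of these controls maps directly to a parameter of create_label_schema. The following minimal sketch uses an illustrative schema name and options; it is not one of the built-in schemas:

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical

clarity_schema = schemas.create_label_schema(
    name="answer_clarity",  # illustrative name; must be unique within the experiment
    type="feedback",
    title="How clear is this response?",  # the question shown to reviewers
    input=InputCategorical(options=["Clear", "Somewhat clear", "Unclear"]),  # input method
    instruction="Judge clarity only, not factual accuracy.",  # optional guidance for reviewers
    enable_comment=True,  # optionally allow free-form comments
    overwrite=True,  # replace an existing schema of the same name
)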

Labeling Schemas for common use cases

MLflow provides reserved schema names for the built-in scorers that use expectations. Create custom schemas with these names to ensure compatibility with the built-in evaluation functionality.

  • Works with the guidelines scorer
    • GUIDELINES: Collects ideal instructions the GenAI app should follow for a request
  • Works with the correctness scorer
    • EXPECTED_FACTS: Collects factual statements that must be included for correctness
    • EXPECTED_RESPONSE: Collects the complete ground-truth answer

Creating schemas for common use cases

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import LabelSchemaType, InputTextList, InputText

# Schema for collecting expected facts
expected_facts_schema = schemas.create_label_schema(
    name=schemas.EXPECTED_FACTS,
    type=LabelSchemaType.EXPECTATION,
    title="Expected facts",
    input=InputTextList(max_length_each=1000),
    instruction="Please provide a list of facts that you expect to see in a correct response.",
    overwrite=True,
)

# Schema for collecting guidelines
guidelines_schema = schemas.create_label_schema(
    name=schemas.GUIDELINES,
    type=LabelSchemaType.EXPECTATION,
    title="Guidelines",
    input=InputTextList(max_length_each=500),
    instruction="Please provide guidelines that the model's output is expected to adhere to.",
    overwrite=True,
)

# Schema for collecting expected response
expected_response_schema = schemas.create_label_schema(
    name=schemas.EXPECTED_RESPONSE,
    type=LabelSchemaType.EXPECTATION,
    title="Expected response",
    input=InputText(),
    instruction="Please provide a correct agent response.",
    overwrite=True,
)
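Labels collected with these schemas can then feed the corresponding built-in scorers at evaluation time. Below is a minimal sketch using a hand-built dataset and assuming the expectation key matches the schema name; in practice, the expectations come from your labeled traces:

Python
import mlflow
from mlflow.genai.scorers import Correctness

# Hypothetical evaluation records; in practice these come from labeled traces
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open source platform for the ML lifecycle.",
        "expectations": {
            "expected_facts": [
                "MLflow is open source",
                "MLflow manages the ML lifecycle",
            ],
        },
    },
]

# The correctness scorer reads the expected facts collected by reviewers
mlflow.genai.evaluate(data=data, scorers=[Correctness()])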

Creating Custom Labeling Schemas

Create custom schemas to collect specific feedback for your domain. You can create schemas either through the MLflow UI or programmatically using the SDK.

note

Remember that schema names must be unique within your current MLflow Experiment. Choose descriptive names that clearly indicate the purpose of each schema.

Creating Schemas Through the UI

Navigate to the Labeling tab in the MLflow UI to create schemas visually. This provides an intuitive interface for defining questions, input types, and validation rules without writing code.


Creating Schemas Programmatically

All schemas require a name, type, title, and input specification.

Basic Schema Creation

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical

# Create a feedback schema for rating response quality
quality_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="How would you rate the overall quality of this response?",
    input=InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    instruction="Consider accuracy, relevance, and helpfulness when rating.",
)

Schema Types

Choose between two schema types:

  • feedback: Subjective assessments like ratings, preferences, or opinions
  • expectation: Objective ground truth like correct answers or expected behavior
Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical, InputTextList

# Feedback schema for subjective assessment
tone_schema = schemas.create_label_schema(
    name="response_tone",
    type="feedback",
    title="Is the response tone appropriate for the context?",
    input=InputCategorical(options=["Too formal", "Just right", "Too casual"]),
    enable_comment=True,  # Allow additional comments
)

# Expectation schema for ground truth
facts_schema = schemas.create_label_schema(
    name="required_facts",
    type="expectation",
    title="What facts must be included in a correct response?",
    input=InputTextList(max_count=5, max_length_each=200),
    instruction="List key facts that any correct response must contain.",
)

Managing Labeling Schemas

Use the SDK functions to programmatically manage your schemas:

Retrieving Schemas

Python
import mlflow.genai.label_schemas as schemas

# Get an existing schema
schema = schemas.get_label_schema("response_quality")
print(f"Schema: {schema.name}")
print(f"Type: {schema.type}")
print(f"Title: {schema.title}")

Updating Schemas

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical

# Update by recreating with overwrite=True
updated_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="Rate the response quality (updated question)",
    input=InputCategorical(options=["Excellent", "Good", "Fair", "Poor", "Very Poor"]),
    instruction="Updated: Focus on factual accuracy above all else.",
    overwrite=True,  # Replace existing schema
)

Deleting Schemas

Python
import mlflow.genai.label_schemas as schemas

# Remove a schema that's no longer needed
schemas.delete_label_schema("old_schema_name")

Input Types for Custom Schemas

MLflow supports five input types for collecting different kinds of feedback:

Single-Select Dropdown (InputCategorical)

Use for mutually exclusive options:

Python
from mlflow.genai.label_schemas import InputCategorical

# Rating scale
rating_input = InputCategorical(
    options=["1 - Poor", "2 - Below Average", "3 - Average", "4 - Good", "5 - Excellent"]
)

# Binary choice
safety_input = InputCategorical(options=["Safe", "Unsafe"])

# Multiple categories
error_type_input = InputCategorical(
    options=["Factual Error", "Logical Error", "Formatting Error", "No Error"]
)

Multi-Select Dropdown (InputCategoricalList)

Use when multiple options can be selected:

Python
from mlflow.genai.label_schemas import InputCategoricalList

# Multiple error types can be present
errors_input = InputCategoricalList(
    options=[
        "Factual inaccuracy",
        "Missing context",
        "Inappropriate tone",
        "Formatting issues",
        "Off-topic content",
    ]
)

# Multiple content types
content_input = InputCategoricalList(
    options=["Technical details", "Examples", "References", "Code samples"]
)

Free-Form Text (InputText)

Use for open-ended responses:

Python
from mlflow.genai.label_schemas import InputText

# General feedback
feedback_input = InputText(max_length=500)

# Specific improvement suggestions
improvement_input = InputText(
    max_length=200  # Limit length for focused feedback
)

# Short answers
summary_input = InputText(max_length=100)

Multiple Text Entries (InputTextList)

Use for collecting lists of text items:

Python
from mlflow.genai.label_schemas import InputTextList

# List of factual errors
errors_input = InputTextList(
    max_count=10,  # Maximum 10 errors
    max_length_each=150,  # Each error description limited to 150 chars
)

# Missing information
missing_input = InputTextList(
    max_count=5,
    max_length_each=200,
)

# Improvement suggestions
suggestions_input = InputTextList(max_count=3) # No length limit per item

Numeric Input (InputNumeric)

Use for numerical ratings or scores:

Python
from mlflow.genai.label_schemas import InputNumeric

# Confidence score
confidence_input = InputNumeric(
    min_value=0.0,
    max_value=1.0,
)

# Rating scale
rating_input = InputNumeric(
    min_value=1,
    max_value=10,
)

# Cost estimate
cost_input = InputNumeric(min_value=0) # No maximum limit

Complete Examples

Customer Service Evaluation

Here's a comprehensive example for evaluating customer service responses:

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import (
    InputCategorical,
    InputCategoricalList,
    InputTextList,
    InputNumeric,
)

# Overall quality rating
quality_schema = schemas.create_label_schema(
    name="service_quality",
    type="feedback",
    title="Rate the overall quality of this customer service response",
    input=InputCategorical(options=["Excellent", "Good", "Average", "Poor", "Very Poor"]),
    instruction="Consider helpfulness, accuracy, and professionalism.",
    enable_comment=True,
)

# Issues identification
issues_schema = schemas.create_label_schema(
    name="response_issues",
    type="feedback",
    title="What issues are present in this response? (Select all that apply)",
    input=InputCategoricalList(options=[
        "Factually incorrect information",
        "Unprofessional tone",
        "Doesn't address the question",
        "Too vague or generic",
        "Contains harmful content",
        "No issues identified",
    ]),
    instruction="Select all issues you identify. Choose 'No issues identified' if the response is problem-free.",
)

# Expected resolution steps
resolution_schema = schemas.create_label_schema(
    name="expected_resolution",
    type="expectation",
    title="What steps should be included in the ideal resolution?",
    input=InputTextList(max_count=5, max_length_each=200),
    instruction="List the key steps a customer service rep should take to properly resolve this issue.",
)

# Confidence in assessment
confidence_schema = schemas.create_label_schema(
    name="assessment_confidence",
    type="feedback",
    title="How confident are you in your assessment?",
    input=InputNumeric(min_value=1, max_value=10),
    instruction="Rate from 1 (not confident) to 10 (very confident).",
)

Medical Information Review

Example for evaluating medical information responses:

Python
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical, InputTextList, InputNumeric

# Safety assessment
safety_schema = schemas.create_label_schema(
    name="medical_safety",
    type="feedback",
    title="Is this medical information safe and appropriate?",
    input=InputCategorical(options=[
        "Safe - appropriate general information",
        "Concerning - may mislead patients",
        "Dangerous - could cause harm if followed",
    ]),
    instruction="Assess whether the information could be safely consumed by patients.",
)

# Required disclaimers
disclaimers_schema = schemas.create_label_schema(
    name="required_disclaimers",
    type="expectation",
    title="What medical disclaimers should be included?",
    input=InputTextList(max_count=3, max_length_each=300),
    instruction="List disclaimers that should be present (e.g., 'consult your doctor', 'not professional medical advice').",
)

# Accuracy of medical facts
accuracy_schema = schemas.create_label_schema(
    name="medical_accuracy",
    type="feedback",
    title="Rate the factual accuracy of the medical information",
    input=InputNumeric(min_value=0, max_value=100),
    instruction="Score from 0 (completely inaccurate) to 100 (completely accurate).",
)

Integration with Labeling Sessions

Once created, use your schemas in Labeling Sessions:

Python
import mlflow.genai.label_schemas as schemas

# Schemas are automatically available when creating labeling sessions
# The Review App will present questions based on your schema definitions

# Example: Using schemas in a session (conceptual - actual session creation
# happens through the Review App UI or other APIs)
session_schemas = [
    "service_quality",  # Your custom schema
    "response_issues",  # Your custom schema
    schemas.EXPECTED_FACTS,  # Built-in schema
]
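If your MLflow deployment includes the labeling SDK (mlflow.genai.labeling, available with the Databricks integration), creating a session programmatically might look like the sketch below. Treat the exact signature as an assumption and check your version's API reference:

Python
from mlflow.genai import labeling

# Sketch only: availability and signature depend on your MLflow version/deployment
session = labeling.create_labeling_session(
    name="customer_service_review",  # illustrative session name
    assigned_users=["reviewer@example.com"],  # hypothetical reviewer
    label_schemas=session_schemas,  # the schema names listed above
)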

Best Practices

Schema Design

  • Clear titles: Write questions as clear, specific prompts
  • Helpful instructions: Provide context to guide reviewers
  • Appropriate constraints: Set reasonable limits on text length and list counts
  • Logical options: For categorical inputs, ensure options are mutually exclusive and comprehensive

Schema Management

  • Consistent naming: Use descriptive, consistent names across your schemas
  • Version control: When updating schemas, consider the impact on existing sessions
  • Clean up: Delete unused schemas to keep your workspace organized

Input Type Selection

  • Use InputCategorical for standardized ratings or classifications
  • Use InputCategoricalList when multiple issues or features can be present
  • Use InputText for detailed explanations or custom feedback
  • Use InputTextList for structured lists of items
  • Use InputNumeric for precise scoring or confidence ratings

Next Steps