Labeling Schemas
Labeling Schemas define the specific questions that domain experts answer when labeling existing traces in the Review App. They structure the feedback collection process, ensuring consistent and relevant information for evaluating your GenAI app.
Labeling Schemas apply only when you use the Review App to label existing traces, not when you use the Review App to test new app versions in the chat UI.
How Labeling Schemas Work
When you create a Labeling Session, you associate it with one or more Labeling Schemas. Each schema represents either a Feedback or an Expectation Assessment that gets attached to an MLflow Trace.
The schemas control:
- The question shown to reviewers
- The input method (dropdown, text box, etc.)
- Validation rules and constraints
- Optional instructions and comments
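After reviewers submit their answers in the Review App, each answer is recorded as a Feedback or Expectation assessment on the labeled trace. The following is a minimal sketch of reading those assessments back, assuming MLflow 3 tracing APIs; the trace ID is a placeholder:
import mlflow

# Fetch a labeled trace and inspect the assessments attached to it.
# Assumption: trace.info.assessments and the .name/.value fields follow the
# MLflow 3 tracing API; field names may differ in other versions.
trace = mlflow.get_trace("tr-1234567890abcdef")  # placeholder trace ID
for assessment in trace.info.assessments:
    print(assessment.name, assessment.value)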
Labeling Schema names must be unique within each MLflow Experiment. You cannot have two schemas with the same name in the same experiment, but you can reuse schema names across different experiments.
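For example, the same schema name can be created in two different experiments without conflict, while recreating it within one experiment requires overwrite=True. A short sketch; the experiment names are placeholders:
import mlflow
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical

# Same schema name in two different experiments is allowed
mlflow.set_experiment("support-bot-eval")  # placeholder experiment name
schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="Rate the response quality",
    input=InputCategorical(options=["Good", "Bad"]),
)

mlflow.set_experiment("sales-bot-eval")  # placeholder experiment name
schemas.create_label_schema(
    name="response_quality",  # same name, different experiment
    type="feedback",
    title="Rate the response quality",
    input=InputCategorical(options=["Good", "Bad"]),
)

# Within a single experiment, pass overwrite=True to replace an existing schema with the same name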
Labeling Schemas for common use cases
MLflow provides predefined schema names for the built-in scorers that use expectations. Create custom schemas with these names to ensure compatibility with the built-in evaluation functionality.
- GUIDELINES: Collects ideal instructions the GenAI app should follow for a request. Works with the guidelines scorer.
- EXPECTED_FACTS: Collects factual statements that must be included for correctness. Works with the correctness scorer.
- EXPECTED_RESPONSE: Collects the complete ground-truth answer. Works with the correctness scorer.
Creating schemas for common use cases
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import LabelSchemaType, InputTextList, InputText

# Schema for collecting expected facts
expected_facts_schema = schemas.create_label_schema(
    name=schemas.EXPECTED_FACTS,
    type=LabelSchemaType.EXPECTATION,
    title="Expected facts",
    input=InputTextList(max_length_each=1000),
    instruction="Please provide a list of facts that you expect to see in a correct response.",
    overwrite=True
)

# Schema for collecting guidelines
guidelines_schema = schemas.create_label_schema(
    name=schemas.GUIDELINES,
    type=LabelSchemaType.EXPECTATION,
    title="Guidelines",
    input=InputTextList(max_length_each=500),
    instruction="Please provide guidelines that the model's output is expected to adhere to.",
    overwrite=True
)

# Schema for collecting expected response
expected_response_schema = schemas.create_label_schema(
    name=schemas.EXPECTED_RESPONSE,
    type=LabelSchemaType.EXPECTATION,
    title="Expected response",
    input=InputText(),
    instruction="Please provide a correct agent response.",
    overwrite=True
)
Creating Custom Labeling Schemas
Create custom schemas to collect specific feedback for your domain. You can create schemas either through the MLflow UI or programmatically using the SDK.
Remember that schema names must be unique within your current MLflow Experiment. Choose descriptive names that clearly indicate the purpose of each schema.
Creating Schemas Through the UI
Navigate to the Labeling tab in the MLflow UI to create schemas visually. This provides an intuitive interface for defining questions, input types, and validation rules without writing code.
Creating Schemas Programmatically
All schemas require a name, type, title, and input specification.
Basic Schema Creation
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical, InputText

# Create a feedback schema for rating response quality
quality_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="How would you rate the overall quality of this response?",
    input=InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    instruction="Consider accuracy, relevance, and helpfulness when rating."
)
Schema Types
Choose between two schema types:
- feedback: Subjective assessments like ratings, preferences, or opinions
- expectation: Objective ground truth like correct answers or expected behavior
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical, InputTextList

# Feedback schema for subjective assessment
tone_schema = schemas.create_label_schema(
    name="response_tone",
    type="feedback",
    title="Is the response tone appropriate for the context?",
    input=InputCategorical(options=["Too formal", "Just right", "Too casual"]),
    enable_comment=True  # Allow additional comments
)

# Expectation schema for ground truth
facts_schema = schemas.create_label_schema(
    name="required_facts",
    type="expectation",
    title="What facts must be included in a correct response?",
    input=InputTextList(max_count=5, max_length_each=200),
    instruction="List key facts that any correct response must contain."
)
Managing Labeling Schemas
Use the SDK functions to programmatically manage your schemas:
Retrieving Schemas
import mlflow.genai.label_schemas as schemas
# Get an existing schema
schema = schemas.get_label_schema("response_quality")
print(f"Schema: {schema.name}")
print(f"Type: {schema.type}")
print(f"Title: {schema.title}")
Updating Schemas
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical

# Update by recreating with overwrite=True
updated_schema = schemas.create_label_schema(
    name="response_quality",
    type="feedback",
    title="Rate the response quality (updated question)",
    input=InputCategorical(options=["Excellent", "Good", "Fair", "Poor", "Very Poor"]),
    instruction="Updated: Focus on factual accuracy above all else.",
    overwrite=True  # Replace existing schema
)
Deleting Schemas
import mlflow.genai.label_schemas as schemas
# Remove a schema that's no longer needed
schemas.delete_label_schema("old_schema_name")
Input Types for Custom Schemas
MLflow supports five input types for collecting different kinds of feedback:
Single-Select Dropdown (InputCategorical)
Use for mutually exclusive options:
from mlflow.genai.label_schemas import InputCategorical
# Rating scale
rating_input = InputCategorical(
    options=["1 - Poor", "2 - Below Average", "3 - Average", "4 - Good", "5 - Excellent"]
)

# Binary choice
safety_input = InputCategorical(options=["Safe", "Unsafe"])

# Multiple categories
error_type_input = InputCategorical(
    options=["Factual Error", "Logical Error", "Formatting Error", "No Error"]
)
Multi-Select Dropdown (InputCategoricalList)
Use when multiple options can be selected:
from mlflow.genai.label_schemas import InputCategoricalList
# Multiple error types can be present
errors_input = InputCategoricalList(
    options=[
        "Factual inaccuracy",
        "Missing context",
        "Inappropriate tone",
        "Formatting issues",
        "Off-topic content"
    ]
)

# Multiple content types
content_input = InputCategoricalList(
    options=["Technical details", "Examples", "References", "Code samples"]
)
Free-Form Text (InputText)
Use for open-ended responses:
from mlflow.genai.label_schemas import InputText
# General feedback
feedback_input = InputText(max_length=500)
# Specific improvement suggestions
improvement_input = InputText(
    max_length=200  # Limit length for focused feedback
)
# Short answers
summary_input = InputText(max_length=100)
Multiple Text Entries (InputTextList)
Use for collecting lists of text items:
from mlflow.genai.label_schemas import InputTextList
# List of factual errors
errors_input = InputTextList(
    max_count=10,  # Maximum 10 errors
    max_length_each=150  # Each error description limited to 150 chars
)

# Missing information
missing_input = InputTextList(
    max_count=5,
    max_length_each=200
)
# Improvement suggestions
suggestions_input = InputTextList(max_count=3) # No length limit per item
Numeric Input (InputNumeric)
Use for numerical ratings or scores:
from mlflow.genai.label_schemas import InputNumeric
# Confidence score
confidence_input = InputNumeric(
    min_value=0.0,
    max_value=1.0
)

# Rating scale
rating_input = InputNumeric(
    min_value=1,
    max_value=10
)
# Cost estimate
cost_input = InputNumeric(min_value=0) # No maximum limit
Complete Examples
Customer Service Evaluation
Here's a comprehensive example for evaluating customer service responses:
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import (
    InputCategorical,
    InputCategoricalList,
    InputText,
    InputTextList,
    InputNumeric
)

# Overall quality rating
quality_schema = schemas.create_label_schema(
    name="service_quality",
    type="feedback",
    title="Rate the overall quality of this customer service response",
    input=InputCategorical(options=["Excellent", "Good", "Average", "Poor", "Very Poor"]),
    instruction="Consider helpfulness, accuracy, and professionalism.",
    enable_comment=True
)

# Issues identification
issues_schema = schemas.create_label_schema(
    name="response_issues",
    type="feedback",
    title="What issues are present in this response? (Select all that apply)",
    input=InputCategoricalList(options=[
        "Factually incorrect information",
        "Unprofessional tone",
        "Doesn't address the question",
        "Too vague or generic",
        "Contains harmful content",
        "No issues identified"
    ]),
    instruction="Select all issues you identify. Choose 'No issues identified' if the response is problem-free."
)

# Expected resolution steps
resolution_schema = schemas.create_label_schema(
    name="expected_resolution",
    type="expectation",
    title="What steps should be included in the ideal resolution?",
    input=InputTextList(max_count=5, max_length_each=200),
    instruction="List the key steps a customer service rep should take to properly resolve this issue."
)

# Confidence in assessment
confidence_schema = schemas.create_label_schema(
    name="assessment_confidence",
    type="feedback",
    title="How confident are you in your assessment?",
    input=InputNumeric(min_value=1, max_value=10),
    instruction="Rate from 1 (not confident) to 10 (very confident)"
)
Medical Information Review
Example for evaluating medical information responses:
import mlflow.genai.label_schemas as schemas
from mlflow.genai.label_schemas import InputCategorical, InputTextList, InputNumeric

# Safety assessment
safety_schema = schemas.create_label_schema(
    name="medical_safety",
    type="feedback",
    title="Is this medical information safe and appropriate?",
    input=InputCategorical(options=[
        "Safe - appropriate general information",
        "Concerning - may mislead patients",
        "Dangerous - could cause harm if followed"
    ]),
    instruction="Assess whether the information could be safely consumed by patients."
)

# Required disclaimers
disclaimers_schema = schemas.create_label_schema(
    name="required_disclaimers",
    type="expectation",
    title="What medical disclaimers should be included?",
    input=InputTextList(max_count=3, max_length_each=300),
    instruction="List disclaimers that should be present (e.g., 'consult your doctor', 'not professional medical advice')."
)

# Accuracy of medical facts
accuracy_schema = schemas.create_label_schema(
    name="medical_accuracy",
    type="feedback",
    title="Rate the factual accuracy of the medical information",
    input=InputNumeric(min_value=0, max_value=100),
    instruction="Score from 0 (completely inaccurate) to 100 (completely accurate)"
)
Integration with Labeling Sessions
Once created, use your schemas in Labeling Sessions:
import mlflow.genai.label_schemas as schemas

# Schemas are automatically available when creating labeling sessions
# The Review App will present questions based on your schema definitions

# Example: Using schemas in a session (conceptual - actual session creation
# happens through the Review App UI or other APIs)
session_schemas = [
    "service_quality",       # Your custom schema
    "response_issues",       # Your custom schema
    schemas.EXPECTED_FACTS   # Built-in schema
]
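If your MLflow version exposes the labeling SDK, sessions can also be created programmatically. The following is a hedged sketch using mlflow.genai.labeling; the exact signature may vary by version, so treat the parameter names as assumptions and check your version's API reference:
from mlflow.genai import labeling

# Create a session that asks reviewers the questions defined by these schemas.
# Assumption: label_schemas accepts a list of schema names (as built above).
session = labeling.create_labeling_session(
    name="customer_service_review",
    label_schemas=session_schemas,
)

# Traces added to the session are then presented to reviewers in the Review App.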
Best Practices
Schema Design
- Clear titles: Write questions as clear, specific prompts
- Helpful instructions: Provide context to guide reviewers
- Appropriate constraints: Set reasonable limits on text length and list counts
- Logical options: For categorical inputs, ensure options are mutually exclusive and comprehensive
Schema Management
- Consistent naming: Use descriptive, consistent names across your schemas
- Version control: When updating schemas, consider the impact on existing sessions
- Clean up: Delete unused schemas to keep your workspace organized
Input Type Selection
- Use InputCategorical for standardized ratings or classifications
- Use InputCategoricalList when multiple issues or features can be present
- Use InputText for detailed explanations or custom feedback
- Use InputTextList for structured lists of items
- Use InputNumeric for precise scoring or confidence ratings
Next Steps
- Label existing traces - Apply your schemas to collect structured feedback
- Create labeling sessions - Organize review workflows using your schemas
- Build evaluation datasets - Transform labeled data into test datasets