MLflow 3 for GenAI

This page describes how MLflow 3 for GenAI, integrated with the Databricks platform, helps you build production-grade GenAI apps.

Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality.

MLflow 3 solves this by combining AI-powered metrics that reliably measure GenAI quality with comprehensive trace observability, enabling you to measure, improve, and monitor quality throughout your entire application lifecycle.

Note

Agent Evaluation is integrated with Managed MLflow 3. The Agent Evaluation SDK methods are now exposed through the mlflow[databricks]>=3.1 SDK. See the migration guide to update your MLflow 2 and Agent Evaluation code to MLflow 3 SDKs.

Observe and debug GenAI apps with tracing

Tracing records every step of your GenAI application's execution, so you can see exactly what the application is doing and debug it.

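Python
import mlflow

# Add one line to enable automatic tracing for supported libraries
mlflow.autolog()

# Your existing code works unchanged
# (client is an already-configured LLM client, for example an OpenAI client)
response = client.chat.completions.create(...)
# The call above is captured as a trace automatically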

Automated quality evaluation of GenAI apps

Replace manual testing with automated evaluation using LLM judges that can be aligned with human expert judgment and applied in both development and production.

Pre-built Judges

  • Safety - detect harmful or toxic content
  • Hallucination & Groundedness - ensure responses stick to retrieved context
  • Relevance - verify responses address user requests
  • Correctness - verify responses provide the same facts as ground-truth responses
  • Retrieval Quality - measure if your RAG finds the right information

Custom Judges

  • Tailored to your business - create judges that enforce your specific requirements
  • Aligned with experts - train judges to match your domain experts' judgment

Turn Production Data into Improvements

Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.

Expert Feedback Collection

  • Reviewing and Labeling - business stakeholders and experts can review and provide ratings, corrections, or guidelines on production traces, without writing code
  • Live testing - subject matter experts (SMEs) chat with your app and provide instant feedback

Closing the Loop between Development and Production

End-User Feedback

  • Collect feedback - capture thumbs up/down and comments programmatically from your deployed app
  • Link to traces - debug negative feedback with full execution context

Manage Your GenAI Application Lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management.

Application Versioning

  • LoggedModels - track code, parameters, and evaluation metrics for each version
  • Full lineage - link traces, evaluations, and feedback to specific versions

Prompt Registry (Coming Soon)

  • Centralized management - version and share prompts across your organization
  • A/B testing - deploy multiple prompt versions without code changes
  • Unity Catalog integration - enterprise governance for your prompts

Enterprise Integration

  • Unity Catalog - unified governance for all AI assets
  • Data Intelligence - connect your GenAI data to your business data in the Databricks Lakehouse and deliver custom analytics to your business stakeholders
  • Mosaic AI Agent Serving - deploy agents to production with scaling and operational rigor

Start Building Better GenAI Applications

Quick Start

Ready to instrument your first application? Our quickstart guides will have you up and running in minutes.

Why Teams Choose MLflow 3 for GenAI

Unified Platform
Everything you need in one place - from development debugging to production monitoring.

Open and Flexible
Open-source foundation with no vendor lock-in. Use any LLM provider, any framework.

Enterprise Ready
Built on Databricks' platform with enterprise security, scale, and governance.

Proven Results
Join thousands of organizations building production GenAI applications with MLflow.


Take the first step. Follow our quickstart guide and see your GenAI application's execution in minutes.