MLflow 3 for GenAI
MLflow 3 for GenAI is an open platform that unifies tracking, evaluation, and observability for GenAI apps and agents throughout the development and production lifecycle. It includes real-time trace logging, built-in and custom scorers, incorporation of human feedback, and version tracking to help you efficiently evaluate and improve app quality during development and continue tracking and improving quality in production.
Managed MLflow on Databricks extends open source MLflow with capabilities designed for production GenAI applications, including enterprise-ready governance, fully managed hosting, production-level scaling, and integration with your data in the Databricks lakehouse and Unity Catalog.
For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide. For MLflow 3, the Agent Evaluation SDK methods have been integrated with Databricks-managed MLflow.
For a set of tutorials to get you started, see Get started.
How MLflow 3 helps optimize GenAI app quality
Evaluating GenAI applications and agents is more complex than evaluating traditional software. Inputs and outputs are often free-form text, and many different outputs can be considered correct. Quality depends not only on correctness but also on factors like precision, length, completeness, appropriateness, and other criteria specific to the use case. Because LLMs are inherently non-deterministic, and GenAI agents include additional components such as retrievers and tools, their responses can vary from run to run.
Developers need concrete quality metrics, automated evaluation, and continuous monitoring to build and deploy robust AI apps. MLflow 3 for GenAI provides these key pieces for efficient development, deployment, and continuous improvement:
- Tracing automatically logs inputs, intermediate steps, and outputs, providing the data foundation for evaluation and monitoring (see the sketch after this list).
- Built-in and custom LLM judges and scorers let you define various aspects of quality and customize metrics to your use case.
- Review apps for expert feedback allow you to collect and label datasets for evaluation and to align automated judges and scorers with expert judgement.
- Automated evaluation and monitoring leverage the same judges and scorers during development and production.
- App and prompt versioning allow you to compare versions and track improvements over iterations.
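For example, here is a minimal sketch of instrumenting an app manually with the `@mlflow.trace` decorator; the retrieval and generation functions are hypothetical stand-ins for your own app logic:

```python
import mlflow

@mlflow.trace(span_type="RETRIEVER")
def retrieve_context(question: str) -> list[str]:
    # Placeholder retrieval step; swap in your vector search or document lookup.
    return ["MLflow 3 adds tracing, evaluation, and monitoring for GenAI apps."]

@mlflow.trace(span_type="LLM")
def generate_answer(question: str, context: list[str]) -> str:
    # Placeholder generation step; swap in your LLM call.
    return f"Answer based on {len(context)} retrieved documents."

@mlflow.trace
def answer_question(question: str) -> str:
    # The decorator records inputs, outputs, latency, and nested spans for this call.
    context = retrieve_context(question)
    return generate_answer(question, context)

answer_question("What does MLflow 3 add for GenAI?")
```

Each call produces a trace with nested spans for the retrieval and generation steps, which you can inspect in the MLflow UI and reuse as evaluation data.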
Using MLflow 3 on Databricks, you can bring AI to your data to help you deeply understand and improve quality. Unity Catalog provides consistent governance for prompts, apps, and traces. With any model or framework, MLflow supports you throughout the development loop and all the way into production.
Get started
Start building better GenAI applications with comprehensive observability and evaluation tools.
| Task | Description |
|---|---|
| | Get up and running in minutes with step-by-step instructions for instrumenting your first application with tracing, running evaluation, and collecting human feedback. |
| | Instrument a simple GenAI app to automatically capture detailed traces for debugging and optimization. |
| | Steps you through evaluating an email generation app that uses Retrieval-Augmented Generation (RAG). |
| | Collect end-user feedback, add developer annotations, create expert review sessions, and use that feedback to evaluate your GenAI app's quality. |
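To give a flavor of that last workflow, the sketch below attaches an end-user rating to a previously logged trace using MLflow 3's feedback API; the trace ID and reviewer identity are placeholders:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach an end-user rating to a trace that your app already logged.
mlflow.log_feedback(
    trace_id="<trace-id-from-your-app>",  # placeholder: ID of the trace being rated
    name="user_satisfaction",
    value=True,
    rationale="The generated email matched the requested tone.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",  # placeholder reviewer identity
    ),
)
```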
Tracing
MLflow Tracing provides observability and logs the trace data required for evaluation and monitoring.
| Feature | Description |
|---|---|
| | End-to-end observability for GenAI applications, including complex agent-based systems. Track inputs, outputs, intermediate steps, and metadata for a complete picture of how your app behaves. |
| | Introduction to tracing concepts. |
| | Complete execution visibility lets you capture prompts, retrievals, tool calls, responses, latency, and costs. |
| | Use the same instrumentation in development and production environments for consistent evaluation. |
| | Analyze traces to identify quality issues, create evaluation datasets from trace data, make targeted improvements, and measure the impact of your changes. |
| | MLflow Tracing integrates with many libraries and frameworks for automatic tracing, giving you immediate observability into your GenAI applications with minimal setup. |
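As an example of those automatic tracing integrations, the sketch below assumes the OpenAI integration and an `OPENAI_API_KEY` in your environment; other supported libraries (for example LangChain or LlamaIndex) expose similar `autolog` calls:

```python
import mlflow
from openai import OpenAI

# Enable automatic tracing for all OpenAI calls made by this process.
mlflow.openai.autolog()

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[{"role": "user", "content": "Summarize MLflow Tracing in one sentence."}],
)
print(response.choices[0].message.content)
```

After the `autolog` call, each request is logged as a trace without further code changes.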
Evaluation and monitoring
Replace manual testing with automated evaluation using built-in and custom LLM judges and scorers that match human expertise and can be applied in both development and production. Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.
| Feature | Description |
|---|---|
| | Overview of evaluating and monitoring agents using MLflow 3 on Databricks. |
| | MLflow 3 includes built-in LLM judges for safety, relevance, correctness, retrieval quality, and more. You can also create custom LLM judges and code-based scorers for your specific business requirements. |
| | Run evaluation during development or as part of a release process. |
| | Continuously monitor a sample of production traffic using LLM judges and scorers. |
| | Collect and use feedback from domain experts and end users, during development and in production, for continuous improvement. |
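To illustrate, here is a hedged sketch of an evaluation run that combines built-in LLM judges with a custom code-based scorer; the dataset and app function are placeholders, and the built-in judges assume a configured judge model (the Databricks-hosted default when run on Databricks):

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety, scorer

# A custom code-based scorer: flag responses that run too long.
@scorer
def is_concise(outputs) -> bool:
    return len(str(outputs)) < 1000

# A tiny evaluation dataset; in practice, build this from traces or expert labels.
eval_data = [
    {
        "inputs": {"question": "What does MLflow Tracing capture?"},
        "expectations": {
            "expected_facts": ["inputs", "outputs", "intermediate steps"],
        },
    },
]

def my_app(question: str) -> str:
    # Placeholder predict function; call your real app or agent here.
    return "MLflow Tracing captures inputs, outputs, and intermediate steps."

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness(), RelevanceToQuery(), Safety(), is_concise],
)
```

The same scorers can then be attached to production monitoring, so development and production share a single definition of quality.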
Manage the GenAI app lifecycle
Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.
| Feature | Description |
|---|---|
| | Track code, parameters, and evaluation metrics for each version. |
| | Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration. |
| Enterprise integration | Unity Catalog: unified governance for all AI assets with enterprise security, access control, and compliance features. Data intelligence: connect your GenAI data to your business data in the Databricks lakehouse and deliver custom analytics to your business stakeholders. Mosaic AI Agent Serving: deploy agents to production with scaling and operational rigor. |
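As a sketch of prompt versioning with the prompt registry (the prompt name and template here are hypothetical; on Databricks, prompts are registered under a Unity Catalog name such as `catalog.schema.prompt_name`):

```python
import mlflow

# Register a new version of a prompt in the prompt registry.
prompt = mlflow.genai.register_prompt(
    name="summarization_prompt",  # hypothetical name; use a Unity Catalog path on Databricks
    template="Summarize the following text in {{ num_sentences }} sentences:\n\n{{ text }}",
    commit_message="Initial version",
)

# Load a pinned version later and fill in the template variables.
loaded = mlflow.genai.load_prompt(f"prompts:/summarization_prompt/{prompt.version}")
print(loaded.format(num_sentences=2, text="MLflow 3 unifies tracing and evaluation for GenAI apps."))
```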