MLflow 3 for GenAI
This page describes how MLflow 3 for GenAI, integrated with the Databricks platform, helps you build production-grade GenAI apps.
Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality.
MLflow 3 solves this by combining AI-powered metrics that reliably measure GenAI quality with comprehensive trace observability, enabling you to measure, improve, and monitor quality throughout your entire application lifecycle.
Agent Evaluation is integrated with Managed MLflow 3. The Agent Evaluation SDK methods are now exposed through the mlflow[databricks]>=3.1 SDK. See the migration guide to update your MLflow 2 and Agent Evaluation code to MLflow 3 SDKs.
Observe and debug GenAI apps with tracing
Tracing lets you see exactly what your GenAI application is doing with comprehensive observability that captures every step of execution.
- One-line instrumentation for 20+ libraries including OpenAI, LangChain, LlamaIndex, Anthropic, and DSPy
- Complete execution visibility - prompts, retrievals, tool calls, responses, latency, and costs
- Production-ready - same instrumentation works in development and production
- OpenTelemetry compatible - export traces anywhere, maintain full data ownership
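import mlflow
from openai import OpenAI  # assumption: the OpenAI SDK; other supported libraries work the same way
# Just add one line to capture everything
mlflow.autolog()
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Your existing code works unchanged
response = client.chat.completions.create(...)
# Traces are automatically captured!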
Automated quality evaluation of GenAI apps
Replace manual testing with automated evaluation using LLM judges that match human expertise and can be applied in both development and production.
Pre-built Judges
- Safety - detect harmful or toxic content
- Hallucination & Groundedness - ensure responses stick to retrieved context
- Relevance - verify responses address user requests
- Correctness - verify responses provide the same facts as ground-truth responses
- Retrieval Quality - measure whether your RAG application retrieves the right information
Custom Judges
- Tailored to your business - create judges that enforce your specific requirements
- Aligned with experts - train judges to match your domain experts' judgment
Turn Production Data into Improvements
Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.
Expert Feedback Collection
- Reviewing and Labeling - business stakeholders and experts can review and provide ratings, corrections, or guidelines on production traces, without writing code
- Live testing - subject matter experts (SMEs) chat with your app and provide instant feedback
Closing the Loop between Development and Production
- Evaluation datasets from production - turn problematic traces into test cases
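One way to picture this step is as a filter over reviewed interactions that keeps only the problematic ones and reshapes them into evaluation rows. The sketch below is plain Python and does not use MLflow's trace schema; `ReviewedInteraction` and its fields are hypothetical, and the output follows the `inputs` / `expectations` row shape assumed for evaluation data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedInteraction:
    """Hypothetical flattened view of a production trace plus its feedback."""

    request: str
    response: str
    thumbs_up: bool
    corrected_response: Optional[str] = None


def to_eval_records(interactions: list) -> list:
    """Turn negatively rated interactions into evaluation rows."""
    records = []
    for item in interactions:
        if item.thumbs_up:
            continue  # Only problematic interactions become test cases.
        record = {"inputs": {"question": item.request}}
        if item.corrected_response is not None:
            # An expert correction becomes the ground truth for this case.
            record["expectations"] = {"expected_response": item.corrected_response}
        records.append(record)
    return records


if __name__ == "__main__":
    sample = [
        ReviewedInteraction("What is the SLA?", "99.9%", thumbs_up=True),
        ReviewedInteraction(
            "Which regions are supported?",
            "Only US.",
            thumbs_up=False,
            corrected_response="US and EU.",
        ),
    ]
    print(to_eval_records(sample))
```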
End-User Feedback
- Collect feedback - capture thumbs up/down and comments programmatically from your deployed app
- Link to traces - debug negative feedback with full execution context
Manage Your GenAI Application Lifecycle
Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management.
Application Versioning
- LoggedModels - track code, parameters, and evaluation metrics for each version
- Full lineage - link traces, evaluations, and feedback to specific versions
Prompt Registry (Coming Soon)
- Centralized management - version and share prompts across your organization
- A/B testing - deploy multiple prompt versions without code changes
- Unity Catalog integration - enterprise governance for your prompts
Enterprise Integration
- Unity Catalog - unified governance for all AI assets
- Data Intelligence - connect your GenAI data to your business data in the Databricks Lakehouse and deliver custom analytics to your business stakeholders
- Mosaic AI Agent Serving - deploy agents to production with scaling and operational rigor
Start Building Better GenAI Applications
Ready to instrument your first application? Our quickstart guides will have you up and running in minutes.
Choose your path:
- Databricks Notebook - Start in a managed environment
- Local IDE - Develop on your machine
Why Teams Choose MLflow 3 for GenAI
Unified Platform
Everything you need in one place - from development debugging to production monitoring.
Open and Flexible
Open-source foundation with no vendor lock-in. Use any LLM provider, any framework.
Enterprise Ready
Built on Databricks' platform with enterprise security, scale, and governance.
Proven Results
Join thousands of organizations building production GenAI applications with MLflow.
Take the first step. Follow our quickstart guide and see your GenAI application's execution in minutes.