
MLflow 3 for GenAI

This page describes how MLflow 3 for GenAI, integrated with the Databricks platform, helps you build production-grade GenAI apps.

Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality. MLflow 3 solves this by combining AI-powered metrics that reliably measure GenAI quality with comprehensive trace observability, enabling you to measure, improve, and monitor quality throughout the entire application lifecycle.

When you use MLflow 3 for GenAI on Databricks, you get all of the advantages of the Databricks platform, including the following:

  • Unified platform. The entire GenAI development process in one place, from development debugging to production monitoring.
  • Open and flexible. Use any LLM provider and any framework.
  • Enterprise-ready. The Databricks platform provides enterprise security, scale, and governance.

Agent Evaluation SDK methods are integrated with Databricks-managed MLflow 3. For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide.

For a set of tutorials to get you started, see Get started with MLflow 3 for GenAI.

note

Open source telemetry collection was introduced in MLflow 3.2.0 and is disabled on Databricks by default. For more details, see the MLflow usage tracking documentation.

Observe and debug GenAI apps with tracing

See exactly what your GenAI application is doing with comprehensive observability that captures every step of execution. You need only add a single line of code, and MLflow Tracing captures all prompts, retrievals, tool calls, responses, latency, and token counts throughout your application.

Python
import mlflow
from openai import OpenAI

# Just add one line to capture everything
mlflow.openai.autolog()

# Your existing code works unchanged
client = OpenAI()
response = client.chat.completions.create(...)
# Traces are automatically captured!


  • Automatic instrumentation. One-line instrumentation for 20+ libraries, including OpenAI, LangChain, LlamaIndex, Anthropic, and DSPy.
  • Review your app's behavior and performance. Complete execution visibility allows you to capture prompts, retrievals, tool calls, responses, latency, and costs.
  • Production observability. Use the same instrumentation in development and production environments for consistent evaluation.
  • OpenTelemetry compatibility. Export traces anywhere while maintaining full data ownership and integration flexibility.
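To illustrate the kind of data a trace span records, here is a minimal, purely illustrative sketch. It is not the MLflow API (autologging and `mlflow.trace` do this for you); the decorator, `TRACE_LOG`, and the toy `retrieve`/`answer` functions are all hypothetical stand-ins for an instrumented app.

```python
import functools
import time

# Toy span log: each entry mimics the data MLflow Tracing captures per step.
TRACE_LOG = []

def traced(fn):
    """Illustrative decorator recording inputs, output, and latency per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "span": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traced
def retrieve(query):
    return ["doc-1", "doc-2"]  # stand-in for a retrieval step

@traced
def answer(query):
    docs = retrieve(query)
    return f"Answer based on {len(docs)} documents"

print(answer("What is MLflow?"))  # prints "Answer based on 2 documents"
# TRACE_LOG now holds two nested spans: retrieve, then answer.
```

In the real system, spans also nest into a tree and carry token counts; the point here is only that every step's inputs, outputs, and timing are recorded without changing application logic.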

Automated quality evaluation of GenAI apps

Replace manual testing with automated evaluation using built-in and custom LLM-based scorers that match human expertise and can be applied in both development and production.

  • Built-in scorers. Ready-to-use scorers that assess safety, hallucinations, relevance, correctness, and retrieval quality.
  • Custom scorers. Create tailored judges that enforce your specific business requirements and align with domain expert judgment.
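As a sketch of the logic a custom scorer might implement, here is a deterministic heuristic. This is not the MLflow scorer API: a real custom scorer is registered with MLflow and often delegates to an LLM judge, while this hypothetical `keyword_grounding_scorer` just checks which retrieved documents a response actually references.

```python
def keyword_grounding_scorer(response: str, retrieved_docs: list[str]) -> float:
    """Fraction of retrieved documents mentioned in the response.

    A crude proxy for groundedness: 1.0 means every document was referenced.
    """
    if not retrieved_docs:
        return 0.0
    mentioned = sum(1 for doc in retrieved_docs if doc.lower() in response.lower())
    return mentioned / len(retrieved_docs)

score = keyword_grounding_scorer(
    "According to policy-a, refunds take 5 days.",
    ["policy-a", "policy-b"],
)
print(score)  # 0.5 -- only one of the two documents is referenced
```

The same shape scales up: any function that maps a request/response pair (plus context) to a score can be run over evaluation datasets in development and over live traces in production.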

Turn production data into improvements

Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.


  • Expert feedback collection. The Review App provides a structured process and UI for collecting domain expert feedback, including ratings, corrections, and guidelines, on real interactions with your application.
  • Live app testing. Subject matter experts can chat with your app and provide instant feedback for continuous improvement.
  • Evaluation datasets from production. Evaluation datasets enable consistent, repeatable evaluation. Problematic production traces become test cases for continuous improvement and regression testing.
  • User feedback collection. Capture and link user feedback to specific traces for debugging and quality improvement insights. Collect thumbs up/down and comments programmatically from your deployed application.
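The flow from user feedback to an evaluation dataset can be sketched as follows. This is purely illustrative: the field names and in-memory lists are hypothetical, whereas in MLflow 3 feedback is attached to real traces and evaluation datasets are managed by the platform.

```python
# Hypothetical production traces and thumbs up/down feedback keyed by trace ID.
production_traces = [
    {"trace_id": "t1", "request": "Reset my password", "response": "Click 'Forgot password'."},
    {"trace_id": "t2", "request": "Cancel my order", "response": "I cannot help with that."},
]
user_feedback = [
    {"trace_id": "t1", "thumbs_up": True},
    {"trace_id": "t2", "thumbs_up": False, "comment": "Unhelpful answer"},
]

# Join feedback onto traces, then keep the problematic ones as test cases.
feedback_by_id = {f["trace_id"]: f for f in user_feedback}
eval_dataset = [
    {
        "inputs": t["request"],
        "bad_output": t["response"],
        "comment": feedback_by_id[t["trace_id"]].get("comment", ""),
    }
    for t in production_traces
    if not feedback_by_id.get(t["trace_id"], {}).get("thumbs_up", True)
]
print(eval_dataset)  # only the thumbs-down interaction becomes a test case
```

Because each feedback record is linked to a trace ID, every negative interaction carries its full execution context, which is what makes it usable as a regression test rather than just a complaint.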

Evaluate and improve quality with traces

Analyze traces to identify quality issues, create evaluation datasets from trace data, implement targeted improvements, and measure the impact of your changes.

Manage your GenAI application lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.

  • Application versioning. Track code, parameters, and evaluation metrics for each version.
  • Production trace linking. Link traces, evaluations, and feedback to specific application versions.
  • Prompt Registry. Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration.

Enterprise integration

  • Unity Catalog. Unified governance for all AI assets with enterprise security, access control, and compliance features.
  • Data intelligence. Connect your GenAI data to your business data in the Databricks Lakehouse and deliver custom analytics to your business stakeholders.
  • Mosaic AI Agent Serving. Deploy agents to production with scaling and operational rigor.

Get started with MLflow 3 for GenAI

Start building better GenAI applications with comprehensive observability and evaluation tools.

  • Quick start guide. Get up and running in minutes with step-by-step instructions for instrumenting your first application.
  • Databricks Notebook setup. Start in a managed environment with pre-configured dependencies and instant access to MLflow 3 features.
  • Local IDE development. Develop on your local machine with full MLflow 3 capabilities and seamless cloud integration.
  • Data Intelligence integration. Connect your GenAI data to business data in the Databricks Lakehouse for custom analytics and insights.