MLflow 3 for GenAI

MLflow 3 for GenAI is an open platform that unifies tracking, evaluation, and observability for GenAI apps and agents throughout the development and production lifecycle. It includes real-time trace logging, built-in and custom scorers, human feedback incorporation, and version tracking to help you efficiently evaluate and improve app quality during development, and to continue tracking and improving quality in production.

Managed MLflow on Databricks extends open source MLflow with capabilities designed for production GenAI applications, including enterprise-ready governance, fully managed hosting, production-level scaling, and integration with your data in the Databricks lakehouse and Unity Catalog.

For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide. For MLflow 3, the Agent Evaluation SDK methods have been integrated into Databricks-managed MLflow.

For a set of tutorials to get you started, see Get started.

How MLflow 3 helps optimize GenAI app quality

Evaluating GenAI applications and agents is more complex than evaluating traditional software. Inputs and outputs are often free-form text, and many different outputs can be considered correct. Quality depends not only on correctness but also on factors like precision, length, completeness, appropriateness, and other criteria specific to the use case. Because LLMs are inherently non-deterministic, and GenAI agents include additional components such as retrievers and tools, their responses can vary from run to run.

Developers need concrete quality metrics, automated evaluation, and continuous monitoring to build and deploy robust AI apps. MLflow 3 for GenAI provides these capabilities for efficient development, deployment, and continuous improvement.

With MLflow 3 on Databricks, you bring AI to your data to deeply understand and improve quality. Unity Catalog provides consistent governance for prompts, apps, and traces, and MLflow works with any model or framework to support you throughout the development loop and into production.

Get started

Start building better GenAI applications with comprehensive observability and evaluation tools.

Quick start guide

Get up and running in minutes with step-by-step instructions for instrumenting your first application with tracing, running evaluation, and collecting human feedback.

Get started: Tracing a GenAI app

Instrument a simple GenAI app to automatically capture detailed traces for debugging and optimization.

Tutorial: Evaluate and improve a GenAI application

Steps you through evaluating an email generation app that uses Retrieval-Augmented Generation (RAG).

10-minute demo: Collect human feedback

Collect end-user feedback, add developer annotations, create expert review sessions, and use that feedback to evaluate your GenAI app's quality.

Tracing

MLflow Tracing provides observability and logs the trace data required for evaluation and monitoring.
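For example, a minimal sketch of instrumenting an app with the tracing decorator might look like the following; it assumes an MLflow 3 environment, and the experiment name, function, and prompt are illustrative placeholders rather than a prescribed setup.

```python
import mlflow

# Send traces to an MLflow experiment (on Databricks, an experiment in your workspace).
mlflow.set_experiment("/Shared/genai-quickstart")  # illustrative experiment name

@mlflow.trace  # captures inputs, outputs, and latency for each call as a trace
def generate_answer(question: str) -> str:
    # Placeholder app logic; a real app would call an LLM, retriever, or agent here.
    prompt = f"Answer concisely: {question}"
    return f"(model response to: {prompt})"

generate_answer("What does MLflow Tracing capture?")
```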

MLflow Tracing

End-to-end observability for GenAI applications, including complex agent-based systems. Track inputs, outputs, intermediate steps, and metadata for a complete picture of how your app behaves.

What is tracing?

Introduction to tracing concepts.

Review your app's behavior and performance

Capture prompts, retrievals, tool calls, responses, latency, and costs for complete visibility into each execution.

Production observability

Use the same instrumentation in development and production environments for consistent evaluation.

Use traces to evaluate and improve quality

Analyze traces to identify quality issues, create evaluation datasets from trace data, make targeted improvements, and measure the impact of your changes.

Tracing integrations

MLflow Tracing integrates with many popular libraries and frameworks to provide automatic tracing, giving you immediate observability into your GenAI applications with minimal setup.
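As a sketch of one such integration, the following assumes the OpenAI autologging integration is available in your MLflow 3 environment; the experiment name, model name, and prompt are illustrative.

```python
import mlflow
from openai import OpenAI

# One line enables automatic tracing for OpenAI calls; other integrations
# follow the same pattern (for example, mlflow.langchain.autolog()).
mlflow.openai.autolog()
mlflow.set_experiment("/Shared/genai-tracing-demo")  # illustrative experiment name

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize MLflow Tracing in one sentence."}],
)
print(response.choices[0].message.content)
# The request, response, latency, and token usage are logged as a trace.
```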

Evaluation and monitoring

Replace manual testing with automated evaluation using built-in and custom LLM judges and scorers that match human expertise and can be applied in both development and production. Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.
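As an illustration, a hedged sketch of running automated evaluation with built-in scorers might look like the following. It assumes the `mlflow.genai` evaluation API and the built-in `RelevanceToQuery` and `Safety` scorers are available in your MLflow 3 version; the app function and dataset are hypothetical.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Hypothetical app under test; predict_fn receives the fields of each "inputs" dict.
def my_app(question: str) -> str:
    return f"(answer to: {question})"

# A tiny evaluation dataset; in practice this is often built from production traces.
eval_data = [
    {"inputs": {"question": "How do I create a Delta table?"}},
    {"inputs": {"question": "What is Unity Catalog?"}},
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[RelevanceToQuery(), Safety()],  # built-in LLM judges
)
print(results.metrics)
```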

Evaluate and monitor GenAI agents

Overview of evaluating and monitoring agents using MLflow 3 on Databricks.

LLM judges and scorers

MLflow 3 includes built-in LLM judges for safety, relevance, correctness, retrieval quality, and more. You can also create custom LLM judges and code-based scorers for your specific business requirements.

Evaluation

Run evaluation during development or as part of a release process.

Production monitoring

Continuously monitor a sample of production traffic using LLM judges and scorers.

Collect human feedback

Collect and use feedback from domain experts and end users during both development and production for continuous improvement, as in the sketch below.
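For instance, a minimal sketch of attaching end-user feedback to a trace, assuming MLflow 3's feedback logging API; the trace ID, feedback name, and reviewer identifier are illustrative.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach end-user feedback to an existing trace. The trace ID would normally
# come from the traced request (for example, mlflow.get_last_active_trace_id()).
mlflow.log_feedback(
    trace_id="tr-1234567890abcdef",  # illustrative trace ID
    name="user_satisfaction",        # illustrative feedback name
    value=True,                      # for example, a thumbs-up from the end user
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="user@example.com",  # illustrative reviewer identifier
    ),
    rationale="Answer was accurate and concise.",
)
```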

Manage the GenAI app lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.

Application versioning

Track code, parameters, and evaluation metrics for each version.

Prompt Registry

Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration (see the sketch after this table).

Enterprise integration

Unity Catalog. Unified governance for all AI assets with enterprise security, access control, and compliance features.

Data intelligence. Connect your GenAI data to your business data in the Databricks lakehouse and deliver custom analytics to your business stakeholders.

Mosaic AI Agent Serving. Deploy agents to production with scaling and operational rigor.
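As a sketch of the Prompt Registry workflow referenced above, the following assumes the `mlflow.genai` prompt APIs in MLflow 3; the prompt name, Unity Catalog location, and template are illustrative.

```python
import mlflow

# Register a prompt version. On Databricks, prompts are governed by Unity Catalog,
# so the name is a three-level catalog.schema.prompt path (illustrative here).
prompt = mlflow.genai.register_prompt(
    name="main.default.summary_prompt",
    template="Summarize the following text in {{ num_sentences }} sentences: {{ text }}",
    commit_message="Initial version",
)

# Later, load a specific version and fill in its template variables.
loaded = mlflow.genai.load_prompt(f"prompts:/main.default.summary_prompt/{prompt.version}")
rendered = loaded.format(num_sentences=2, text="MLflow 3 unifies tracing and evaluation.")
print(rendered)
```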