Key challenges in developing GenAI apps and how MLflow helps

MLflow is built to address the fundamental challenge of delivering production-ready GenAI apps: it is difficult to build apps that reliably return high-quality (accurate) responses at optimal cost and latency.

GenAI apps don't behave (or fail) like regular software. They can hallucinate, their quality can drift as data changes, and real users phrase the same intent in endless new ways, so the input space is vast and always in flux. Traditional software and ML testing approaches, designed for known sets of fixed inputs/outputs and known user actions, can't reliably measure the quality of GenAI's free-form, ever-shifting language inputs and outputs.

To address these fundamental challenges, MLflow combines metrics that reliably measure GenAI quality, operational observability into latency and cost, and workflows that make it easy to collect feedback from human experts.

User inputs are free-form, plain language

The challenge

A single intent can be phrased dozens of ways—your app must recognize them all.

Consider a chatbot that helps answer customer support queries. The following requests express the same intent, even though the words are different:

  • "My Wi-Fi keeps dropping—please fix it."
  • "Can you help? The internet here is dead."

How MLflow helps

MLflow's LLM judges score intent, tone, and factuality rather than exact strings, so different phrasings of the same request are evaluated by meaning. This semantic evaluation lets you measure quality across the many ways users express the same request.
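
As a minimal sketch, assuming MLflow's mlflow.evaluate API with the built-in answer_relevance judge (the judge model URI, column names, and app replies below are illustrative), two phrasings of the same intent can be scored by meaning:

    import mlflow
    import pandas as pd
    from mlflow.metrics.genai import answer_relevance

    # Two phrasings of the same "internet outage" intent, plus the app's replies
    # (all data here is illustrative).
    eval_df = pd.DataFrame(
        {
            "inputs": [
                "My Wi-Fi keeps dropping - please fix it.",
                "Can you help? The internet here is dead.",
            ],
            "outputs": [
                "Let's power-cycle your modem: unplug it for 30 seconds.",
                "Try restarting your modem by unplugging it for half a minute.",
            ],
        }
    )

    # LLM judge that grades whether each reply addresses the meaning of the request.
    relevance = answer_relevance(model="openai:/gpt-4o")  # judge model is an assumption

    with mlflow.start_run():
        results = mlflow.evaluate(
            data=eval_df,
            predictions="outputs",
            extra_metrics=[relevance],
        )
        print(results.metrics)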

Tracing captures complete conversations including all input variations, giving you visibility into how users actually phrase requests. This comprehensive observability helps you understand the full range of user inputs your app encounters.
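
A minimal tracing sketch (the experiment name, function, and replies are illustrative), where decorating the app's entry point records every real-world phrasing as a trace:

    import mlflow

    mlflow.set_experiment("support-bot")  # experiment name is an assumption

    @mlflow.trace
    def handle_request(user_message: str) -> str:
        # Your retrieval / agent / LLM logic goes here; the reply is a placeholder.
        return "Let's try restarting your modem first."

    # Every call is captured as a trace, so all real-world phrasings of the same
    # intent are visible in the experiment's Traces view.
    handle_request("My Wi-Fi keeps dropping - please fix it.")
    handle_request("Can you help? The internet here is dead.")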

User inputs evolve over time

The challenge

Popular intents shift over time, even if your code hasn't changed.

You designed your app to help with the "internet outage" intent, but didn't predict that users also ask, "Will I get a bill credit for the problem?"

How MLflow helps

MLflow's Evaluation Datasets let you capture production traces into offline test sets, so new intents (like bill-credit questions) automatically become test and regression cases. This helps your app keep up with emerging user needs.
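
A sketch of that workflow, assuming the pandas DataFrame returned by mlflow.search_traces exposes request and response columns (the experiment ID, filter, and file name are placeholders):

    import mlflow

    # Pull recent production traces from the app's experiment (the experiment ID
    # is a placeholder).
    traces = mlflow.search_traces(
        experiment_ids=["<production-experiment-id>"],
        max_results=500,
    )

    # Curate the newly observed bill-credit questions into an offline regression set.
    bill_credit = traces[traces["request"].str.contains("credit", case=False, na=False)]
    regression_df = bill_credit[["request", "response"]].rename(
        columns={"request": "inputs", "response": "outputs"}
    )
    regression_df.to_json("bill_credit_regression_set.json", orient="records")

The resulting dataset can then be re-run through evaluation on every new app version as a regression suite.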

Production monitoring continuously tracks query patterns and identifies new types of requests. By analyzing real traffic, you can proactively adapt your app to evolving user behavior before quality degrades.

GenAI outputs are free-form, plain language

The challenge

Two differently worded answers may both be correct, so quality checks must compare meaning, not strings.

The following answers mean the same thing, even though the words are completely different:

  • "Please power-cycle your modem by unplugging it for 30 seconds."
  • "Try turning the router off for half a minute, then plug it back in."

How MLflow helps

MLflow's LLM judges assess meaning rather than exact text matches. When evaluating responses, judges understand that "half a minute" equals "30 seconds" and that "power-cycle" and "turn off and on" are equivalent instructions.
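
For instance, a minimal sketch with the built-in answer_similarity judge (the judge model URI and column names are assumptions) grades the two answers above against each other by meaning rather than by string match:

    import mlflow
    import pandas as pd
    from mlflow.metrics.genai import answer_similarity

    eval_df = pd.DataFrame(
        {
            "inputs": ["My Wi-Fi keeps dropping - what should I do?"],
            "ground_truth": [
                "Please power-cycle your modem by unplugging it for 30 seconds."
            ],
            "outputs": [
                "Try turning the router off for half a minute, then plug it back in."
            ],
        }
    )

    # LLM judge that compares meaning, not strings (judge model is an assumption).
    similarity = answer_similarity(model="openai:/gpt-4o")

    with mlflow.start_run():
        results = mlflow.evaluate(
            data=eval_df,
            targets="ground_truth",
            predictions="outputs",
            extra_metrics=[similarity],
        )
        print(results.metrics)  # e.g. an aggregate answer_similarity score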

The same quality checks run consistently across development, CI/CD, and production. This consistency means scores measured in development are directly comparable to those observed in production, regardless of wording variations.

Domain expertise is required to assess quality

The challenge

Developers often lack the subject-matter depth to judge correctness; specialist review is needed.

To determine whether an answer is correct, you need an expert to verify that telling users to press the reset pin is safe for their modem model. Technical correctness requires domain knowledge that engineering teams may not possess.

How MLflow helps

MLflow's Review App surfaces complete conversations so domain experts can spot issues fast. The intuitive interface allows non-technical experts to review app outputs without needing to understand code or complex tooling.

You can scale domain expert feedback by using expert labels from a handful of traces to create custom LLM judges. These judges learn from expert assessments, allowing you to automatically evaluate the quality of iterations and production traffic without requiring human review for every response.
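
A sketch of turning an expert label into a custom judge with make_genai_metric (the metric name, example content, score, and judge model are illustrative):

    from mlflow.metrics.genai import EvaluationExample, make_genai_metric

    # An expert-labeled trace turned into a few-shot grading example
    # (content, score, and justification are illustrative).
    expert_example = EvaluationExample(
        input="Is it safe to press the reset pin on my modem?",
        output="Yes, press the reset pin; it restores factory settings.",
        score=2,
        justification="Expert note: a factory reset wipes ISP provisioning on this modem model.",
    )

    technical_correctness = make_genai_metric(
        name="technical_correctness",
        definition=(
            "Whether the troubleshooting advice is technically correct and safe "
            "for the user's device."
        ),
        grading_prompt=(
            "Score 1-5, where 5 means the advice is correct and safe for the "
            "referenced device and 1 means it is wrong or unsafe."
        ),
        examples=[expert_example],
        model="openai:/gpt-4o",  # judge model is an assumption
        greater_is_better=True,
    )

Once defined, this metric can be passed to mlflow.evaluate via extra_metrics, so expert judgment is applied automatically to new iterations and production traffic.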

Managing the Quality ↔ Latency ↔ Cost trade-off

The challenge

Faster, cheaper models save time and money, but can lower answer quality—each tweak must balance all three.

Switching from GPT-4o to GPT-4o-mini significantly reduces latency and cost, but the smaller model might miss the nuance in bill-credit questions, lowering answer quality.

How MLflow helps

MLflow enables you to run many evaluations quickly to explore variants at scale. Side-by-side experiments expose quality, latency, and cost deltas before rollout, helping you make informed decisions about model selection.
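
As a sketch, the same evaluation can be run against several candidate models and compared side by side in the MLflow UI (build_app is a hypothetical factory that returns a predict function, and eval_df and similarity are the dataset and judge metric from the earlier sketches):

    import mlflow

    for model_name in ["gpt-4o", "gpt-4o-mini"]:
        with mlflow.start_run(run_name=f"support-bot-{model_name}"):
            mlflow.log_param("model", model_name)
            mlflow.evaluate(
                model=build_app(model_name),  # hypothetical: returns fn(pandas.DataFrame) -> predictions
                data=eval_df,
                targets="ground_truth",
                extra_metrics=[similarity],
            )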

Tracing provides end-to-end observability of app performance, capturing latency and cost metrics alongside quality assessments. This unified view lets you optimize across all three dimensions at once and make data-driven trade-offs that align with your business needs.
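
For example, a sketch of attaching token counts (a cost proxy) to a traced span; latency is recorded automatically for each span, while the span type, attribute names, and numbers here are assumptions:

    import mlflow

    @mlflow.trace(span_type="LLM")
    def answer(question: str) -> str:
        # Call your LLM here; the reply and token counts are placeholders.
        reply = "Try power-cycling your modem."
        span = mlflow.get_current_active_span()
        span.set_attributes({"prompt_tokens": 412, "completion_tokens": 58})
        return reply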

The Evaluation UI allows you to compare different app versions side-by-side, visualizing how changes in models, prompts, or code affect quality scores, response times, and operational costs. This comparison helps you choose the best configuration for your use case before you deploy.

Next Steps