Third-party scorers

MLflow integrates with popular open-source evaluation frameworks so that you can use their specialized metrics as scorers alongside built-in LLM judges and code-based scorers. Third-party scorers plug directly into mlflow.genai.evaluate(), giving you access to a broad library of evaluation metrics through a single, unified interface.

Why use third-party scorers

Third-party scorers are useful when you need:

  • Specialized metrics not covered by built-in judges, such as agent plan quality, jailbreak detection, or BLEU/ROUGE text comparison scores.
  • Framework-specific strengths from libraries your team already uses, without changing your evaluation workflow.
  • Combined evaluation across multiple frameworks in a single mlflow.genai.evaluate() call, with results visualized together in the MLflow UI.

Available integrations

Each integration wraps a third-party framework's metrics as MLflow scorers. Install the framework's package, import the scorer, and pass it to mlflow.genai.evaluate().
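For example, to try the DeepEval and Guardrails AI integrations used in the quick example, you would first install those frameworks. The package names below are the common PyPI names as of this writing; verify them against each project's documentation:

```shell
# Install the evaluation frameworks you plan to use
pip install deepeval        # DeepEval scorers
pip install guardrails-ai   # Guardrails AI scorers
```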

  • DeepEval scorers: use when you need the broadest metric coverage across RAG, agents, conversational AI, and safety. DeepEval offers specialized scorers for agent plan quality, step efficiency, multi-turn conversation completeness, and role adherence that other frameworks don't provide.
  • RAGAS scorers: use when you need deep RAG evaluation with fine-grained context metrics (precision, recall, utilization, noise sensitivity), agent goal accuracy, or deterministic text comparison scores like BLEU, ROUGE, and semantic similarity without LLM calls.
  • Arize Phoenix scorers: use when you need a lightweight, focused set of scorers for hallucination detection, relevance assessment, toxicity identification, QA correctness, or summarization quality.
  • TruLens scorers: use when you need to analyze agent execution traces with goal-plan-action alignment metrics like logical consistency, execution efficiency, plan adherence, and tool selection.
  • Guardrails AI scorers: use when you need rule-based output validation that runs without LLM calls, such as toxicity detection, PII scanning, jailbreak detection, secrets detection, or gibberish identification.

Quick example

The following example combines scorers from two different frameworks in a single evaluation:

Python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy
from mlflow.genai.scorers.guardrails import ToxicLanguage

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing ML and GenAI workloads.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # LLM-judged relevance metric from DeepEval
        AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini"),
        # Rule-based toxicity check from Guardrails AI (no LLM call)
        ToxicLanguage(threshold=0.7),
    ],
)

When to use third-party vs. built-in scorers

Start with built-in LLM judges for common evaluation needs like correctness, groundedness, and safety. Add third-party scorers in the following situations:

  • You already use these libraries and want their metrics tracked and visualized alongside your other MLflow evaluation results.
  • You need metrics for a specific domain that built-in judges don't cover, such as agent step efficiency or conversation completeness.
  • You need deterministic, non-LLM evaluation metrics like BLEU scores, exact match, or regex pattern matching.
  • You need rule-based validators that run without LLM calls, such as PII detection or secrets scanning.
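To make the last two bullets concrete, a deterministic check is just ordinary code with no model call. The sketch below is a hypothetical standalone example (not an MLflow or third-party API) showing the kind of logic such validators encode: an exact-match comparison and a naive regex scan for email-style PII.

```python
import re

# Naive email pattern: illustrative only; real PII scanners are far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def exact_match(output: str, expected: str) -> bool:
    """Deterministic pass/fail: output must equal the expected string."""
    return output.strip() == expected.strip()


def contains_email(output: str) -> bool:
    """Rule-based PII check: flag outputs containing an email-like token."""
    return bool(EMAIL_RE.search(output))


print(exact_match("MLflow", "MLflow "))           # True: whitespace is stripped
print(contains_email("Contact bob@example.com"))  # True: email-like token found
```

Because checks like these are pure functions of the output string, they run instantly, cost nothing, and return the same result every time, which is why rule-based scorers complement rather than replace LLM judges.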