# Third-party scorers
MLflow integrates with popular open-source evaluation frameworks so that you can use their specialized metrics as scorers alongside built-in LLM judges and code-based scorers. Third-party scorers plug directly into `mlflow.genai.evaluate()`, giving you access to a broad library of evaluation metrics through a single, unified interface.
## Why use third-party scorers
Third-party scorers are useful when you need:
- Specialized metrics not covered by built-in judges, such as agent plan quality, jailbreak detection, or BLEU/ROUGE text comparison scores.
- Framework-specific strengths from libraries your team already uses, without changing your evaluation workflow.
- Combined evaluation across multiple frameworks in a single `mlflow.genai.evaluate()` call, with results visualized together in the MLflow UI.
## Available integrations
Each integration wraps a third-party framework's metrics as MLflow scorers. Install the framework's package, import the scorer, and pass it to `mlflow.genai.evaluate()`.
| Integration | When to use |
|---|---|
| DeepEval | You need the broadest metric coverage across RAG, agents, conversational AI, and safety. DeepEval offers specialized scorers for agent plan quality, step efficiency, multi-turn conversation completeness, and role adherence that other frameworks don't provide. |
| | You need deep RAG evaluation with fine-grained context metrics (precision, recall, utilization, noise sensitivity), agent goal accuracy, or deterministic text comparison scores like BLEU, ROUGE, and semantic similarity without LLM calls. |
| | You need a lightweight, focused set of scorers for hallucination detection, relevance assessment, toxicity identification, QA correctness, or summarization quality. |
| | You need to analyze agent execution traces with goal-plan-action alignment metrics like logical consistency, execution efficiency, plan adherence, and tool selection. |
| Guardrails | You need rule-based output validation that runs without LLM calls, such as toxicity detection, PII scanning, jailbreak detection, secrets detection, or gibberish identification. |
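To make the deterministic text-comparison metrics concrete, the sketch below computes a minimal ROUGE-1-style recall score (the fraction of reference unigrams that also appear in the candidate) in plain Python. It is an illustration of what such a metric measures, not the implementation any of these frameworks actually uses:

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate.

    Illustrative only: real ROUGE implementations handle tokenization,
    stemming, and multiple reference texts.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped counts: each reference token is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())
```

Because a score like this is pure token counting, it is cheap, fully reproducible, and requires no LLM calls, which is what makes deterministic metrics attractive for regression-style evaluation.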
## Quick example
The following example combines scorers from two different frameworks in a single evaluation:
```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy
from mlflow.genai.scorers.guardrails import ToxicLanguage

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing ML and GenAI workloads.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="databricks:/databricks-gpt-5-mini"),
        ToxicLanguage(threshold=0.7),
    ],
)
```
## When to use third-party vs. built-in scorers
Start with built-in LLM judges for common evaluation needs like correctness, groundedness, and safety. Add third-party scorers in the following situations:
- You already use these libraries in your workflows and want to take advantage of other MLflow features.
- You need metrics for a specific domain that built-in judges don't cover, such as agent step efficiency or conversation completeness.
- You need deterministic, non-LLM evaluation metrics like BLEU scores, exact match, or regex pattern matching.
- You need rule-based validators that run without LLM calls, such as PII detection or secrets scanning.
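To illustrate the last point, a rule-based validator is essentially a deterministic check over the output text. The sketch below uses two hypothetical regex patterns for email and US SSN detection; libraries such as Guardrails ship far more robust, pre-built versions of these checks:

```python
import re

# Hypothetical patterns for illustration only; production validators use
# far more thorough rules (and often trained models) per PII category.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text: str) -> list[str]:
    """Return the names of the PII categories detected in text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Because it is pure pattern matching, a check like this runs without any LLM call and at negligible cost, which is why rule-based validators suit high-volume guardrail scenarios.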