Arize Phoenix scorers

Arize Phoenix is an open-source LLM observability and evaluation framework from Arize AI. MLflow integrates with Phoenix so that you can use Phoenix evaluators as scorers for tasks including hallucination detection, relevance assessment, and toxicity identification.

Requirements

Install the arize-phoenix-evals package:

Python
%pip install arize-phoenix-evals

Quick start

To call a Phoenix scorer directly:

Python
from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="databricks:/databricks-gpt-5-mini")
feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is a country in Europe. Its capital is Paris."},
)

print(feedback.value) # "factual" or "hallucinated"
print(feedback.metadata["score"]) # Numeric score

To call Phoenix scorers using mlflow.genai.evaluate():

Python
import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source AI engineering platform for agents and LLMs.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
        "expectations": {
            "context": "MLflow provides APIs like mlflow.start_run() for experiment tracking."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Hallucination(model="databricks:/databricks-gpt-5-mini"),
        Relevance(model="databricks:/databricks-gpt-5-mini"),
    ],
)

Available Phoenix scorers

| Scorer | What does it evaluate? |
| --- | --- |
| Hallucination | Does the output contain fabricated information absent from the context? |
| Relevance | Is the retrieved context relevant to the input query? |
| Toxicity | Does the output contain toxic or harmful content? |
| QA | Is the answer accurate relative to the reference material? |
| Summarization | Is the summary accurate and complete? |

Create a scorer by name

You can create a scorer dynamically with get_scorer by passing the metric name as a string:

Python
from mlflow.genai.scorers.phoenix import get_scorer

scorer = get_scorer(
    metric_name="Hallucination",
    model="databricks:/databricks-gpt-5-mini",
)
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
    expectations={"context": "MLflow is an ML platform."},
)