# Arize Phoenix scorers
Arize Phoenix is an open-source LLM observability and evaluation framework from Arize AI. MLflow integrates with Phoenix so that you can use Phoenix evaluators as scorers for tasks including hallucination detection, relevance assessment, and toxicity identification.
## Requirements
Install the `arize-phoenix-evals` package:

```bash
%pip install arize-phoenix-evals
```
## Quick start
To call a Phoenix scorer directly:

```python
from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="databricks:/databricks-gpt-5-mini")

feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is a country in Europe. Its capital is Paris."},
)

print(feedback.value)  # "factual" or "hallucinated"
print(feedback.metadata["score"])  # Numeric score
```
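If you call a scorer over several examples yourself, you may want a quick summary of the resulting labels. A minimal sketch (the helper name is an assumption, not an MLflow or Phoenix API; only the `"factual"`/`"hallucinated"` labels come from the scorer above):

```python
# Hypothetical helper: summarize Hallucination labels into a pass rate,
# assuming each feedback.value is "factual" or "hallucinated".
def hallucination_pass_rate(labels):
    """Return the fraction of outputs judged "factual"."""
    if not labels:
        return 0.0
    return sum(label == "factual" for label in labels) / len(labels)


print(hallucination_pass_rate(["factual", "factual", "hallucinated"]))  # 0.6666666666666666
```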
To call Phoenix scorers using `mlflow.genai.evaluate()`:

```python
import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source AI engineering platform for agents and LLMs.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
        "expectations": {
            "context": "MLflow provides APIs like mlflow.start_run() for experiment tracking."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Hallucination(model="databricks:/databricks-gpt-5-mini"),
        Relevance(model="databricks:/databricks-gpt-5-mini"),
    ],
)
```
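Once an evaluation run produces numeric scores (each scorer's `feedback.metadata["score"]`, as shown in the quick start), you might gate a CI pipeline on them. A minimal sketch under that assumption (the `gate` helper and threshold are illustrative, not part of MLflow or Phoenix):

```python
# Hypothetical CI gate: fail when any example's numeric score falls
# below a threshold, and report which examples failed.
def gate(scores, threshold=0.8):
    """Return (passed, indices_of_failing_examples)."""
    failures = [i for i, score in enumerate(scores) if score < threshold]
    return len(failures) == 0, failures


ok, failing = gate([0.95, 0.6, 0.9], threshold=0.8)
print(ok, failing)  # False [1]
```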
## Available Phoenix scorers
| Scorer | What does it evaluate? |
|---|---|
| `Hallucination` | Does the output contain fabricated information absent from the context? |
| `Relevance` | Is the retrieved context relevant to the input query? |
| `Toxicity` | Does the output contain toxic or harmful content? |
| `QA` | Is the answer accurate relative to the reference material? |
| `Summarization` | Is the summary accurate and complete? |
## Create a scorer by name
You can dynamically create a scorer using `get_scorer` by passing the metric name as a string:

```python
from mlflow.genai.scorers.phoenix import get_scorer

scorer = get_scorer(
    metric_name="Hallucination",
    model="databricks:/databricks-gpt-5-mini",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
    expectations={"context": "MLflow is an ML platform."},
)
```
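Because `get_scorer` takes the metric name as a plain string, a typo only surfaces when the scorer is created. If metric names come from configuration, you might validate them first. A minimal sketch (the guard function is an assumption, not an MLflow API; the name set below lists only the scorers named earlier on this page):

```python
# Hypothetical guard: fail fast with a clear error before handing a
# possibly misspelled metric name to get_scorer.
KNOWN_METRICS = {"Hallucination", "Relevance", "Toxicity"}


def resolve_metric_name(name):
    if name not in KNOWN_METRICS:
        raise ValueError(
            f"Unknown Phoenix metric {name!r}; expected one of {sorted(KNOWN_METRICS)}"
        )
    return name


print(resolve_metric_name("Hallucination"))  # Hallucination
```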