
Correctness judge & scorer

The predefined judge judges.is_correct() assesses whether your GenAI application's response is factually correct by comparing it against the provided ground-truth information (expected_facts or expected_response).

This judge is available through the predefined Correctness scorer for evaluating application responses against known correct answers.

API signature

Python
from mlflow.genai.judges import is_correct

def is_correct(
    *,
    request: str,                                  # User's question or query
    response: str,                                 # Application's response to evaluate
    expected_facts: Optional[list[str]] = None,    # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,       # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None                     # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

  1. Install MLflow and the required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Create an MLflow experiment by following the environment setup quickstart (a minimal setup sketch follows this list).
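
If you don't have an experiment yet, a minimal setup sketch (the tracking URI and experiment path below are placeholder assumptions; adjust them for your workspace):

Python
import mlflow

# Point MLflow at your Databricks workspace and select (or create) an experiment.
# The experiment path is a placeholder - replace it with your own.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/correctness-judge-demo")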

Direct SDK usage

Python
from mlflow.genai.judges import is_correct

# Example 1: Response contains expected facts
feedback = is_correct(
    request="What is MLflow?",
    response="MLflow is an open-source platform for managing the ML lifecycle.",
    expected_facts=[
        "MLflow is open-source",
        "MLflow is a platform for ML lifecycle"
    ]
)
print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of correctness

# Example 2: Response missing or contradicting facts
feedback = is_correct(
    request="When was MLflow released?",
    response="MLflow was released in 2017.",
    expected_facts=["MLflow was released in June 2018"]
)
print(feedback.value)      # "no"
print(feedback.rationale)  # Explanation of what's incorrect
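
Per the signature above, the judge also accepts a full ground-truth answer via expected_response instead of a fact list. A minimal sketch along the same lines as the examples above:

Python
# Example 3: Using expected_response instead of expected_facts
feedback = is_correct(
    request="What is MLflow?",
    response="MLflow is an open-source platform for managing the ML lifecycle.",
    expected_response="MLflow is an open-source platform for the machine learning lifecycle."
)
print(feedback.value)      # "yes" if the response conveys the ground truth
print(feedback.rationale)  # Explanation of the judgment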

Using the prebuilt scorer

The is_correct judge is available through the prebuilt Correctness scorer.

Requirements:

  • Trace requirements: inputs and outputs must be on the root span of the Trace
  • Ground-truth labels: Required - you must provide either expected_facts or expected_response in the expectations dictionary
Python
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    }
]

# Run evaluation with the Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Correctness()]
)
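
The returned object can then be inspected programmatically; a minimal sketch, assuming eval_results exposes an aggregate metrics dictionary:

Python
# Aggregate scores across the dataset; per-row feedback is also logged
# to the MLflow run and can be reviewed in the MLflow UI.
print(eval_results.metrics)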

Alternative: using expected_response

You can also use expected_response instead of expected_facts:

Python
eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()]
)
tip

Using expected_facts is recommended over expected_response because it allows more flexible evaluation: the response doesn't need to match word for word, it only needs to contain the key facts.

Using in a custom scorer

When evaluating applications whose data structures differ from the predefined scorer's requirements, wrap the judge in a custom scorer:

Python
import mlflow
from mlflow.genai.judges import is_correct
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"question": "What are the main components of MLflow?"},
        "outputs": {
            "answer": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    },
    {
        "inputs": {"question": "What is MLflow used for?"},
        "outputs": {
            "answer": "MLflow is used for building websites."
        },
        "expectations": {
            "facts": [
                "MLflow is used for managing ML lifecycle",
                "MLflow helps with experiment tracking"
            ]
        }
    }
]

@scorer
def correctness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_correct(
        request=inputs["question"],
        response=outputs["answer"],
        expected_facts=expectations["facts"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_scorer]
)
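
The judge's optional name parameter (from the API signature above) can be used here to control how the feedback is labeled in the MLflow UIs. A hypothetical variant of the scorer above; the display name is an arbitrary choice, not a built-in value:

Python
@scorer
def named_correctness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_correct(
        request=inputs["question"],
        response=outputs["answer"],
        expected_facts=expectations["facts"],
        name="answer_correctness"  # arbitrary display name (illustrative assumption)
    )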

Interpreting results

The judge returns a Feedback object with:

  • value: "yes" if the response is correct, "no" if incorrect
  • rationale: A detailed explanation of which facts are supported or missing
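
Because value is a plain string, the result is easy to act on in code; a minimal sketch using the feedback object from the direct SDK examples above:

Python
# Flag incorrect responses for review.
if feedback.value == "no":
    print(f"Incorrect response: {feedback.rationale}")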

Next steps