Built-in LLM judges
Built-in LLM judges are predefined scorers that use Databricks-hosted LLMs to evaluate common quality dimensions of your GenAI application, such as relevance, safety, groundedness, and correctness. Use them when you want to start evaluating quality quickly. When you need more control over evaluation, use custom LLM judges or code-based scorers written in Python.
For the complete list and detailed documentation, see the MLflow predefined scorers documentation.
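The following sketch shows one way to run two of these judges over a small dataset with `mlflow.genai.evaluate()`. It assumes MLflow 3's `mlflow.genai` API in an environment where the Databricks-hosted judge models are available; the example question and answer are illustrative.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Each row supplies the fields these judges read: `inputs` (the user's
# request) and `outputs` (your application's response).
eval_data = [
    {
        "inputs": {"question": "How do I create a Delta table?"},
        "outputs": "Use CREATE TABLE ... USING DELTA in Databricks SQL.",
    },
]

# Run the built-in judges over the dataset. Results are logged to an
# MLflow evaluation run, where each judge appears as a metric.
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[RelevanceToQuery(), Safety()],
)
```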
Available judges
| Judge | Arguments | Requires ground truth | What it evaluates |
|---|---|---|---|
| `RelevanceToQuery` | `inputs`, `outputs` | No | Is the response directly relevant to the user's request? |
| `RetrievalRelevance` | `trace` | No | Is the retrieved context directly relevant to the user's request? |
| `Safety` | `inputs`, `outputs` | No | Is the content free from harmful, offensive, or toxic material? |
| `RetrievalGroundedness` | `trace` | No | Is the response grounded in the information provided in the context? Is the agent hallucinating? |
| `Correctness` | `inputs`, `outputs`, `expectations` | Yes | Is the response correct compared to the provided ground truth? |
| `RetrievalSufficiency` | `trace`, `expectations` | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
| `Guidelines` | `inputs`, `outputs` | No | Does the response meet specified natural language criteria? |
| `ExpectationsGuidelines` | `inputs`, `outputs`, `expectations` | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| | | Yes | Are the tool calls and arguments correct for the user query? |
| | | No | Are the tool calls efficient without redundancy? |
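Judges that require ground truth read it from an `expectations` field on each dataset row. The sketch below, again assuming MLflow 3's `mlflow.genai.scorers` module, pairs `Correctness` (which compares the response against expected facts) with a `Guidelines` judge whose criteria are supplied at construction time; the metric name and guideline text are illustrative.

```python
import mlflow
from mlflow.genai.scorers import Correctness, Guidelines

eval_data = [
    {
        "inputs": {"question": "What does MLflow Tracking record?"},
        "outputs": "MLflow Tracking records parameters, metrics, and artifacts for each run.",
        # Ground truth for judges that require it, such as Correctness.
        "expectations": {
            "expected_facts": [
                "MLflow Tracking records parameters, metrics, and artifacts."
            ],
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Correctness(),  # compares outputs against the expected facts
        Guidelines(
            name="conciseness",  # name under which the metric is logged
            guidelines="The response must be concise and direct.",
        ),
    ],
)
```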
Multi-turn judges
For conversational AI systems, MLflow provides judges that evaluate entire conversations rather than individual turns. These judges analyze the complete conversation history to assess quality patterns that emerge over multiple interactions.
Use multi-turn judges both for evaluation during development and for monitoring in production.
For the complete list and detailed documentation, see the MLflow predefined scorers documentation.
| Requires ground truth | What it evaluates |
|---|---|
| No | Did the agent address all user questions throughout the conversation? |
| No | Did the user become frustrated? Was the frustration resolved? |
| No | Does the agent correctly retain information from earlier in the conversation? |
| No | Do the assistant's responses comply with provided guidelines throughout the conversation? |
| No | Does the assistant maintain its assigned role throughout the conversation? |
| No | Are the assistant's responses safe and free of harmful content? |
| No | Was tool usage across the conversation efficient and appropriate? |
Next steps
- Choose the LLM that powers a judge
- Build a custom LLM judge when built-in judges don't fit your use case
- Align judges with human feedback to improve accuracy on your domain