
Built-in LLM judges

Built-in LLM judges are predefined scorers that use Databricks-hosted LLMs to evaluate common quality dimensions of your GenAI application, such as relevance, safety, groundedness, and correctness. Use them when you want to start evaluating quality quickly. When you need more control over your judges, use custom LLM judges or code-based (Python) scorers.

For the complete list and detailed documentation, see the MLflow predefined scorers documentation.
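As a rough illustration of getting started quickly, the sketch below scores two hand-written rows with three built-in judges. It assumes MLflow 3 with `mlflow.genai.evaluate` and the `mlflow.genai.scorers` module, plus a workspace that can reach the Databricks-hosted judge models. The sample rows, the `question` input key, and the `run_builtin_judges` helper name are illustrative assumptions, not part of the original page.

```python
# Evaluation rows: "inputs" and "outputs" feed every judge used here;
# "expectations" carries the ground truth that Correctness compares against.
# The rows themselves are made-up sample data.
eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {"expected_response": "Paris is the capital of France."},
    },
    {
        "inputs": {"question": "How many days are in a leap year?"},
        "outputs": "A leap year has 366 days.",
        "expectations": {"expected_response": "There are 366 days in a leap year."},
    },
]


def run_builtin_judges(data):
    """Score `data` with three built-in judges.

    MLflow is imported lazily so the dataset above can be inspected
    without MLflow installed. Calling this function requires MLflow 3
    and access to the hosted judge models.
    """
    import mlflow
    from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

    return mlflow.genai.evaluate(
        data=data,
        scorers=[RelevanceToQuery(), Safety(), Correctness()],
    )
```

Calling `run_builtin_judges(eval_data)` in a configured environment should log one assessment per judge for each row. The two judges that need no ground truth ignore the `expectations` field.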

Available judges

| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context? Is the agent hallucinating? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| ToolCallCorrectness | inputs, outputs, expectations | Yes | Are the tool calls and arguments correct for the user query? |
| ToolCallEfficiency | inputs, outputs | No | Are the tool calls efficient without redundancy? |
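The argument requirements listed for each judge can be checked before an evaluation run. The sketch below is plain Python with no MLflow dependency: it records each judge's required fields as data and reports which ones a row lacks. The `REQUIRED_FIELDS` mapping and the `missing_fields` helper are hypothetical names introduced for illustration, not part of any MLflow API.

```python
# Required top-level fields per built-in judge, copied from the
# "Arguments" column. This mapping is an illustrative helper, not an MLflow API.
REQUIRED_FIELDS = {
    "RelevanceToQuery": ("inputs", "outputs"),
    "RetrievalRelevance": ("inputs", "outputs"),
    "Safety": ("inputs", "outputs"),
    "RetrievalGroundedness": ("inputs", "outputs"),
    "Correctness": ("inputs", "outputs", "expectations"),
    "RetrievalSufficiency": ("inputs", "outputs", "expectations"),
    "Guidelines": ("inputs", "outputs"),
    "ExpectationsGuidelines": ("inputs", "outputs", "expectations"),
    "ToolCallCorrectness": ("inputs", "outputs", "expectations"),
    "ToolCallEfficiency": ("inputs", "outputs"),
}


def missing_fields(judge_name, row):
    """Return the required fields that `row` lacks (or sets to None) for a judge."""
    return [field for field in REQUIRED_FIELDS[judge_name] if row.get(field) is None]


# A row without ground truth: enough for Safety, not for Correctness.
row = {
    "inputs": {"question": "What is MLflow?"},
    "outputs": "MLflow is an open source platform for managing the ML lifecycle.",
}

print(missing_fields("Safety", row))       # []
print(missing_fields("Correctness", row))  # ['expectations']
```

A check like this only confirms that the fields are present. It does not verify their contents, such as the `guidelines` entry that ExpectationsGuidelines expects inside `expectations`.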

Multi-turn judges

For conversational AI systems, MLflow provides judges that evaluate entire conversations rather than individual turns. These judges analyze the complete conversation history to assess quality patterns that emerge over multiple interactions.

Use multi-turn judges both for evaluation during development and for monitoring in production.


| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| ConversationCompleteness | session | No | Did the agent address all user questions throughout the conversation? |
| UserFrustration | session | No | Did the user become frustrated? Was the frustration resolved? |
| KnowledgeRetention | session | No | Does the agent correctly retain information from earlier in the conversation? |
| ConversationalGuidelines | session, guidelines | No | Do the assistant's responses comply with provided guidelines throughout the conversation? |
| ConversationalRoleAdherence | session | No | Does the assistant maintain its assigned role throughout the conversation? |
| ConversationalSafety | session | No | Are the assistant's responses safe and free of harmful content? |
| ConversationalToolCallEfficiency | session | No | Was tool usage across the conversation efficient and appropriate? |
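Because these judges take a session rather than a single input/output pair, the traces they score need to be grouped by conversation. The sketch below shows one way this might look. It assumes that traces are tagged with the `mlflow.trace.session` metadata key, that `mlflow.search_traces` accepts `experiment_ids` and `return_type="list"` in your MLflow version, and that the multi-turn judges are importable from `mlflow.genai.scorers`. Both helper names and the `experiment_id` parameter are hypothetical, so verify the calls against the MLflow predefined scorers documentation for your release.

```python
# Metadata key assumed to mark which conversation a trace belongs to.
SESSION_METADATA_KEY = "mlflow.trace.session"


def tag_current_trace_with_session(session_id):
    """Attach a session ID to the trace being recorded.

    Intended to be called inside a traced function (for example one
    decorated with @mlflow.trace) on every turn of a conversation.
    """
    import mlflow

    mlflow.update_current_trace(metadata={SESSION_METADATA_KEY: session_id})


def run_conversation_judges(experiment_id):
    """Score logged conversations with two multi-turn judges.

    Requires an MLflow version that ships the multi-turn judges and
    access to the hosted judge models.
    """
    import mlflow
    from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

    # Fetch traces as a list of Trace objects; traces that share a
    # session ID are expected to be evaluated together as one conversation.
    traces = mlflow.search_traces(
        experiment_ids=[experiment_id],
        return_type="list",
    )
    return mlflow.genai.evaluate(
        data=traces,
        scorers=[ConversationCompleteness(), UserFrustration()],
    )
```

Imports are deferred into the functions so that the module loads even where MLflow is not installed. In a notebook you would typically import at the top instead.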

Next steps