LLM-as-judge.
Using an LLM to score another LLM’s output against a defined rubric.
How it’s computed
The judge LLM receives a rubric (criteria, scale, examples), the input, and the output to evaluate. It returns a score and usually a rationale. Simple variants take a single judgment; stronger variants reduce noise by averaging or majority-voting over multiple chain-of-thought samples, or, as in G-Eval, by weighting candidate scores with their token probabilities.
The core trade-off is cost versus stability. More samples or higher-tier judges give you a less noisy score, but you pay for it in tokens and latency.
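As a concrete illustration of that trade-off, here is a minimal Python sketch (not any particular framework's implementation) of aggregating repeated judge samples; the function name aggregate_scores, the sample values, and the tie-handling rule are illustrative assumptions.

    import statistics
    from collections import Counter

    def aggregate_scores(scores: list[int], method: str = "mean") -> float:
        """Combine repeated judge samples into one less-noisy score."""
        if method == "mean":
            return statistics.mean(scores)
        if method == "majority":
            # Most frequent score wins; ties fall back to the mean.
            counts = Counter(scores).most_common()
            if len(counts) == 1 or counts[0][1] > counts[1][1]:
                return float(counts[0][0])
            return statistics.mean(scores)
        raise ValueError(f"unknown aggregation method: {method}")

    # Three samples of the same Q/A pair from the same judge:
    print(aggregate_scores([4, 4, 3], method="mean"))      # ~3.67
    print(aggregate_scores([4, 4, 3], method="majority"))  # 4.0

Averaging keeps more information from the scale, while majority voting is more robust to a single outlier judgment; either way, cost grows linearly with the number of samples.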
Worked example
Rubric: “Score 1-5 on answer helpfulness. Score 5 means fully addresses the user’s need; score 1 means does not address it at all.”
The judge reads the Q/A pair and returns: “Score: 4. The answer addresses the main question but misses a follow-up detail.”
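A sketch of how this exchange could be wired up, assuming a prompt layout and a "Score: <1-5>." reply convention chosen here for easy parsing; the helper names build_judge_prompt and parse_judgment are illustrative, not the prompt any specific judge actually uses.

    import re

    RUBRIC = ("Score 1-5 on answer helpfulness. Score 5 means fully addresses "
              "the user's need; score 1 means does not address it at all.")

    def build_judge_prompt(question: str, answer: str) -> str:
        # Rubric + input + output, plus an explicit reply format so the score is easy to parse.
        return (
            f"Rubric: {RUBRIC}\n\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n\n"
            "Reply as 'Score: <1-5>.' followed by a one-sentence rationale."
        )

    def parse_judgment(reply: str) -> tuple[int, str]:
        # Split a "Score: N. rationale" reply into (score, rationale).
        match = re.match(r"\s*Score:\s*([1-5])\.?\s*(.*)", reply, re.DOTALL)
        if match is None:
            raise ValueError(f"unparsable judge reply: {reply!r}")
        return int(match.group(1)), match.group(2).strip()

    score, rationale = parse_judgment(
        "Score: 4. The answer addresses the main question but misses a follow-up detail."
    )
    # score == 4; rationale holds the judge's one-sentence explanation.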
How Yoke Agent uses it
LLM-as-judge is the default scoring backbone for RAGAS (RAG metrics) and G-Eval (agent metrics). Yoke enforces a different judge model family than the generator to avoid same-model inflation.
Deterministic overrides exist where possible: tool-call accuracy parses TOOL_CALL output structurally, and entity recall uses a fixed NER model, so the leaderboard is not purely judge-bound.
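As an illustration of what such a deterministic override might look like, the sketch below parses TOOL_CALL lines structurally instead of asking a judge to read them. The assumed trace format (a TOOL_CALL marker followed by a JSON payload on one line), the helper names, and the exact-match rule are assumptions for this example, not Yoke's internal format.

    import json
    import re

    # Assumed trace format (illustrative only): the agent emits lines like
    #   TOOL_CALL {"name": "search", "arguments": {"query": "weather in Oslo"}}
    TOOL_CALL_RE = re.compile(r"^TOOL_CALL\s+(\{.*\})\s*$", re.MULTILINE)

    def extract_tool_calls(trace: str) -> list[dict]:
        # Parse TOOL_CALL lines into dicts; malformed JSON is simply skipped here.
        calls = []
        for payload in TOOL_CALL_RE.findall(trace):
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                continue
        return calls

    def tool_call_accuracy(trace: str, expected: list[dict]) -> float:
        # Fraction of expected calls present in the trace with matching name and arguments.
        actual = extract_tool_calls(trace)
        if not expected:
            return 1.0
        return sum(1 for call in expected if call in actual) / len(expected)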
Frequently asked
How much can I trust absolute scores?
Not much. Different judges produce meaningfully different absolute scores on identical inputs. Use relative rankings across a grid, which are far more stable across judges, and treat absolute numbers as internal-only.
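A toy illustration of why rankings transfer even when absolute scores do not; the configuration names and scores below are hypothetical.

    def rank(scores: dict[str, float]) -> list[str]:
        # Order configurations from best to worst by one judge's scores.
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical absolute scores from two different judge models on the same grid:
    judge_a = {"config-x": 4.2, "config-y": 3.6, "config-z": 2.9}
    judge_b = {"config-x": 3.4, "config-y": 3.0, "config-z": 2.1}  # systematically lower

    print(rank(judge_a))  # ['config-x', 'config-y', 'config-z']
    print(rank(judge_b))  # same ordering despite different absolute numbers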
When should I NOT use LLM-as-judge?
Regulated domains where judge bias is unacceptable, or any domain where the judge is less competent than the model being evaluated. If you are evaluating a frontier model with a cheaper judge, the results are suspect.
Is there a self-consistency trick?
Yes: sample the judgment several times and majority-vote or average. Doubling the number of samples roughly halves the variance of the aggregated score, but it also doubles token cost. G-Eval's probability weighting is a cheaper substitute.
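A minimal sketch of that probability-weighted alternative: rather than sampling several full judgments, take the judge's probabilities over the possible score tokens from a single call and compute an expected score. The probabilities below are made up, and obtaining real ones assumes the judge API exposes token logprobs.

    def probability_weighted_score(score_probs: dict[int, float]) -> float:
        # Expected score under the judge's distribution over score tokens (G-Eval style).
        total = sum(score_probs.values())
        return sum(score * p for score, p in score_probs.items()) / total

    # Hypothetical probabilities for the score tokens "1".."5" from a single judge call:
    probs = {1: 0.01, 2: 0.04, 3: 0.20, 4: 0.60, 5: 0.15}
    print(probability_weighted_score(probs))  # 3.84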
Same-model bias?
If the generator and judge are the same model, judges tend to over-reward their own outputs. Always cross-family.
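A guard along these lines could look like the sketch below; the family-from-name heuristic is an assumption and would need to match however your stack actually identifies model families.

    def assert_cross_family(generator_model: str, judge_model: str) -> None:
        def family(name: str) -> str:
            # Crude heuristic: prefix before the first dash, e.g. "gpt-4o" -> "gpt".
            return name.lower().split("-")[0]
        if family(generator_model) == family(judge_model):
            raise ValueError(
                f"judge '{judge_model}' shares a model family with generator "
                f"'{generator_model}'; pick a judge from a different family"
            )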