LLM-as-judge.
Using an LLM to score another LLM’s output against a defined rubric.
How it’s computed
The judge LLM receives a rubric (criteria, scale, examples), the input, and the output to evaluate. It returns a score and usually a rationale. Simple variants take a single judgment; stronger variants reduce noise by averaging or majority-voting over multiple chain-of-thought samples, or, as in G-Eval, by weighting candidate scores with their token probabilities.
The core trade-off is cost versus stability. More samples or higher-tier judges give you a less noisy score, but you pay for it in tokens and latency.
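As a concrete illustration of that trade-off, here is a minimal Python sketch (not any particular framework's implementation) of aggregating repeated judge samples; the function name aggregate_scores, the sample values, and the tie-handling rule are illustrative assumptions.

    import statistics
    from collections import Counter

    def aggregate_scores(scores: list[int], method: str = "mean") -> float:
        """Combine repeated judge samples into one less-noisy score."""
        if method == "mean":
            return statistics.mean(scores)
        if method == "majority":
            # Most frequent score wins; ties fall back to the mean.
            counts = Counter(scores).most_common()
            if len(counts) == 1 or counts[0][1] > counts[1][1]:
                return float(counts[0][0])
            return statistics.mean(scores)
        raise ValueError(f"unknown aggregation method: {method}")

    # Three samples of the same Q/A pair from the same judge:
    print(aggregate_scores([4, 4, 3], method="mean"))      # ~3.67
    print(aggregate_scores([4, 4, 3], method="majority"))  # 4.0

Averaging keeps more information from the scale, while majority voting is more robust to a single outlier judgment; either way, cost grows linearly with the number of samples.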
Worked example
Rubric: “Score 1-5 on answer helpfulness. Score 5 means fully addresses the user’s need; score 1 means does not address it at all.”
The judge reads the Q/A pair and returns: “Score: 4. The answer addresses the main question but misses a follow-up detail.”
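A sketch of how this exchange could be wired up, assuming a prompt layout and a "Score: <1-5>." reply convention chosen here for easy parsing; the helper names build_judge_prompt and parse_judgment are illustrative, not the prompt any specific judge actually uses.

    import re

    RUBRIC = ("Score 1-5 on answer helpfulness. Score 5 means fully addresses "
              "the user's need; score 1 means does not address it at all.")

    def build_judge_prompt(question: str, answer: str) -> str:
        # Rubric + input + output, plus an explicit reply format so the score is easy to parse.
        return (
            f"Rubric: {RUBRIC}\n\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n\n"
            "Reply as 'Score: <1-5>.' followed by a one-sentence rationale."
        )

    def parse_judgment(reply: str) -> tuple[int, str]:
        # Split a "Score: N. rationale" reply into (score, rationale).
        match = re.match(r"\s*Score:\s*([1-5])\.?\s*(.*)", reply, re.DOTALL)
        if match is None:
            raise ValueError(f"unparsable judge reply: {reply!r}")
        return int(match.group(1)), match.group(2).strip()

    score, rationale = parse_judgment(
        "Score: 4. The answer addresses the main question but misses a follow-up detail."
    )
    # score == 4; rationale holds the judge's one-sentence explanation.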
How Yoke Agent uses it
LLM-as-judge is the default scoring backbone for RAGAS (RAG metrics) and G-Eval (agent metrics). Yoke enforces a different judge model family than the generator to avoid same-model inflation.
Deterministic overrides exist where possible: tool-call accuracy parses TOOL_CALL output structurally, and entity recall uses a fixed NER model, so the leaderboard is not purely judge-bound.
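As an illustration of what such a deterministic override might look like, the sketch below parses TOOL_CALL lines structurally instead of asking a judge to read them. The assumed trace format (a TOOL_CALL marker followed by a JSON payload on one line), the helper names, and the exact-match rule are assumptions for this example, not Yoke's internal format.

    import json
    import re

    # Assumed trace format (illustrative only): the agent emits lines like
    #   TOOL_CALL {"name": "search", "arguments": {"query": "weather in Oslo"}}
    TOOL_CALL_RE = re.compile(r"^TOOL_CALL\s+(\{.*\})\s*$", re.MULTILINE)

    def extract_tool_calls(trace: str) -> list[dict]:
        # Parse TOOL_CALL lines into dicts; malformed JSON is simply skipped here.
        calls = []
        for payload in TOOL_CALL_RE.findall(trace):
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                continue
        return calls

    def tool_call_accuracy(trace: str, expected: list[dict]) -> float:
        # Fraction of expected calls present in the trace with matching name and arguments.
        actual = extract_tool_calls(trace)
        if not expected:
            return 1.0
        return sum(1 for call in expected if call in actual) / len(expected)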
Frequently asked
How much can I trust absolute scores?
Not much. Different judges produce meaningfully different absolute scores on identical inputs. Use relative rankings across a grid, which are far more stable across judges, and treat absolute numbers as internal-only.
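A toy illustration of why rankings transfer even when absolute scores do not; the configuration names and scores below are hypothetical.

    def rank(scores: dict[str, float]) -> list[str]:
        # Order configurations from best to worst by one judge's scores.
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical absolute scores from two different judge models on the same grid:
    judge_a = {"config-x": 4.2, "config-y": 3.6, "config-z": 2.9}
    judge_b = {"config-x": 3.4, "config-y": 3.0, "config-z": 2.1}  # systematically lower

    print(rank(judge_a))  # ['config-x', 'config-y', 'config-z']
    print(rank(judge_b))  # same ordering despite different absolute numbers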
When should I NOT use LLM-as-judge?
Regulated domains where judge bias is unacceptable, or any domain where the judge is less competent than the model being evaluated. If you are evaluating a frontier model with a cheaper judge, the results are suspect.
Is there a self-consistency trick?
Yes: sample the judgment several times and majority-vote or average. Doubling the number of samples roughly halves the variance of the aggregated score, but it also doubles token cost. G-Eval's probability weighting is a cheaper substitute.
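A minimal sketch of that probability-weighted alternative: rather than sampling several full judgments, take the judge's probabilities over the possible score tokens from a single call and compute an expected score. The probabilities below are made up, and obtaining real ones assumes the judge API exposes token logprobs.

    def probability_weighted_score(score_probs: dict[int, float]) -> float:
        # Expected score under the judge's distribution over score tokens (G-Eval style).
        total = sum(score_probs.values())
        return sum(score * p for score, p in score_probs.items()) / total

    # Hypothetical probabilities for the score tokens "1".."5" from a single judge call:
    probs = {1: 0.01, 2: 0.04, 3: 0.20, 4: 0.60, 5: 0.15}
    print(probability_weighted_score(probs))  # 3.84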
Same-model bias?
If the generator and judge are the same model, judges tend to over-reward their own outputs. Always cross-family.
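A guard along these lines could look like the sketch below; the family-from-name heuristic is an assumption and would need to match however your stack actually identifies model families.

    def assert_cross_family(generator_model: str, judge_model: str) -> None:
        def family(name: str) -> str:
            # Crude heuristic: prefix before the first dash, e.g. "gpt-4o" -> "gpt".
            return name.lower().split("-")[0]
        if family(generator_model) == family(judge_model):
            raise ValueError(
                f"judge '{judge_model}' shares a model family with generator "
                f"'{generator_model}'; pick a judge from a different family"
            )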