G-Eval — Yoke Agent glossary

How it’s computed

The judge receives the rubric (e.g. “Score 1-5 for task completion”), generates chain-of-thought reasoning, then produces a score. Instead of taking the top-1 score, G-Eval reads the output token probabilities for each candidate score and combines them into a weighted average.

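$$\text{score} = \sum_{s \in S} s \cdot P(s)$$

where S is the score bucket (for a 1-5 rubric, S = {1, 2, 3, 4, 5}) and P(s) is the probability the judge assigns to score s.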

This captures the judge's uncertainty and reduces single-draw variance: because the result is a continuous value, the metric no longer flips between adjacent integer scores across runs and shifts by small fractional amounts instead.
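The weighted average can be sketched in a few lines of Python. This is a minimal illustration, not Yoke Agent's implementation: the `logprobs` dictionary (token string to log-probability at the position where the judge emits its score) and the function name `weighted_score` are hypothetical, and the sketch assumes each score is a single token. It also renormalises over the candidate scores so that stray non-score tokens do not drag the result down, which is an assumption on top of the formula above.

```python
import math

def weighted_score(logprobs, candidates=(1, 2, 3, 4, 5)):
    """Probability-weighted score from score-token log-probabilities.

    `logprobs` is a hypothetical dict mapping token strings (e.g. "4")
    to log-probabilities at the position where the judge emits its score.
    """
    # Keep only tokens that are valid candidate scores.
    probs = {}
    for s in candidates:
        lp = logprobs.get(str(s))
        if lp is not None:
            probs[s] = math.exp(lp)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no candidate score token found in logprobs")

    # Renormalise so candidate probabilities sum to 1, then take the expectation.
    return sum(s * p / total for s, p in probs.items())

# Hypothetical judge output: 10% of the mass sits on a non-score token.
example = {
    "3": math.log(0.18),
    "4": math.log(0.54),
    "5": math.log(0.18),
    "I": math.log(0.10),
}
print(round(weighted_score(example), 3))
```

After renormalisation the candidate probabilities become 0.2, 0.6 and 0.2, so the result is approximately 4.0.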

Worked example

Rubric: “Score 1-5 on decision-path quality.” The judge outputs its reasoning and a final score, with candidate probabilities P(3)=0.2, P(4)=0.6, P(5)=0.2. Weighted score = 3×0.2 + 4×0.6 + 5×0.2 = 4.0. Plain top-1 would also give 4, so the two agree here. They diverge in borderline cases: with P(3)=0.48 and P(4)=0.52, top-1 reports 4 and can flip to 3 on the next run, while the weighted score is 3×0.48 + 4×0.52 = 3.52 and moves only slightly if the probabilities shift.
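The comparison between top-1 and weighted scoring can be checked with a short Python sketch. The `worked` distribution is the one from the example above; `run_a` and `run_b` are made-up borderline distributions for the same item on two runs, included only to show how the two scoring rules react to a small probability shift.

```python
def top1(probs):
    # Most likely score: what plain top-1 scoring would report.
    return max(probs, key=probs.get)

def weighted(probs):
    # Expected score under the judge's distribution.
    return sum(s * p for s, p in probs.items())

worked = {3: 0.2, 4: 0.6, 5: 0.2}  # distribution from the worked example
run_a = {3: 0.48, 4: 0.52}         # hypothetical borderline run
run_b = {3: 0.52, 4: 0.48}         # same item, slightly shifted probabilities

for name, probs in [("worked", worked), ("run_a", run_a), ("run_b", run_b)]:
    print(name, top1(probs), round(weighted(probs), 2))
```

In this sketch top-1 jumps a full point between `run_a` and `run_b` (4 versus 3), while the weighted scores differ by only 0.04 (3.52 versus 3.48).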

How Yoke Agent uses it

G-Eval is the judging backbone for the agent workbench. All 14 agent rubric metrics — task completion, goal alignment, faithfulness, persona consistency and the rest — are G-Eval scorers with different rubrics.

Yoke also pairs G-Eval with deterministic overrides where available (tool-call accuracy, entity recall) so you are not purely judge-bound.
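One way to picture a deterministic override is the Python sketch below. It is an illustration under stated assumptions, not Yoke Agent's actual logic: `entity_recall` uses a simple case-insensitive substring match, and `resolve_score` (a hypothetical helper) just prefers the deterministic value whenever one is available. How the two values are reconciled in practice, including their scales, is not specified in this entry.

```python
from typing import Optional

def entity_recall(expected: set[str], output_text: str) -> float:
    """Fraction of expected entities found in the output.

    Assumption: a case-insensitive substring match counts as a hit.
    """
    if not expected:
        return 1.0
    text = output_text.lower()
    found = sum(1 for entity in expected if entity.lower() in text)
    return found / len(expected)

def resolve_score(judge_score: float, deterministic: Optional[float]) -> float:
    """Prefer the deterministic value when one exists, else use the judge's score.

    Hypothetical helper: both inputs are assumed to be on the same scale.
    """
    return deterministic if deterministic is not None else judge_score

recall = entity_recall({"Paris", "Berlin"}, "Flights to paris are booked.")
print(recall)                      # one of two entities found
print(resolve_score(0.7, recall))  # deterministic value wins
print(resolve_score(0.7, None))    # no override available, judge score is used
```

The point of the design is that a check with a known right answer does not depend on the judge's wording or sampling, so it is used wherever such a check exists.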

Frequently asked

Why is probability weighting better than top-1?

Lower variance across runs and better calibration on borderline cases. Top-1 scoring flip-flops between adjacent scores; G-Eval gives you a smooth signal.

Does the judge model matter?

Yes. Different judges produce different absolute scores, so use a mid-tier model and keep the choice fixed across the runs you want to compare. Relative rankings tend to stay trustworthy across judges; absolute scores are judge-bound.

What are the downsides?

It is slightly more expensive per call, because it needs token probabilities rather than just the top sample. The rubric prompt also matters a lot: a vague rubric gives you a noisy G-Eval score regardless of the weighting.