Deep dive · 07 Apr 2026 · 11 min read

The 14 agent evaluation metrics Yoke ships (and why).

Agent evaluation is harder than RAG evaluation because the thing you are scoring is a trajectory, not an answer. Here is the complete rubric Yoke Agent ships, and when to reach for each metric.

When you evaluate a RAG pipeline you score a question-answer pair: input, context, output. When you evaluate an agent you score a transcript — the whole multi-turn conversation, including every tool call, every retry, every guardrail that fired. The scoring surface is orders of magnitude larger, and “was this any good” does not decompose cleanly into one number.

Yoke Agent ships 14 rubric metrics built on top of G-Eval, plus one deterministic override for tool-call accuracy that does not depend on LLM-as-judge at all. This post walks through all of them with concrete examples, so you can pick the three or four that matter for your agent.

The deterministic override comes first

Before the 14 rubric metrics, there is one signal that matters more than any of them: did the agent call the right tool with the right arguments? Everything else is downstream of that.

Yoke parses TOOL_CALL invocations from the transcript directly. No LLM judge. If the rubric for a scenario says “call search_invoices(customer_id=42, start_date="2026-01-01")” and the agent called search_invoices(customer_id=42, start_date="2025-12-31"), that is a deterministic miss with a visible diff. This turns out to be the single most useful agent signal in production.
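
To make the diff concrete, here is a minimal sketch of the deterministic check, assuming tool calls land in the transcript as TOOL_CALL lines carrying a tool name and JSON arguments (Yoke's actual wire format may differ):

import json

def tool_call_matches(transcript_lines, expected_name, expected_args) -> bool:
    # Scan for TOOL_CALL lines and compare tool name plus parsed arguments exactly.
    for line in transcript_lines:
        if not line.startswith("TOOL_CALL "):
            continue
        _, name, raw_args = line.split(" ", 2)  # assumed format: TOOL_CALL <name> <json>
        if name == expected_name and json.loads(raw_args) == expected_args:
            return True
    return False

# The start_date off-by-one from above is a visible, deterministic miss:
tool_call_matches(
    ['TOOL_CALL search_invoices {"customer_id": 42, "start_date": "2025-12-31"}'],
    "search_invoices",
    {"customer_id": 42, "start_date": "2026-01-01"},
)  # -> False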

The 14 rubric metrics

All 14 are G-Eval-style rubrics — an LLM is prompted to score a specific quality on a 0-to-1 scale with explicit criteria. You enable the subset that matters for the agent; you do not need all 14 on every run.

1. Tool-call accuracy (judge-based)

The rubric version of the deterministic check above. Scores 0 to 1 based on whether each tool invocation looks correct semantically, not just structurally. Use it as a fallback when your scenario rubric is loose (“call some reasonable search tool”) rather than strict.

2. Task completion

Did the agent actually accomplish the user’s goal by the end of the transcript? Independent of how cleanly it got there. This is the headline metric for most agent teams — if it drops, nothing else matters.
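
A rough sketch of what that looks like as a G-Eval-style judge prompt (the wording is illustrative, not Yoke's exact template; llm stands in for whatever judge-model call you use, and G-Eval proper also weights scores by token probabilities, which this skips):

TASK_COMPLETION_RUBRIC = """\
You are grading an agent transcript. Score from 0.0 to 1.0.
- 1.0: the user's goal is fully accomplished by the final turn.
- 0.5: partially accomplished, or accomplished with unresolved errors.
- 0.0: not accomplished.
Return only the number.

Transcript:
{transcript}
"""

def judge_score(llm, transcript: str) -> float:
    reply = llm(TASK_COMPLETION_RUBRIC.format(transcript=transcript))
    # Clamp in case the judge drifts slightly out of range.
    return max(0.0, min(1.0, float(reply.strip())))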

3. Goal alignment

Did the agent interpret the user’s goal correctly in the first place? Distinguishes between “the agent did the wrong task perfectly” (goal alignment low, task completion high) and “the agent understood but failed” (goal alignment high, task completion low). These fail differently and need different fixes.

4. Decision-path quality

Scores the trajectory against the optimal ordering of steps. Did the agent call tools in a reasonable order, or did it make a round-trip it didn’t need? Catches “eventually got there” trajectories that would score well on task completion but cost 3× as much.
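
If you want a cheap deterministic proxy before reaching for the judge, edit distance between the expected tool sequence and the one the agent actually took gives a rough signal. This is our own sketch, not a built-in Yoke metric:

def step_order_distance(expected: list[str], actual: list[str]) -> int:
    # Levenshtein distance over tool names: each inserted, dropped,
    # or substituted step costs 1.
    m, n = len(expected), len(actual)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if expected[i - 1] == actual[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

# The needless round-trip costs one edit:
step_order_distance(["search", "summarize"], ["search", "search", "summarize"])  # -> 1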

5. Context utilization

Did the agent actually use the information in its context window, or did it ignore retrieved documents / prior turns and wing it? Low scores correlate strongly with hallucination.

6. Response quality

The quality of the final natural-language reply to the user — clarity, correctness, helpfulness, tone. Separate from task completion (the agent can complete the task and still give a terrible explanation).

7. Faithfulness

Identical to the RAG metric. Every factual claim the agent makes should be supported by the tool outputs or retrieved context. Your hallucination detector in conversational form.

8. Hallucination

Complement to faithfulness — explicitly scores the presence of unsupported factual claims. You usually want both: faithfulness tells you what fraction is grounded, hallucination tells you the severity of the ungrounded parts.

9. Refusal accuracy

When the agent refused, was the refusal correct? When it didn’t refuse, should it have? Critical for any agent with guardrails (regulatory, safety, permission).

10. Guardrail adherence

Scored against the specific guardrails configured for the agent. Did it honor the system prompt? Did it stay on topic? Did it escalate when it should have?

11. Persona consistency

If the agent has a defined voice / persona / style guide, did it maintain it across the transcript? Drift matters more than average adherence — an agent that starts helpful and ends curt is a worse experience than one that is uniformly mediocre.

12. Error recovery

When a tool returned an error, did the agent recover gracefully? Fallback to an alternative tool, re-prompt the user, or roll over and die? Separates agents that survive the real world from agents that only work in the happy path.

13. Clarification quality

When the agent asked a clarifying question, was the question necessary and well-formed? Over-asking is its own failure mode — some agents paralyze users by requesting confirmation for every action.

14. Safety

Aggregate of toxicity, bias, PII leakage and policy violations in the agent output. You almost always want this on as a blocking gate, not a reported-and-ignored metric.
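
“Blocking gate” here just means the run fails on a low safety score no matter how good everything else looks. A minimal sketch of that logic (the threshold and key name are our assumptions, not Yoke defaults):

def passes_safety_gate(scores: dict[str, float], floor: float = 0.95) -> bool:
    # A missing safety score fails closed rather than open.
    return scores.get("safety", 0.0) >= floor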

How to pick your subset

You do not want 14 metrics on every run. It is noisy, it is expensive, and the per-metric judge calls multiply. A practical starting set:

  • Tool-call accuracy (deterministic) — always on, costs nothing.
  • Task completion — the headline.
  • Faithfulness + hallucination — paired.
  • Safety — always on as a gate.

That is 5 metrics for roughly $0.02-$0.05 per transcript on flex-tier pricing, which is tolerable for grid-search. Add the others when a specific failure mode shows up — if users complain about tone, turn on persona consistency; if the agent keeps asking unnecessary clarifications, turn on clarification quality.
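
If it helps to see the arithmetic: the deterministic check is free, so you are paying for four judge calls per transcript. A hypothetical encoding of the starting set (metric names are illustrative, not Yoke's schema):

STARTING_METRICS = {
    "tool_call_accuracy_det": True,  # deterministic, costs nothing
    "task_completion": True,         # the headline
    "faithfulness": True,            # paired with hallucination
    "hallucination": True,
    "safety": True,                  # blocking gate
}

# Four judge-backed metrics at roughly $0.005-$0.012 per call lands in
# the $0.02-$0.05 per-transcript range quoted above.
judge_calls = sum(
    1 for name, enabled in STARTING_METRICS.items()
    if enabled and name != "tool_call_accuracy_det"
)  # -> 4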

Scenario types and which metrics they trigger

Yoke ships four scenario types, each with a default metric bundle:

Scenario            Default metrics
tool_use            Tool-call accuracy (det), decision-path quality, error recovery, task completion
decision_path       Decision-path quality, goal alignment, task completion
response_quality    Response quality, faithfulness, hallucination, persona consistency
guardrail           Refusal accuracy, guardrail adherence, safety

Mix and match scenarios across personas to get a coverage matrix that surfaces every failure mode without running every metric on every transcript.
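
One way to sketch that coverage matrix, using the default bundles from the table above (how scenarios get assigned to personas is up to your suite):

DEFAULT_BUNDLES = {
    "tool_use": ["tool_call_accuracy_det", "decision_path_quality",
                 "error_recovery", "task_completion"],
    "decision_path": ["decision_path_quality", "goal_alignment", "task_completion"],
    "response_quality": ["response_quality", "faithfulness",
                         "hallucination", "persona_consistency"],
    "guardrail": ["refusal_accuracy", "guardrail_adherence", "safety"],
}

def coverage(scenarios_by_persona: dict[str, list[str]]) -> dict[str, list[str]]:
    # Union of the metric bundles each persona's scenarios will exercise.
    return {
        persona: sorted({m for s in scens for m in DEFAULT_BUNDLES[s]})
        for persona, scens in scenarios_by_persona.items()
    }

# e.g. coverage({"impatient_admin": ["tool_use", "guardrail"]})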

Extending with custom metrics

The 14 cover the common ground. Your agent almost certainly has one or two domain-specific quality rules — citation style, regulatory boilerplate, specific phrasing to avoid. Write those as a custom metric hook:

def citation_coverage(transcript, persona, scenario) -> float:
    # extract_factual_claims and has_citation are your own domain helpers.
    claims = extract_factual_claims(transcript.final_response)
    cited = [c for c in claims if has_citation(c)]
    return len(cited) / max(len(claims), 1)

Register it once in the backend and it shows up in the leaderboard alongside the built-ins. This is how teams keep the eval rubric tight to the product over time.


If you want to see the metrics in action against your own agent, clone Yoke Agent, point it at your system prompt and tools, and the first persona simulation runs in about five minutes.