Tool-call accuracy.
Whether an agent invoked the correct tool with the correct arguments for the given user intent.
How it’s computed
Yoke supports two variants.
Deterministic: parse TOOL_CALL invocations from the transcript and diff the called tool name and arguments against the scenario's expected call. The score is 1 or 0 on exact match, or the fraction of matching arguments when arg-level credit is enabled.
Judge-based: an LLM rubric scores whether the invocation “looks correct” semantically — useful when the scenario rubric is loose (e.g. “call some reasonable search tool”).
accuracy = matched_calls / expected_calls
(with arg-level partial credit when enabled)
Worked example
The scenario expects search_invoices(customer_id=42, start_date="2026-01-01"). The agent called search_invoices(customer_id=42, start_date="2025-12-31"). The tool name matches and one of the two arguments matches, so the score is 0.5 with arg-level credit, or 0 with strict matching.
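The deterministic arg-level scoring described above can be sketched in a few lines. This is an illustrative reimplementation, not Yoke's actual API; the function name `score_call` is made up here.

```python
def score_call(expected_name, expected_args, actual_name, actual_args, arg_level=True):
    """Score one tool call against the scenario's expected call.

    Returns 1.0/0.0 in strict mode, or the fraction of expected
    arguments matched when arg-level credit is enabled.
    """
    if actual_name != expected_name:
        return 0.0  # wrong tool: no credit in either mode
    if not arg_level:
        return 1.0 if actual_args == expected_args else 0.0
    if not expected_args:
        return 1.0  # nothing to check beyond the tool name
    matched = sum(1 for k, v in expected_args.items() if actual_args.get(k) == v)
    return matched / len(expected_args)

# The worked example from above:
expected = {"customer_id": 42, "start_date": "2026-01-01"}
actual = {"customer_id": 42, "start_date": "2025-12-31"}
print(score_call("search_invoices", expected, "search_invoices", actual))  # 0.5
print(score_call("search_invoices", expected, "search_invoices", actual,
                 arg_level=False))  # 0.0
```

Per-scenario accuracy is then the mean of these per-call scores over the expected calls.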
How Yoke Agent uses it
The deterministic variant runs on every tool_use scenario. Unlike judge-based tool-call scoring, it does not drift with judge model changes and does not cost a judge call.
The judge-based variant is available as a fallback when the scenario has semantic rather than structural expectations.
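A minimal sketch of what the judge-based path might look like. The prompt wording and the yes/no grading convention here are assumptions for illustration, not Yoke's actual rubric format.

```python
def build_judge_prompt(rubric, tool_name, args):
    """Assemble a yes/no grading prompt for an LLM judge (illustrative only)."""
    rendered_args = ", ".join(f"{k}={v!r}" for k, v in args.items())
    return (
        "You are grading a tool invocation.\n"
        f"Rubric: {rubric}\n"
        f"Invocation: {tool_name}({rendered_args})\n"
        "Answer YES if the invocation satisfies the rubric, otherwise NO."
    )

prompt = build_judge_prompt(
    "call some reasonable search tool",
    "search_invoices",
    {"customer_id": 42},
)
# The prompt is sent to the judge model; YES maps to 1.0, NO to 0.0.
print(prompt)
```

This is where drift comes from: the same transcript can score differently across judge model versions, which is why the deterministic variant is the default.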
Frequently asked
Why is deterministic better than judge-based?
Two reasons. First, reproducibility — the same transcript always scores the same. Second, cost — no extra LLM calls. For tool-use scoring specifically, the structured nature of TOOL_CALL invocations means deterministic is both cheaper and more reliable.
What counts as a “correct” call?
Configured per scenario. The strictest mode requires exact tool name + exact arguments. Looser modes allow arg-level partial credit, arg-value tolerances (e.g. date ranges within +/- 1 day), or wildcard args that aren’t scored.
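The +/- 1 day date tolerance mentioned above could be implemented along these lines. This is a sketch using stdlib `datetime`, not Yoke's configuration syntax.

```python
from datetime import date, timedelta

def dates_within_tolerance(expected, actual, days=1):
    """True if two ISO-format dates differ by at most `days` days."""
    delta = abs(date.fromisoformat(expected) - date.fromisoformat(actual))
    return delta <= timedelta(days=days)

print(dates_within_tolerance("2026-01-01", "2025-12-31"))  # True: one day apart
print(dates_within_tolerance("2026-01-01", "2025-12-20"))  # False
```

With this tolerance enabled, the worked example above would score 1.0 instead of 0.5, since its start_date is off by exactly one day.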
What if my agent doesn’t use TOOL_CALL format?
Use the judge-based variant, or adapt your agent's tool-call output to something parseable; the reproducible, judge-free scoring that unlocks is worth the change.
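As a sketch of what "parseable" means here, a line format such as `TOOL_CALL name {json args}` can be extracted with a short regex. The exact line format is an assumption for illustration; Yoke's real parser may expect something different.

```python
import json
import re

# One call per line: TOOL_CALL <name> <JSON object of arguments>
TOOL_CALL_RE = re.compile(r"^TOOL_CALL (\w+) (\{.*\})$", re.MULTILINE)

def parse_tool_calls(transcript):
    """Extract (tool_name, args_dict) pairs from TOOL_CALL lines."""
    return [(name, json.loads(args))
            for name, args in TOOL_CALL_RE.findall(transcript)]

transcript = (
    "I will look that up.\n"
    'TOOL_CALL search_invoices {"customer_id": 42}\n'
    "Done."
)
print(parse_tool_calls(transcript))  # [('search_invoices', {'customer_id': 42})]
```

Once calls parse into (name, args) pairs like this, the deterministic diff against the scenario's expected call is straightforward.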