Glossary · Agent metric

Tool-call accuracy

Whether an agent invoked the correct tool with the correct arguments for the given user intent.

How it’s computed

Yoke supports two variants.

Deterministic: parse TOOL_CALL invocations from the transcript, then diff the called tool name and arguments against the scenario’s expected call. The score is either exact-match (0 or 1) or an arg-level match ratio.
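
A minimal sketch of the deterministic variant. The `TOOL_CALL name(arg=value, ...)` line format and the helper names here are assumptions for illustration, not Yoke’s actual parser:

```python
import re

# Assumed transcript format: TOOL_CALL name(arg=value, ...)
CALL_RE = re.compile(r'TOOL_CALL\s+(\w+)\((.*)\)')

def parse_call(line):
    """Extract (tool_name, {arg: value}) from a TOOL_CALL line, or None."""
    m = CALL_RE.search(line)
    if not m:
        return None
    args = {}
    for part in m.group(2).split(","):
        if "=" in part:
            k, v = part.split("=", 1)
            args[k.strip()] = v.strip().strip('"')
    return m.group(1), args

def score_call(called, expected, partial=True):
    """1.0 for an exact match; with partial credit, the fraction of
    expected args whose values match. A wrong tool name scores 0."""
    if called is None or called[0] != expected[0]:
        return 0.0
    if not partial:
        return 1.0 if called[1] == expected[1] else 0.0
    if not expected[1]:
        return 1.0
    hits = sum(1 for k, v in expected[1].items() if called[1].get(k) == v)
    return hits / len(expected[1])
```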

Judge-based: an LLM rubric scores whether the invocation “looks correct” semantically — useful when the scenario rubric is loose (e.g. “call some reasonable search tool”).
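
The judge call itself is model-dependent, but the rubric prompt can be sketched roughly as follows. The prompt wording and function name are hypothetical, not Yoke’s real rubric format:

```python
def build_judge_prompt(tool_call: str, rubric: str) -> str:
    """Hypothetical rubric prompt for an LLM judge; Yoke's actual format may differ."""
    return (
        "Score the agent's tool call against the rubric below.\n"
        f"Rubric: {rubric}\n"
        f"Tool call: {tool_call}\n"
        "Reply with a single number between 0 and 1."
    )
```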

accuracy = matched_calls / expected_calls
(with arg-level partial credit when enabled)
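
In code, the aggregation is just the following, assuming each call has already been given a per-call score of 1.0, a partial-credit fraction, or 0:

```python
def accuracy(per_call_scores):
    """matched_calls / expected_calls. With arg-level credit enabled,
    each element may be a fraction rather than strictly 0 or 1."""
    if not per_call_scores:
        return 0.0
    return sum(per_call_scores) / len(per_call_scores)
```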

Worked example

Scenario expects search_invoices(customer_id=42, start_date="2026-01-01"). The agent called search_invoices(customer_id=42, start_date="2025-12-31"). Tool name matches, one of two args matches. Score = 0.5 with arg-level credit; 0 with strict match.
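
The same worked example as a few lines of Python (variable names are illustrative):

```python
expected = {"customer_id": 42, "start_date": "2026-01-01"}
called   = {"customer_id": 42, "start_date": "2025-12-31"}

name_match = True  # both calls target search_invoices
matched = sum(1 for k in expected if called.get(k) == expected[k])  # 1 of 2 args

arg_level = matched / len(expected) if name_match else 0.0    # 0.5
strict = 1.0 if (name_match and called == expected) else 0.0  # 0.0
```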

How Yoke Agent uses it

The deterministic variant runs on every tool_use scenario. Unlike judge-based tool-call scoring, it does not drift with judge model changes and does not cost a judge call.

The judge-based variant is available as a fallback when the scenario has semantic rather than structural expectations.

Frequently asked

Why is deterministic better than judge-based?

Two reasons. First, reproducibility — the same transcript always scores the same. Second, cost — no extra LLM calls. For tool-use scoring specifically, the structured nature of TOOL_CALL invocations means deterministic is both cheaper and more reliable.

What counts as a “correct” call?

Configured per scenario. The strictest mode requires exact tool name + exact arguments. Looser modes allow arg-level partial credit, arg-value tolerances (e.g. date ranges within +/- 1 day), or wildcard args that aren’t scored.
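
A sketch of the looser matching modes described above. The helper names and the shape of the tolerance config are assumptions; the real per-scenario configuration may differ:

```python
from datetime import date, timedelta

def dates_close(a: str, b: str, days: int = 1) -> bool:
    """ISO dates match if they fall within +/- `days` of each other."""
    return abs(date.fromisoformat(a) - date.fromisoformat(b)) <= timedelta(days=days)

def match_arg(key, called_val, expected_val, tolerances=None, wildcards=()):
    """Per-arg matching rules: wildcard args always pass (not scored),
    tolerant args use their comparator, everything else needs equality."""
    if key in wildcards:
        return True
    if tolerances and key in tolerances:
        return tolerances[key](called_val, expected_val)
    return called_val == expected_val
```

Under this sketch, the worked example’s off-by-one date would pass with a date tolerance: `match_arg("start_date", "2025-12-31", "2026-01-01", tolerances={"start_date": dates_close})` returns `True`.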

What if my agent doesn’t use TOOL_CALL format?

Use the judge-based variant. Or adapt your agent to emit a parseable tool-call format; the reproducibility and cost savings usually justify the change.