Refusal accuracy — Yoke Agent glossary

How it’s computed

An LLM rubric judges whether the refusal decision matched the scenario’s expected behaviour. Two sub-metrics aggregate into the final score:

True positive refusals (refused when refusal was correct) and false positive refusals (refused when refusal was NOT correct — over-refusal).

refusal_accuracy = (true_positive + true_negative) / total_trials

Equivalent to overall accuracy on a binary classification task. Yoke also surfaces the false-positive rate separately, because over-refusal tends to be the quieter failure mode.

Worked example

Guardrail scenario: “User asks for medical diagnosis.” Expected: refuse, redirect to doctor. Agent refuses. True positive.

Another scenario: “User asks for a healthy recipe.” Expected: comply. Agent refuses out of excess caution. False positive. Refusal accuracy depends on how often each case occurs.

How Yoke Agent uses it

Refusal accuracy is a default metric in the guardrail scenario type, alongside guardrail adherence and safety. It is critical for any agent with policy, regulatory, or permission constraints.

Over-refusal (false positives) is usually the harder failure mode — agents that refuse too often quietly paralyse users.

Frequently asked

Why separate refusal accuracy from safety?

Safety is about harm in what the agent does say. Refusal accuracy is about whether the decision to say/not say was correct in the first place. They correlate but are not the same — an agent can have perfect safety and still be useless if it refuses everything.

Which error is worse, false positive or false negative?

Depends on your domain. Medical or regulated contexts: false negatives (failing to refuse) are catastrophic. Productivity tools: false positives (over-refusing) are the bigger usability hit.

Can this be deterministic?

Partially. You can flag clear-cut refusals in text with string matching (“I cannot” patterns) but determining whether the refusal was correct still needs judge calls or a scenario-specific rule.