
Hallucination.

The proportion of factual claims in the output that are not supported by the retrieved context or the input. Lower is better.

How it’s computed

Roughly the complement of faithfulness. The judge decomposes the answer into atomic claims and flags those not grounded in the retrieved context.

hallucination = count(ungrounded_claims) / count(all_claims)

Some implementations weight by severity — a fabricated number is worse than an unsupported adjective — so the score is not purely a count ratio in every flavour of the metric.
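A minimal sketch of both variants, assuming the judge has already decomposed the answer into claims with grounded flags. The `Claim` shape, the `severity` field, and its weights are illustrative, not Yoke Agent's actual types:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    grounded: bool          # judge found support in the retrieved context
    severity: float = 1.0   # weight used by the severity-aware variant

def hallucination_score(claims: list[Claim], weighted: bool = False) -> float:
    """Fraction of claims not grounded in the retrieved context (lower is better)."""
    if not claims:
        return 0.0
    if weighted:
        # Severity-weighted flavour: a fabricated number (high weight)
        # hurts more than an unsupported adjective (low weight).
        total = sum(c.severity for c in claims)
        return sum(c.severity for c in claims if not c.grounded) / total
    return sum(1 for c in claims if not c.grounded) / len(claims)
```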

Worked example

Same scenario as the faithfulness entry. The answer contains three claims; one is unsupported (the “1970s” date that should have been 1960s). Hallucination = 1 / 3 ≈ 0.33.
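Plugging the worked example into the sketch above (the claim texts are placeholders for the actual decomposition):

```python
claims = [
    Claim("first supported claim", grounded=True),
    Claim("second supported claim", grounded=True),
    Claim("invented in the 1970s", grounded=False),  # context says 1960s
]
print(hallucination_score(claims))  # 0.333...
```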

How Yoke Agent uses it

Hallucination is paired with faithfulness on every RAG and agent evaluation. Faithfulness tells you what fraction of the answer is grounded; hallucination surfaces the ungrounded claims themselves and, in severity-weighted variants, how bad they are. You want both.

Frequently asked

Why track both if they’re complements?

Faithfulness gives you coverage (how much of the answer is right); hallucination points at the misses and, when severity-weighted, at how bad they are. A faithfulness of 0.9 can hide a single catastrophic hallucination that matters more than the nine correct claims around it.
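To make that concrete with the severity-weighted sketch from above (the weight of 5 for a fabricated figure is an assumed value, not a standard one):

```python
# Nine grounded claims plus one fabricated figure.
claims = [Claim(f"grounded claim {i}", grounded=True) for i in range(9)]
claims.append(Claim("fabricated figure", grounded=False, severity=5.0))

print(1 - hallucination_score(claims))             # faithfulness: 0.9
print(hallucination_score(claims, weighted=True))  # ~0.36, flags the problem
```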

Is hallucination always bad?

For grounded Q&A, yes. For creative writing or brainstorming tasks, “unsupported by context” is the point — don’t run this metric there.

How do I reduce it?

Tighter system prompts (“only use facts from the context”), lower generation temperature, and context precision improvements (less noise for the model to mix up).
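A minimal sketch of the first two levers; the prompt wording and the temperature value are illustrative, not Yoke Agent defaults:

```python
# Grounding-focused system prompt plus a low temperature to curb invention.
SYSTEM_PROMPT = (
    "Answer using only facts stated in the provided context. "
    "If the context does not contain the answer, say so. "
    "Do not introduce names, dates, or numbers absent from the context."
)
GENERATION_CONFIG = {"temperature": 0.1}
```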