Glossary · RAG metric

Faithfulness

The fraction of factual claims in an answer that are supported by the retrieved context. Range 0 to 1, higher is better.

How it’s computed

Faithfulness is evaluated with an LLM-as-judge. The judge receives the answer and the retrieved context, decomposes the answer into atomic claims, and checks each claim against the context. The metric is:

faithfulness = count(claims supported by retrieved context)
               / count(all claims in the answer)

A score of 1.0 means every claim in the answer has backing in the context. A score of 0.5 means half the claims are ungrounded — i.e. the model added information the retriever didn’t surface.
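A minimal sketch of the scoring step, assuming the claims have already been extracted. Here toy_judge is a keyword-overlap stand-in for the LLM judge call, which this sketch does not implement:

    from typing import Callable, List

    def toy_judge(claim: str, context: str) -> bool:
        # Stand-in for the LLM judge: treat a claim as supported if every
        # content word appears in the context. Real judges reason
        # semantically, not lexically.
        content_words = {w for w in claim.lower().split() if len(w) > 3}
        return all(w in context.lower() for w in content_words)

    def faithfulness(claims: List[str], context: str,
                     judge: Callable[[str, str], bool] = toy_judge) -> float:
        # Fraction of atomic claims the judge marks as supported.
        if not claims:
            return 0.0  # the empty-answer convention varies across tools
        return sum(judge(c, context) for c in claims) / len(claims)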

Worked example

Question: “What is poka-yoke?”

Retrieved context: “Poka-yoke is a Japanese term coined by Shigeo Shingo in the 1960s at Toyota, referring to mechanisms that prevent errors from occurring in manufacturing processes.”

Answer: “Poka-yoke is a Japanese manufacturing term coined by Shigeo Shingo at Toyota in the 1970s. It refers to error-prevention mechanisms.”

Five atomic claims: (a) a Japanese manufacturing term ✓, (b) coined by Shigeo Shingo ✓, (c) at Toyota ✓, (d) in the 1970s ✗ (context says 1960s), (e) refers to error-prevention mechanisms ✓. Faithfulness = 4 / 5 = 0.8.
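The same arithmetic, with the judge's verdicts from this example written out by hand:

    # Worked example: five claims, one unsupported (1970s vs 1960s).
    verdicts = [True, True, True, False, True]
    score = sum(verdicts) / len(verdicts)
    print(score)  # 0.8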

How Yoke Agent uses it

Faithfulness is one of the four fixed metrics that the RAG workbench runs on every experiment — alongside answer relevancy, context precision and context recall. You see it per configuration in the leaderboard, per question in the drill-down, and as a rolling signal on the canary dataset in production.
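The production signal is, in effect, a rolling mean over recent canary scores. A sketch of that idea; the window size and alert floor below are illustrative assumptions, not Yoke defaults:

    from collections import deque

    WINDOW = 50          # assumed window size
    ALERT_FLOOR = 0.85   # assumed alert threshold

    recent = deque(maxlen=WINDOW)

    def record(score: float) -> None:
        # Append one canary faithfulness score and check the rolling mean.
        recent.append(score)
        rolling = sum(recent) / len(recent)
        if len(recent) == WINDOW and rolling < ALERT_FLOOR:
            print(f"faithfulness rolling mean {rolling:.2f} is below {ALERT_FLOOR}")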

The judge model is configured separately from the generator model. Yoke warns if you pick the same model for both — same-model faithfulness judging is well known to inflate scores.
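The check behind that warning is simple. A sketch, assuming model identifiers carry a family or provider prefix; the provider/model naming format is an assumption for illustration:

    def warn_if_same_family(generator_model: str, judge_model: str) -> None:
        # Same-family judging tends to inflate faithfulness scores.
        if generator_model.split("/")[0] == judge_model.split("/")[0]:
            print("warning: judge and generator share a model family")

    warn_if_same_family("acme/large", "acme/small")  # prints the warning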

Frequently asked

What’s the difference between faithfulness and hallucination?

Closely related, often exact complements. Faithfulness reports the fraction of grounded claims (higher is better); hallucination metrics report the presence or severity of ungrounded claims (lower is better). When hallucination is defined as the fraction of ungrounded claims, it equals 1 - faithfulness. Most teams track both.

Can faithfulness be 1.0 and the answer still be wrong?

Yes. Faithfulness only checks whether claims are supported by the retrieved context. If the retrieved context itself is wrong or outdated, a perfectly faithful answer will also be wrong. Pair faithfulness with context recall and a correctness metric against ground truth.

Which judge model should I use?

Use a different model family from the one generating the answer. Same-model judging inflates scores. A capable mid-tier model is usually good enough; flagship judges rarely pay off for faithfulness specifically.

How do I improve a low faithfulness score?

Three moves, in order: (1) tighten the system prompt to penalise unsupported claims, (2) improve context precision so the model sees less noise, (3) lower generation temperature. If faithfulness is still low, the retriever is likely missing relevant chunks — work on recall next.
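Moves (1) and (3) are both generation-side settings. A sketch of what they might look like together; the prompt wording, field names, and request shape are illustrative, not tied to any particular SDK:

    # Illustrative settings for moves (1) and (3). The request shape below
    # is generic; adapt the field names to your model client.
    SYSTEM_PROMPT = (
        "Answer using only the provided context. If the context does not "
        "contain the answer, say that it does not. Do not add facts that "
        "are not in the context."
    )

    request = {
        "system": SYSTEM_PROMPT,  # move (1): forbid unsupported claims
        "temperature": 0.2,       # move (3): lower temperature curbs embellishment
        "context": ["...retrieved chunks..."],  # placeholder
        "question": "What is poka-yoke?",
    }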