Noise sensitivity.
How much answer quality drops when irrelevant context is injected alongside relevant chunks. Lower is better.
How it’s computed
Run the same question twice: once with clean context (only relevant chunks), and once with the actual retrieval output (relevant chunks plus some noise). The drop in faithfulness or correctness between the two runs is the noise sensitivity.
noise_sensitivity = score(clean) − score(with_noise)
A score near zero means the generator is robust — it ignores irrelevant chunks and produces the same answer either way. A large positive delta means the noise is actively degrading the output.
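The delta computation is trivial; the work is in producing the two scores. A minimal sketch, where the faithfulness scores are assumed to come from whatever evaluator the pipeline already uses (the numbers below are the worked example from this page, not real measurements):

```python
def noise_sensitivity(score_clean: float, score_noisy: float) -> float:
    """Quality delta between a clean-context run and a run with realistic
    retrieval noise. Near zero = robust generator; large positive = noise
    is degrading the output."""
    return score_clean - score_noisy

# Faithfulness 0.92 on hand-picked context, 0.78 on the real retriever's top-5.
delta = noise_sensitivity(0.92, 0.78)
print(round(delta, 2))  # 0.14
```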
Worked example
Faithfulness with clean hand-picked context = 0.92. Faithfulness with the real retriever’s top-5 (which includes 2 irrelevant chunks) = 0.78. Noise sensitivity = 0.14 — the model loses 14 percentage points when facing realistic retrieval noise.
How Yoke Agent uses it
Noise sensitivity is an optional RAGAS metric, run on demand rather than by default. It is particularly useful when debugging pipelines that score well on small test sets but drift on larger corpora where retrievers return more noise.
Frequently asked
Why does this matter?
Real retrievers always return some noise. A model that collapses when given a couple of irrelevant chunks will perform worse in production than clean-set evaluation suggests.
How do I reduce noise sensitivity?
Two paths: make the retriever cleaner (rerankers, a smaller top-k, stricter similarity thresholds), or make the generator more robust to noise (tighter prompts, lower temperature, explicit "ignore irrelevant context" instructions).
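For the second path, the "ignore irrelevant context" instruction can be baked directly into the prompt. A hypothetical sketch (the prompt wording and `build_prompt` helper are illustrative, not a prescribed template):

```python
# Hypothetical system prompt hardening the generator against retrieval noise.
ROBUST_SYSTEM_PROMPT = (
    "Answer using only the context passages that are relevant to the question. "
    "If a passage is unrelated to the question, ignore it entirely; "
    "do not let irrelevant passages change your answer."
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a generation prompt with numbered context chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{ROBUST_SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```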
How many noise chunks should I inject for testing?
Two or three. More than that and you are stress-testing a pathological case rather than simulating realistic production retrieval.
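A controlled-injection harness for this test can be sketched as follows (a minimal illustration; the chunk lists and seed are assumptions, and the shuffle keeps chunk position from revealing which entries are noise):

```python
import random

def inject_noise(relevant: list[str], noise_pool: list[str],
                 n_noise: int = 2, seed: int = 0) -> list[str]:
    """Build a test context: the relevant chunks plus a small number of
    sampled noise chunks, shuffled together. Score the answer on this
    context against the clean run to get noise sensitivity."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    noisy = relevant + rng.sample(noise_pool, n_noise)
    rng.shuffle(noisy)
    return noisy
```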