Answer relevancy.
How well the answer responds to the question that was actually asked. Range 0 to 1, higher is better.
How it’s computed
The judge prompts an LLM to reverse-engineer N plausible questions from the answer alone, then computes the cosine similarity between each reverse-engineered question's embedding and the original question's embedding, and averages the results.
answer_relevancy = mean(cos(embed(q_i), embed(q_original))) for i = 1..N
(typically N = 3 to 5)
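A minimal sketch of the scoring step in Python, assuming the reverse-engineered questions have already been produced by the judge. The embedding step is left abstract: the function takes precomputed vectors, and nothing here is a real Yoke Agent API.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_emb: np.ndarray, reverse_embs: list[np.ndarray]) -> float:
    # Mean cosine similarity between the original question's embedding
    # and each reverse-engineered question's embedding.
    return float(np.mean([cosine(original_emb, q) for q in reverse_embs]))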
A high score means that the answer, read in isolation, would plausibly prompt the question the user actually asked. A low score means the answer has drifted onto a different topic.
Worked example
Original question: “What vector stores does Yoke support?”
Answer: “Yoke Agent is an open-source evaluation studio.”
The reverse-engineered questions will look like “What is Yoke Agent?”, whose cosine similarity to the original question is around 0.2. The low score signals that the answer is on the wrong topic.
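You can reproduce the example with any sentence-embedding model; this is a hedged sketch using sentence-transformers, where the model choice is arbitrary and the exact score will vary by model.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary model choice
original = "What vector stores does Yoke support?"
reverse_questions = [
    "What is Yoke Agent?",
    "What kind of tool is Yoke Agent?",
    "Is Yoke Agent open source?",
]
# normalize_embeddings=True makes the dot product equal cosine similarity.
embs = model.encode([original] + reverse_questions, normalize_embeddings=True)
score = float(np.mean(embs[1:] @ embs[0]))
print(round(score, 2))  # expect a low value for this off-topic answer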
How Yoke Agent uses it
Answer relevancy is one of the RAGAS core four in every RAG experiment. It is surfaced in the leaderboard, the per-question drill-down, and canary monitoring, so you can catch regressions where the generator starts drifting away from the question being asked.
Pair it with faithfulness: high relevancy with low faithfulness is the classic hallucination pattern, where the answer stays on topic but invents claims.
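If you export experiment results to a table, that pattern is easy to flag programmatically. A hypothetical sketch, where the filename, column names, and thresholds are assumptions rather than Yoke Agent's actual export schema:

import pandas as pd

# Hypothetical export schema and thresholds; adjust to your own results table.
df = pd.read_csv("experiment_results.csv")
hallucinations = df[(df["answer_relevancy"] > 0.8) & (df["faithfulness"] < 0.5)]
print(hallucinations[["question", "answer_relevancy", "faithfulness"]])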
Frequently asked
Can answer relevancy be high while faithfulness is low?
Yes: the answer responds to the right question but makes things up. That is the classic hallucination pattern, and it is why you always pair the two metrics.
Why reverse-engineer multiple questions instead of one?
A single reverse-engineered question is noisy. Averaging over 3 to 5 reduces the variance and makes the metric more stable across runs.
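A toy numpy simulation (not the judge's actual noise model) shows the effect: the spread of the averaged score shrinks roughly as 1/sqrt(N).

import numpy as np

rng = np.random.default_rng(0)
# Simulate noisy per-question cosine similarities around a "true" relevancy of 0.75.
sims = rng.normal(loc=0.75, scale=0.15, size=(10_000, 5))
print(round(sims[:, 0].std(), 3))         # spread with a single reverse question, ~0.15
print(round(sims.mean(axis=1).std(), 3))  # spread of the mean over N=5, ~0.15/sqrt(5) ≈ 0.067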
How do I improve a low score?
Usually the answer is padded with retrieved context that is related but not responsive. Tighten the system prompt so the model answers only the question that was asked (a sketch follows below), or add a re-ranker so the generator sees better context.
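A hedged sketch of what a tightened system prompt might look like; the wording is illustrative, not Yoke Agent's or any model's actual default.

# Illustrative wording only; this is not Yoke Agent's actual default prompt.
SYSTEM_PROMPT = (
    "Answer only the question that was asked. Use the retrieved context to "
    "support your answer, but do not summarize or restate context that is "
    "not needed. If the context does not answer the question, say so rather "
    "than answering a related question."
)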