Guide · 21 Apr 2026 · 14 min read

How to evaluate a RAG pipeline end-to-end in 2026.

Most teams can build a retrieval-augmented generation pipeline in a weekend. Evaluating one rigorously enough to ship still takes weeks — and almost always for the wrong reasons. Here is the six-stage workflow that actually works.

A retrieval-augmented generation (RAG) pipeline has at least four knobs that materially change the answer: how you chunk documents, which embedding model you use, which retriever (and reranker) you pick, and which advanced strategy — if any — you layer on top. Each knob has a half-dozen defensible settings. You multiply them out and quickly end up with a few hundred configurations, any of which might be the right one to ship. Evaluation is how you stop guessing.

In 2025 the community mostly agreed on the metrics. In 2026 the bottleneck has shifted: metrics are a solved problem, but actually running them across a grid, keeping costs in check, and translating the numbers into a shipping decision is where most teams still stall. This guide is the end-to-end workflow we use, broken into six stages you can execute in order.

The six stages, at a glance

  1. Build (or steal) a test dataset.
  2. Pick your metrics.
  3. Define your grid axes.
  4. Run the search.
  5. Read the improvement report.
  6. Monitor in production.

Stages 1 through 5 are a one-time investment per project. Stage 6 is the forever tax. Skip any of them and you will pay interest later.

Stage 1 — Build (or steal) a test dataset

Everything downstream depends on the quality of your test set. Concretely: a reasonable evaluation dataset for a RAG pipeline has 50 to 200 Q&A pairs, each with a question, the ground-truth answer, and ideally a pointer to the source chunk(s). Fewer than 50 and your grid-search noise swamps the signal. More than 200 and the cost of every sweep balloons.

You have three practical ways to produce this set:

  • LLM-generated from your own corpus. Sample chunks, ask a large model to write plausible questions and answers for each, then have humans review. Ragas popularised the technique; it is fast and reasonably high signal, but you must review the output or you will silently train yourself on the judge model’s biases.
  • Curated by domain experts. Expensive, slow, and the only option when your corpus has real stakes (legal, medical, financial). Budget one expert-week per 100 Q&A pairs.
  • Public benchmarks. HotPotQA, MultiHop-RAG, FinanceBench and similar. Useful for sanity-checking a pipeline against a neutral baseline, rarely sufficient for shipping decisions on your own corpus.

In practice you want a hybrid: generate with an LLM, spend an afternoon pruning the obvious garbage, then lock the set with an approve step so it cannot silently drift between sweeps.
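
To make the LLM-generated route concrete, here is a minimal sketch in Python. It assumes the OpenAI client; load_chunks is a hypothetical loader for your already-chunked corpus, and the prompt is deliberately bare. Tune both to your domain, and review everything before you lock the set.

    import json
    import random
    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "Write one question that the following passage fully answers, "
        "then the answer, as JSON with keys 'question' and 'answer'.\n\n"
        "Passage:\n{chunk}"
    )

    def generate_pairs(chunks, n=100, model="gpt-4o-mini"):
        pairs = []
        for chunk in random.sample(chunks, n):
            resp = client.chat.completions.create(
                model=model,
                response_format={"type": "json_object"},
                messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
            )
            pair = json.loads(resp.choices[0].message.content)
            pair["source_chunk"] = chunk  # keep the pointer for context recall later
            pairs.append(pair)
        return pairs

    # Dump to a file a human can prune, then lock the survivors.
    pairs = generate_pairs(load_chunks())  # load_chunks: your corpus loader
    with open("testset_v1.json", "w") as f:
        json.dump(pairs, f, indent=2)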

Stage 2 — Pick your metrics

There are more RAG evaluation metrics published than any team will ever run. The four you actually need first are the Ragas core:

  • Faithfulness. Does the answer stay grounded in the retrieved context? A faithfulness score of 0.9 means 90% of the factual claims in the answer are supported by the retrieved chunks. This is your hallucination detector.
  • Answer relevancy. Does the answer actually respond to the question? Low scores here usually mean the model is padding with retrieved context that is related but not responsive.
  • Context precision. Are the retrieved chunks the right ones? Measures the signal-to-noise ratio of your retrieval.
  • Context recall. Did the retriever find everything it needed? Measures coverage.

Those four form the minimum viable signal. Add the following selectively, in this order, as your needs grow:

  • Correctness (if you have ground-truth answers).
  • Answer similarity (semantic match against a reference).
  • Entity recall (for extractive Q&A tasks).
  • Noise sensitivity (resilience when irrelevant context is injected).
  • A custom metrics hook for domain-specific rules — citation coverage, PII leakage, regulatory compliance.
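
The custom hook is the one place you write real assertions rather than judge prompts. A deliberately simple example of a domain rule, assuming answers cite sources with bracketed markers like [3]; adapt the pattern to your own citation format.

    import re

    def citation_coverage(answer: str) -> float:
        """Fraction of sentences that carry a citation marker like [3]."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
        if not sentences:
            return 0.0
        cited = sum(1 for s in sentences if re.search(r"\[[^\]]+\]", s))
        return cited / len(sentences)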

Nearly all of these are LLM-as-judge calls; the usual exceptions are answer similarity (an embedding comparison) and the domain-specific hook. That matters: absolute scores are not ground truth, but relative rankings across a grid are reliably informative. Read that sentence twice.
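
For orientation, here is what scoring a set on the core four looks like, sketched against the classic Ragas evaluate() API. Import paths have moved between releases, so check the version you have pinned.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )

    rows = {
        "question":     ["What is the refund window?"],
        "answer":       ["Refunds are accepted within 30 days of purchase."],
        "contexts":     [["Our policy allows refunds within 30 days of purchase."]],
        "ground_truth": ["30 days from the date of purchase."],
    }

    result = evaluate(
        Dataset.from_dict(rows),
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(result)  # per-metric means, e.g. {'faithfulness': 1.0, ...}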

Stage 3 — Define your grid axes

This is where most teams either under-shoot (testing chunk-size and top-k only) or over-shoot (testing every embedding model on Hugging Face). A pragmatic first grid has four axes:

  1. Chunking — fixed 256 tokens, fixed 512 tokens, semantic, and recursive-with-overlap. Four values.
  2. Embeddings — the default from your provider, one cheaper/faster alternative, and one strong-but-expensive option. Three values.
  3. Retrieval — dense only, hybrid (dense + BM25), and re-ranked (dense + Cohere rerank or equivalent). Three values.
  4. Advanced strategy — none, HyDE, CRAG, Self-RAG, or a multi-query variant. Start with two values (none and one contender).

That grid is 4 × 3 × 3 × 2 = 72 combinations. With a 100-question dataset and the Ragas core, you are looking at roughly 7,200 retrieval-and-generation runs and, at roughly one judge call per metric, close to 29,000 judge calls. Budget $30 to $80 on flex-tier pricing, or $150 to $400 on standard tier. Run the numbers before you start the sweep; the sketch below does exactly that.
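
Writing the grid down as data makes both the cross product and the bill fall out of a few lines. The per-judge-call price below is a placeholder, not a quote; substitute your provider's current rates.

    from itertools import product

    GRID = {
        "chunking":   ["fixed_256", "fixed_512", "semantic", "recursive_overlap"],
        "embeddings": ["provider_default", "cheap_fast", "strong_expensive"],
        "retrieval":  ["dense", "hybrid_bm25", "reranked"],
        "strategy":   ["none", "hyde"],
    }

    configs = [dict(zip(GRID, vals)) for vals in product(*GRID.values())]
    print(len(configs))  # 4 * 3 * 3 * 2 = 72

    N_QUESTIONS = 100
    N_METRICS = 4                        # the Ragas core
    runs = len(configs) * N_QUESTIONS    # 7,200 retrieval + generation runs
    judge_calls = runs * N_METRICS       # ~28,800 judge calls
    print(f"~${judge_calls * 0.002:,.0f}")  # placeholder $ per judge call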

A note on advanced strategies

HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query and Agentic are the strategies worth testing in 2026. They are not plugins or wrappers — each changes how retrieval happens fundamentally, which means they belong as grid axes, not as post-hoc add-ons. Two things to know:

  • GraphRAG and RAPTOR are expensive to index. Plan for hour-long first runs on a non-trivial corpus and real memory pressure. They shine on multi-hop questions and large heterogeneous corpora.
  • HyDE and Multi-Query are the cheapest to test and often produce surprising wins on technical documentation.
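
HyDE's cheapness is easy to see in code: it is one extra generation per query. A minimal sketch assuming the OpenAI client; vector_search is a hypothetical query function over whatever index you already run.

    from openai import OpenAI

    client = OpenAI()

    def hyde_retrieve(question, k=5):
        # 1. Have a small model draft a plausible (possibly wrong) answer passage.
        draft = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Write a short passage that answers: {question}",
            }],
        ).choices[0].message.content

        # 2. Embed the hypothetical document instead of the question.
        emb = client.embeddings.create(
            model="text-embedding-3-small",
            input=draft,
        ).data[0].embedding

        # 3. Retrieve real chunks nearest to the hypothetical one.
        return vector_search(emb, k=k)  # hypothetical: your vector-store query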

Stage 4 — Run the search

Three practical rules save most teams from themselves here:

  1. Cap your spend. Set a budget before the sweep starts. A 72-run grid at 100 questions is easy to eyeball; at 500 questions and 10 axes it is a surprise bill.
  2. Log everything. Every LLM call, every embedding call, every token count and every estimated cost. This is not optional and you do not want to retrofit it. OpenTelemetry plus OTLP export means it fits into whatever stack you already run.
  3. Parallelize, don’t batch-queue. Configurations are independent. Run them in parallel across providers and watch tail latency, not average.
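
Rules 1 and 3 fit in one function. A sketch using only the standard library; run_config and estimated_cost are hypothetical stand-ins for your sweep runner and cost model.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    BUDGET_USD = 80.0

    def run_sweep(configs):
        # Rule 1: cap spend before the first call goes out.
        projected = sum(estimated_cost(c) for c in configs)
        if projected > BUDGET_USD:
            raise RuntimeError(f"Projected ${projected:.0f} exceeds the cap; shrink the grid.")

        # Rule 3: configurations are independent, so fan out rather than queue.
        results = {}
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = {pool.submit(run_config, cfg): i for i, cfg in enumerate(configs)}
            for fut in as_completed(futures):
                results[futures[fut]] = fut.result()  # scores, tokens, cost per run
        return results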

Stage 5 — Read the improvement report

You now have 72 rows of Ragas scores. Don’t eyeball the table. Two moves:

First, sort by the primary metric that matches your failure mode. If your users complain about wrong answers, sort by faithfulness. If they complain about irrelevant answers, sort by answer relevancy. The “best overall configuration” is a fiction until you decide what good looks like.

Second, read the top three configurations side by side and look for what they have in common. If the top three all use recursive chunking at 512 tokens with hybrid retrieval, the lesson is about those settings — not the advanced strategies they each happen to use. This “what the winners share” reading is the single highest-signal output of any grid search and the reason improvement reports exist.
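
Both moves are a few lines of pandas once the sweep writes one row per configuration. The file and column names below are assumptions; match them to whatever your runner emits.

    import pandas as pd

    df = pd.read_csv("sweep_results.csv")  # one row per configuration

    # Move 1: sort by the metric that matches your failure mode.
    top3 = df.sort_values("faithfulness", ascending=False).head(3)

    # Move 2: axes on which all three winners agree are the real lesson.
    for axis in ["chunking", "embeddings", "retrieval", "strategy"]:
        values = top3[axis].unique()
        if len(values) == 1:
            print(f"all top-3 share {axis} = {values[0]}")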

Don’t ship what you hope works.
Ship what you measured.

Stage 6 — Monitor in production

The configuration you pick will drift. Documents change, users ask new kinds of questions, your embedding model silently updates. You need three things running continuously:

  • A canary dataset. Twenty to fifty representative questions that run nightly against production with the full metric suite. Alert on absolute score drops greater than 0.05.
  • OpenTelemetry tracing. Every production request gets a span. This is the difference between “something is wrong” and “retrieval latency on multi-hop queries doubled last Tuesday.”
  • Token and cost dashboards. A run that quietly tripled its context-window usage will bite you at month-end. Catch it in week one.
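
The canary alert rule from the first bullet, in code. run_canary and alert are hypothetical stand-ins: the former reruns the metric suite over the locked canary set, the latter is whatever pages you.

    import json

    DROP_THRESHOLD = 0.05

    with open("canary_baseline.json") as f:
        baseline = json.load(f)           # e.g. {"faithfulness": 0.91, ...}

    current = run_canary()                # hypothetical: nightly metric run
    for metric, base in baseline.items():
        if current[metric] < base - DROP_THRESHOLD:
            alert(f"{metric} dropped {base:.2f} -> {current[metric]:.2f}")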

If you are on a platform that gives you embedding visualization — Arize Phoenix is the obvious example — add drift dashboards on top. If not, a weekly cosine-similarity check against last month’s query distribution gets you most of the way.
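
That fallback check is a handful of lines. A deliberately coarse sketch: it compares mean query embeddings, which catches gross drift but not fine-grained shifts. embed_queries is a hypothetical wrapper around your embedding model.

    import numpy as np

    def drift_score(queries_now, queries_baseline):
        a = np.mean(embed_queries(queries_now), axis=0)       # this week's centroid
        b = np.mean(embed_queries(queries_baseline), axis=0)  # last month's centroid
        cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return 1.0 - cosine  # 0 means no drift

    # e.g. alert when drift_score(this_week, last_month) > 0.05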

Three pitfalls nobody warns you about

Pitfall 1: judge-model bias. Ragas and G-Eval run an LLM as a judge. That judge has preferences: length, tone, citation style. If you use the same model to generate answers and to judge them, you will see inflated scores. Wherever possible, use a judge from a different model family than the generator, as in the sketch below.
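
In Ragas the swap is one argument. A sketch assuming the LangChain wrapper path (ragas.llms.LangchainLLMWrapper), which recent releases expose; confirm against your pinned version. The dataset and metrics are the ones from the Stage 2 sketch.

    from ragas import evaluate
    from ragas.llms import LangchainLLMWrapper
    from langchain_anthropic import ChatAnthropic

    # Answers were generated by one model family; judge with another so the
    # judge's length and tone preferences don't inflate its own family's scores.
    judge = LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-latest"))

    scores = evaluate(dataset, metrics=metrics, llm=judge)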

Pitfall 2: dataset freshness. If your test set was generated six months ago on an older version of the corpus, it is measuring something different now. Version your datasets and regenerate at least quarterly.

Pitfall 3: the “winning” configuration is a local optimum. Grid-search finds the best combination within the grid you defined. Run a second, narrower sweep around the winner with finer-grained axes before you ship.

A minimal shipping checklist

  • 50-200 Q&A pairs, reviewed and locked.
  • The four Ragas core metrics logged, plus one domain-specific rule.
  • 4 × 3 × 3 × 2 grid, roughly 72 runs.
  • Spend cap set and logged.
  • Top-three winners inspected for common traits.
  • Canary dataset deployed to production observability.
  • OTel spans streaming into your existing stack.

If you can tick those seven boxes, you are measuring something real. If you cannot, you are still shipping vibes.


Yoke Agent was built to execute this workflow end-to-end without a notebook in sight. You can see how it runs stages 1 through 5 in the RAG workbench, and stage 6 is the default code path — OTel on, cost tracking on, canary support in the dataset editor. Clone the repo and the first sweep is about ten minutes away.