Grid-search for RAG: an old technique, retrofitted for a new problem
Grid search is a traditional, exhaustive technique used in machine learning for hyperparameter tuning. It works beautifully on scalars like learning rate or tree depth — and falls apart the moment your “hyperparameters” are entire pipeline components. Here is how Yoke Agent rebuilt it for RAG.
Open any classical machine-learning textbook and grid-search shows up early. The recipe is uncomplicated: enumerate every combination of a few scalar hyperparameters, fit a model on each combination, score them on a held-out set, pick the winner. scikit-learn ships GridSearchCV; XGBoost ships its own cross-validation helper. The technique is sixty years old in spirit and ten lines of Python in practice.
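For the record, the classical version really is that small. A minimal, runnable example against a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Enumerate every combination, cross-validate each, keep the best.
param_grid = {"max_depth": [3, 5, 10], "learning_rate": [0.01, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```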
That recipe also breaks the moment you point it at a retrieval-augmented generation pipeline. The “hyperparameters” aren’t scalars — they’re entire components: a chunking strategy, an embedding model, a retriever, an optional reranker, an optional advanced retrieval pattern. Each “trial” isn’t a .fit() call — it’s a full ingestion + retrieval + generation pipeline that may cost real dollars in API calls. And the score isn’t a deterministic loss — it’s an LLM-as-judge metric with measurable variance across runs.
Building Yoke Agent meant taking the grid-search idea seriously while accepting that none of the classical assumptions hold. This post is the design document.
What classical grid-search assumes
Three implicit assumptions show up everywhere in GridSearchCV-style code:
- Hyperparameters are cheap to vary. Setting max_depth=5 vs max_depth=10 is a keyword argument. You don’t rebuild your dataset; you just call .fit() again.
- The cost of a trial is roughly constant. Every fit costs about the same wall-clock and the same compute. Your total budget is trials × per-trial cost, and the per-trial cost is predictable.
- The score is a deterministic, ground-truth loss. Cross-validated MSE, log-loss, AUC. Same model on same fold gives same score. Variance comes from the data split, not from the metric itself.
All three assumptions are false for RAG. Each one requires the search machinery to do something new.
What changes in RAG
Hyperparameters become components
Switching chunking from fixed-256 to recursive-512-overlap-64 isn’t a constructor argument — it’s a different ingestion pass that produces a different vector store. Switching from OpenAI text-embedding-3-small to bge-large is a different model with a different dimension and different network round-trips. An advanced strategy like CRAG or HyDE introduces extra LLM calls per query. None of this fits in a kwargs dict.
The Yoke design treats each axis as a first-class component. A grid is declared in YAML-ish form:
```yaml
grid:
  chunking: [fixed_256, fixed_512, recursive_512_overlap_64, semantic]
  embedding: [text-embedding-3-small, text-embedding-3-large, bge-large]
  retriever: [dense, hybrid, dense_with_rerank]
  strategy: [none, hyde]
  dataset: docs-eval-v3
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  budget_usd: 80
```
That declaration expands to 4 × 3 × 3 × 2 = 72 configurations. The runner knows how to materialise each one without you writing the if/else tree. Adding a new axis (say, four reranker choices) makes it 288 configurations without touching code.
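Under the hood, the expansion itself is nothing more exotic than a Cartesian product. A minimal sketch (not Yoke’s actual runner) of how the declaration above becomes 72 concrete configurations:

```python
from itertools import product

# The four axes from the YAML above; every value names a component, not a scalar.
grid = {
    "chunking": ["fixed_256", "fixed_512", "recursive_512_overlap_64", "semantic"],
    "embedding": ["text-embedding-3-small", "text-embedding-3-large", "bge-large"],
    "retriever": ["dense", "hybrid", "dense_with_rerank"],
    "strategy": ["none", "hyde"],
}

# Cartesian product over the axes: 4 * 3 * 3 * 2 = 72 configurations.
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
assert len(configs) == 72
```

The loop is the easy part; the work is in making each entry of those dicts materialise a real component, which is what the runner does behind the declaration.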
Trials cost real money
A 72-configuration grid against a 100-question dataset with the four RAGAS core metrics is roughly 7,200 retrievals plus 7,200 judge calls. On OpenAI flex-tier pricing that is $30-$80; on standard tier it can hit $400. Classical grid-search ignores cost because the marginal cost of a trial is your laptop’s electricity. We can’t.
So Yoke estimates cost before the sweep starts. Token counts per axis combination are approximated from the dataset and the configured providers, multiplied by the providers’ published rates, and totalled. The runner refuses to start if the estimate exceeds budget_usd. During the run, every LLM and embedding call is logged with token counts and USD cost; the dashboard shows the burn-down in real time. Flex-tier discounts are auto-detected from provider responses, so the bookkeeping stays honest.
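A sketch of that pre-flight check, with placeholder token counts and a single blended rate standing in for the real per-provider arithmetic:

```python
def estimate_sweep_cost_usd(
    n_configs: int,
    n_questions: int,
    tokens_per_retrieval: int = 1_500,   # placeholder averages, not
    tokens_per_judge_call: int = 2_000,  # measured from any real corpus
    usd_per_1m_tokens: float = 2.00,     # blended placeholder rate
) -> float:
    """Every (config, question) pair costs one retrieval-plus-generation
    pass and one judge call; price the total token volume up front."""
    trials = n_configs * n_questions
    total_tokens = trials * (tokens_per_retrieval + tokens_per_judge_call)
    return total_tokens / 1_000_000 * usd_per_1m_tokens

budget_usd = 80
estimate = estimate_sweep_cost_usd(n_configs=72, n_questions=100)  # ~$50
if estimate > budget_usd:
    raise RuntimeError(f"estimated ${estimate:.2f} exceeds budget ${budget_usd}")
```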
The score isn’t deterministic
RAGAS metrics, like G-Eval rubrics for agents, run an LLM as a judge. Same input twice can produce slightly different scores. This is fine for relative rankings — the noise is small relative to the signal between configurations — but lethal for absolute claims. A faithfulness of 0.87 from one judge is not directly comparable to 0.87 from another, and even the same judge will swing 0.02-0.04 across re-runs.
The Yoke approach: trust relative rankings, distrust absolutes. The leaderboard sorts configurations by score, not by score thresholds. The improvement report calls out what the top three configurations have in common, not which one is “the best” in some absolute sense. Where deterministic scoring is possible — tool-call accuracy parsed from TOOL_CALL invocations, entity recall via NER — Yoke uses it as a stable anchor that doesn’t drift across judge changes.
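As an illustration of such an anchor, here is a deterministic tool-call accuracy check. The TOOL_CALL line format shown is hypothetical; the point is that the same transcript always scores the same, with no judge in the loop:

```python
import re

# Hypothetical transcript convention: lines like "TOOL_CALL search(query='...')".
TOOL_CALL_RE = re.compile(r"TOOL_CALL\s+(\w+)\(")

def tool_call_accuracy(transcript: str, expected_tools: list[str]) -> float:
    """Fraction of expected tool invocations that appear in the transcript.
    Purely string matching: re-running it can never move the score."""
    called = set(TOOL_CALL_RE.findall(transcript))
    if not expected_tools:
        return 1.0
    return sum(t in called for t in expected_tools) / len(expected_tools)
```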
What Yoke adds that classical grid-search doesn’t
Improvement reports, not just leaderboards
GridSearchCV.best_params_ hands you a dict. That is sufficient when the parameters are scalars and the relationship between them is simple. For RAG, the interesting question is rarely “which exact combination won” — it is “what do the winners share, and what does that imply for the next sweep?” If the top five all use recursive chunking with 64-token overlap regardless of embedding model, the lesson is about chunking, not about any individual cell.
Yoke generates that reading automatically. The improvement report is a plain-language document: top configurations ranked, common traits highlighted, next-grid suggestion at the bottom. It is the artifact a product manager can read and agree with, where a 72-row table of floats is not.
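The trait-mining step behind that report is easy to sketch. The leaderboard shape below (a best-first list of axis-to-component dicts plus a score) is illustrative, not Yoke’s internal schema:

```python
def shared_traits(leaderboard: list[dict], top_n: int = 5) -> dict:
    """Return the axis values that every one of the top-n configurations
    agrees on -- the 'what do the winners share' part of the report."""
    top = leaderboard[:top_n]
    traits = {}
    for axis in top[0]:
        if axis == "score":
            continue
        values = {cfg[axis] for cfg in top}
        if len(values) == 1:          # unanimous across the winners
            traits[axis] = values.pop()
    return traits

# e.g. {'chunking': 'recursive_512_overlap_64'} when only chunking is unanimous
```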
Human-in-the-loop at every gate
Classical grid-search is a black box: make the call, get a number back. RAG evaluation is too consequential and too judge-bound to run that way. Yoke inserts approval gates at three places:
- The dataset generated from your corpus is editable and must be approved (locked) before sweeps can use it.
- The grid definition can be drafted by an LLM that proposes axes from your corpus shape, but a human approves it before any LLM call goes out.
- The improvement report is a recommendation, not a deploy command. The winning configuration only ships when a human writes it to the production config.
The philosophy is “LLM as a fast junior analyst,” not “LLM as an unchecked oracle.” Each gate costs roughly a minute and prevents the kind of silent drift that turns measurement into theatre.
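A sketch of the first gate, with a hypothetical Dataset type; the essential property is that nothing flips the approved flag except an explicit human action:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    approved: bool = False  # set True only by an explicit human approval step

def start_sweep(grid: dict, dataset: Dataset) -> None:
    # Gate 1: refuse to spend a single token against an unapproved dataset.
    if not dataset.approved:
        raise PermissionError(
            f"dataset {dataset.name!r} is not locked; review and approve it first"
        )
    ...  # expand the grid, estimate cost, run trials
```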
Observability in the default code path
GridSearchCV doesn’t emit OpenTelemetry spans because nobody asked for them on a laptop. Yoke does, because RAG sweeps are server work that needs to fit in the observability stack you already operate. Every retrieval, every embedding call, every judge call gets a span; OTLP export pipes them straight to Grafana, Honeycomb, or Datadog. The same telemetry that proves the sweep ran also tells you which configurations were slow.
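The wiring is standard OpenTelemetry (the opentelemetry-sdk and OTLP exporter packages); the span and attribute names below are illustrative rather than Yoke’s actual schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# OTLP export defaults to localhost:4317; point it at your collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-sweep")

def run_trial(config: dict, question: str) -> None:
    with tracer.start_as_current_span("trial") as span:
        span.set_attribute("rag.chunking", config["chunking"])
        span.set_attribute("rag.retriever", config["retriever"])
        ...  # retrieval, generation, judging each get a child span here
```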
A worked grid, end to end
A typical first sweep for a new RAG project at Yoke looks like this:
```yaml
grid:
  chunking: [fixed_256, fixed_512, recursive_512_overlap_64]
  embedding: [text-embedding-3-small, text-embedding-3-large]
  retriever: [dense, hybrid]
  strategy: [none, hyde]
  dataset: docs-eval-v1   # 80 LLM-generated Q&A, human-reviewed
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  judge: gpt-4.1-mini     # different family from the generator
  budget_usd: 50
```
3 × 2 × 2 × 2 = 24 configurations. With 80 questions and the RAGAS core, that is 1,920 retrievals and 1,920 judge calls. On flex-tier pricing the estimate comes in at roughly $35; the runner accepts. After about 25 minutes the leaderboard is ready. The improvement report says something like:
The top three configurations all use recursive_512_overlap_64 chunking. Embedding choice contributed less than 0.03 to the average score. Hybrid retrieval wins by 0.04 over dense alone. HyDE helps multi-hop questions (+0.06) but makes single-hop answers slightly worse (-0.02).

Recommended configuration to ship: recursive_512_overlap_64 + text-embedding-3-large + hybrid + HyDE. Suggested next sweep: vary top-k around the chosen retriever to confirm.
That is the recommendation; a human writes it into the production config and the canary dataset starts running it nightly with the same metrics.
Three pitfalls the technique inherits
The winner is a local optimum
Grid-search finds the best cell inside the grid you defined. If you didn’t test top-k = 10, you don’t know whether the winner would be even better with more context. Yoke softens this by suggesting a narrower follow-up sweep around the winner, but the problem is fundamental to the technique. Random search and Bayesian optimisation handle high-dimensional spaces better; for the small grids most teams actually run, neither pays off.
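A hypothetical follow-up grid makes that narrower sweep concrete: freeze the winning components, vary only the axis you have not yet tested:

```yaml
grid:
  chunking: [recursive_512_overlap_64]    # frozen winner
  embedding: [text-embedding-3-large]     # frozen winner
  retriever: [hybrid]                     # frozen winner
  strategy: [hyde]
  top_k: [3, 5, 10, 20]                   # the one new axis under test
  dataset: docs-eval-v1
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  budget_usd: 15
```

Four configurations instead of 24, against the same questions: cheap enough to run as a matter of course before shipping the winner.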
Same-model judging inflates scores
If your generator is GPT-4 class and your judge is also GPT-4 class, expect inflated faithfulness scores — the judge tends to over-reward outputs that look like outputs it would produce. Yoke warns when judge and generator share a model family. Cross-family judging is the safe default.
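That warning is cheap to implement. A sketch using crude prefix matching; real detection would map model names to provider families explicitly:

```python
def model_family(model: str) -> str:
    # "gpt-4.1-mini" -> "gpt"; crude, but catches the common same-family case
    return model.split("-")[0].lower()

generator, judge = "gpt-4.1", "gpt-4.1-mini"
if model_family(generator) == model_family(judge):
    print("warning: judge shares a model family with the generator; "
          "faithfulness scores may be inflated")
```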
The grid grows fast
Adding one axis multiplies the cost. 4 × 3 × 3 × 2 = 72 trials. Add four rerankers and four advanced strategies and it becomes 4 × 3 × 3 × 4 × 4 = 576 trials. Cost-cap your sweeps; do narrow exploration with small datasets first; expand only after the report points to a region worth zooming in on.
Bottom line
Grid-search is the right mental model for picking a RAG configuration. The classical implementation isn’t. What Yoke Agent ships is the same technique with the assumptions updated for the actual job: components instead of scalars, budgeted instead of free, judge-aware instead of deterministic, and reportable instead of just rankable.
If you want to see the workflow against your own corpus, clone Yoke Agent and the first sweep is about ten minutes away. The end-to-end evaluation guide walks through every stage; the chunking benchmark post shows what a real grid-search output looks like with numbers and findings.