Grid-search for RAG: an old technique, retrofitted for a new problem
Grid search is a traditional, exhaustive technique used in machine learning for hyperparameter tuning. It works beautifully on scalars like learning rate or tree depth — and falls apart the moment your “hyperparameters” are entire pipeline components. Here is how Yoke Agent rebuilt it for RAG.
Open any classical machine-learning textbook and grid-search shows up early. The recipe is uncomplicated: enumerate every combination of a few scalar hyperparameters, fit a model on each combination, score them on a held-out set, pick the winner. scikit-learn ships GridSearchCV; XGBoost ships its own cross-validation helper. The technique is sixty years old in spirit and ten lines of Python in practice.
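For the record, the classical version really is that small. A minimal, runnable example against a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Enumerate every combination, cross-validate each, keep the best.
param_grid = {"max_depth": [3, 5, 10], "learning_rate": [0.01, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```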
That recipe also breaks the moment you point it at a retrieval-augmented generation pipeline. The “hyperparameters” aren’t scalars — they’re entire components: a chunking strategy, an embedding model, a retriever, an optional reranker, an optional advanced retrieval pattern. Each “trial” isn’t a .fit() call — it’s a full ingestion + retrieval + generation pipeline that may cost real dollars in API calls. And the score isn’t a deterministic loss — it’s an LLM-as-judge metric with measurable variance across runs.
Building Yoke Agent meant taking the grid-search idea seriously while accepting that none of the classical assumptions hold. This post is the design document.
What classical grid-search assumes
Three implicit assumptions show up everywhere in GridSearchCV-style code:
- Hyperparameters are cheap to vary. Setting max_depth=5 vs max_depth=10 is a keyword argument. You don’t rebuild your dataset; you just call .fit() again.
- The cost of a trial is roughly constant. Every fit costs about the same wall-clock and the same compute. Your total budget is trials × per-trial cost, and the per-trial cost is predictable.
- The score is a deterministic, ground-truth loss. Cross-validated MSE, log-loss, AUC. Same model on same fold gives same score. Variance comes from the data split, not from the metric itself.
All three assumptions are false for RAG. Each one requires the search machinery to do something new.
What changes in RAG
Hyperparameters become components
Switching chunking from fixed-256 to recursive-512-overlap-64 isn’t a constructor argument — it’s a different ingestion pass that produces a different vector store. Switching from OpenAI text-embedding-3-small to bge-large is a different model with a different dimension and different network round-trips. An advanced strategy like CRAG or HyDE introduces extra LLM calls per query. None of this fits in a kwargs dict.
The Yoke design treats each axis as a first-class component. A grid is declared in YAML-ish form:
```yaml
grid:
  chunking: [fixed_256, fixed_512, recursive_512_overlap_64, semantic]
  embedding: [text-embedding-3-small, text-embedding-3-large, bge-large]
  retriever: [dense, hybrid, dense_with_rerank]
  strategy: [none, hyde]
  dataset: docs-eval-v3
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  budget_usd: 80
```
That declaration expands to 4 × 3 × 3 × 2 = 72 configurations. The runner knows how to materialise each one without you writing the if/else tree. Adding a new axis (say, four reranker choices) makes it 288 configurations without touching code.
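Under the hood, the expansion itself is nothing more exotic than a Cartesian product. A minimal sketch (not Yoke’s actual runner) of how the declaration above becomes 72 concrete configurations:

```python
from itertools import product

# The four axes from the YAML above; every value names a component, not a scalar.
grid = {
    "chunking": ["fixed_256", "fixed_512", "recursive_512_overlap_64", "semantic"],
    "embedding": ["text-embedding-3-small", "text-embedding-3-large", "bge-large"],
    "retriever": ["dense", "hybrid", "dense_with_rerank"],
    "strategy": ["none", "hyde"],
}

# Cartesian product over the axes: 4 * 3 * 3 * 2 = 72 configurations.
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
assert len(configs) == 72
```

The loop is the easy part; the work is in making each entry of those dicts materialise a real component, which is what the runner does behind the declaration.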
Trials cost real money
A 72-configuration grid against a 100-question dataset with the four RAGAS core metrics is roughly 7,200 retrievals plus 7,200 judge calls. On OpenAI flex-tier pricing that is $30-$80; on standard tier it can hit $400. Classical grid-search ignores cost because the marginal cost of a trial is your laptop’s electricity. We can’t.
So Yoke estimates cost before the sweep starts. Token counts per axis combination are approximated from the dataset and the configured providers, multiplied by the providers’ published rates, and totalled. The runner refuses to start if the estimate exceeds budget_usd. During the run, every LLM and embedding call is logged with token counts and USD cost; the dashboard shows the burn-down in real time. Flex-tier discounts are auto-detected from provider responses, so the bookkeeping stays honest.
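A sketch of that pre-flight check, with placeholder token counts and a single blended rate standing in for the real per-provider arithmetic:

```python
def estimate_sweep_cost_usd(
    n_configs: int,
    n_questions: int,
    tokens_per_retrieval: int = 1_500,   # placeholder averages, not
    tokens_per_judge_call: int = 2_000,  # measured from any real corpus
    usd_per_1m_tokens: float = 2.00,     # blended placeholder rate
) -> float:
    """Every (config, question) pair costs one retrieval-plus-generation
    pass and one judge call; price the total token volume up front."""
    trials = n_configs * n_questions
    total_tokens = trials * (tokens_per_retrieval + tokens_per_judge_call)
    return total_tokens / 1_000_000 * usd_per_1m_tokens

budget_usd = 80
estimate = estimate_sweep_cost_usd(n_configs=72, n_questions=100)  # ~$50
if estimate > budget_usd:
    raise RuntimeError(f"estimated ${estimate:.2f} exceeds budget ${budget_usd}")
```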
The score isn’t deterministic
RAGAS metrics, like G-Eval rubrics for agents, run an LLM as a judge. Same input twice can produce slightly different scores. This is fine for relative rankings — the noise is small relative to the signal between configurations — but lethal for absolute claims. A faithfulness of 0.87 from one judge is not directly comparable to 0.87 from another, and even the same judge will swing 0.02-0.04 across re-runs.
The Yoke approach: trust relative rankings, distrust absolutes. The leaderboard sorts configurations by score, not by score thresholds. The improvement report calls out what the top three configurations have in common, not which one is “the best” in some absolute sense. Where deterministic scoring is possible — tool-call accuracy parsed from TOOL_CALL invocations, entity recall via NER — Yoke uses it as a stable anchor that doesn’t drift across judge changes.
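As an illustration of such an anchor, here is a deterministic tool-call accuracy check. The TOOL_CALL line format shown is hypothetical; the point is that the same transcript always scores the same, with no judge in the loop:

```python
import re

# Hypothetical transcript convention: lines like "TOOL_CALL search(query='...')".
TOOL_CALL_RE = re.compile(r"TOOL_CALL\s+(\w+)\(")

def tool_call_accuracy(transcript: str, expected_tools: list[str]) -> float:
    """Fraction of expected tool invocations that appear in the transcript.
    Purely string matching: re-running it can never move the score."""
    called = set(TOOL_CALL_RE.findall(transcript))
    if not expected_tools:
        return 1.0
    return sum(t in called for t in expected_tools) / len(expected_tools)
```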
What Yoke adds that classical grid-search doesn’t
Improvement reports, not just leaderboards
GridSearchCV.best_params_ hands you a dict. That is sufficient when the parameters are scalars and the relationship between them is simple. For RAG, the interesting question is rarely “which exact combination won” — it is “what do the winners share, and what does that imply for the next sweep?” If the top five all use recursive chunking with 64-token overlap regardless of embedding model, the lesson is about chunking, not about any individual cell.
Yoke generates that reading automatically. The improvement report is a plain-language document: top configurations ranked, common traits highlighted, next-grid suggestion at the bottom. It is the artifact a product manager can read and agree with, where a 72-row table of floats is not.
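The trait-mining step behind that report is easy to sketch. The leaderboard shape below (a best-first list of axis-to-component dicts plus a score) is illustrative, not Yoke’s internal schema:

```python
def shared_traits(leaderboard: list[dict], top_n: int = 5) -> dict:
    """Return the axis values that every one of the top-n configurations
    agrees on -- the 'what do the winners share' part of the report."""
    top = leaderboard[:top_n]
    traits = {}
    for axis in top[0]:
        if axis == "score":
            continue
        values = {cfg[axis] for cfg in top}
        if len(values) == 1:          # unanimous across the winners
            traits[axis] = values.pop()
    return traits

# e.g. {'chunking': 'recursive_512_overlap_64'} when only chunking is unanimous
```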
Human-in-the-loop at every gate
Classical grid-search is a black box: make the call, get a number back. RAG evaluation is too consequential and too judge-bound to run that way. Yoke inserts approval gates at three places:
- The dataset generated from your corpus is editable and must be approved (locked) before sweeps can use it.
- The grid definition can be drafted by an LLM that proposes axes from your corpus shape, but a human approves it before any LLM call goes out.
- The improvement report is a recommendation, not a deploy command. The winning configuration only ships when a human writes it to the production config.
The philosophy is “LLM as a fast junior analyst,” not “LLM as an unchecked oracle.” Each gate costs roughly a minute and prevents the kind of silent drift that turns measurement into theatre.
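A sketch of the first gate, with a hypothetical Dataset type; the essential property is that nothing flips the approved flag except an explicit human action:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    approved: bool = False  # set True only by an explicit human approval step

def start_sweep(grid: dict, dataset: Dataset) -> None:
    # Gate 1: refuse to spend a single token against an unapproved dataset.
    if not dataset.approved:
        raise PermissionError(
            f"dataset {dataset.name!r} is not locked; review and approve it first"
        )
    ...  # expand the grid, estimate cost, run trials
```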
Observability in the default code path
GridSearchCV doesn’t emit OpenTelemetry spans because nobody asked for them on a laptop. Yoke does, because RAG sweeps are server work that needs to fit in the observability stack you already operate. Every retrieval, every embedding call, every judge call gets a span; OTLP export pipes them straight to Grafana, Honeycomb, or Datadog. The same telemetry that proves the sweep ran also tells you which configurations were slow.
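The wiring is standard OpenTelemetry (the opentelemetry-sdk and OTLP exporter packages); the span and attribute names below are illustrative rather than Yoke’s actual schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# OTLP export defaults to localhost:4317; point it at your collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-sweep")

def run_trial(config: dict, question: str) -> None:
    with tracer.start_as_current_span("trial") as span:
        span.set_attribute("rag.chunking", config["chunking"])
        span.set_attribute("rag.retriever", config["retriever"])
        ...  # retrieval, generation, judging each get a child span here
```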
A worked grid, end to end
A typical first sweep for a new RAG project at Yoke looks like this:
```yaml
grid:
  chunking: [fixed_256, fixed_512, recursive_512_overlap_64]
  embedding: [text-embedding-3-small, text-embedding-3-large]
  retriever: [dense, hybrid]
  strategy: [none, hyde]
  dataset: docs-eval-v1   # 80 LLM-generated Q&A, human-reviewed
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  judge: gpt-4.1-mini     # different family from the generator
  budget_usd: 50
```
3 × 2 × 2 × 2 = 24 configurations. With 80 questions and the RAGAS core, that is 1,920 retrievals and 1,920 judge calls. On flex-tier pricing the estimate comes in at roughly $35; the runner accepts. After about 25 minutes the leaderboard is ready. The improvement report says something like:
The top three configurations all use recursive_512_overlap_64 chunking. Embedding choice contributed less than 0.03 to the average score. Hybrid retrieval wins by 0.04 over dense alone. HyDE helps multi-hop questions (+0.06) but makes single-hop answers slightly worse (-0.02).

Recommended configuration to ship: recursive_512_overlap_64 + text-embedding-3-large + hybrid + HyDE. Suggested next sweep: vary top-k around the chosen retriever to confirm.
That is the recommendation; a human writes it into the production config and the canary dataset starts running it nightly with the same metrics.
Three pitfalls the technique inherits
The winner is a local optimum
Grid-search finds the best cell inside the grid you defined. If you didn’t test top-k = 10, you don’t know whether the winner would be even better with more context. Yoke softens this by suggesting a narrower follow-up sweep around the winner, but the problem is fundamental to the technique. Random search and Bayesian optimisation handle high-dimensional spaces better; for the small grids most teams actually run, neither pays off.
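A hypothetical follow-up grid makes that narrower sweep concrete: freeze the winning components, vary only the axis you have not yet tested:

```yaml
grid:
  chunking: [recursive_512_overlap_64]    # frozen winner
  embedding: [text-embedding-3-large]     # frozen winner
  retriever: [hybrid]                     # frozen winner
  strategy: [hyde]
  top_k: [3, 5, 10, 20]                   # the one new axis under test
  dataset: docs-eval-v1
  metrics: [faithfulness, answer_relevancy, context_precision, context_recall]
  budget_usd: 15
```

Four configurations instead of 24, against the same questions: cheap enough to run as a matter of course before shipping the winner.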
Same-model judging inflates scores
If your generator is GPT-4 class and your judge is also GPT-4 class, expect inflated faithfulness scores — the judge tends to over-reward outputs that look like outputs it would produce. Yoke warns when judge and generator share a model family. Cross-family judging is the safe default.
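That warning is cheap to implement. A sketch using crude prefix matching; real detection would map model names to provider families explicitly:

```python
def model_family(model: str) -> str:
    # "gpt-4.1-mini" -> "gpt"; crude, but catches the common same-family case
    return model.split("-")[0].lower()

generator, judge = "gpt-4.1", "gpt-4.1-mini"
if model_family(generator) == model_family(judge):
    print("warning: judge shares a model family with the generator; "
          "faithfulness scores may be inflated")
```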
The grid grows fast
Adding one axis multiplies the cost. 4 × 3 × 3 × 2 = 72 trials. Add four rerankers and four advanced strategies and it becomes 4 × 3 × 3 × 4 × 4 = 576 trials. Cost-cap your sweeps; do narrow exploration with small datasets first; expand only after the report points to a region worth zooming in on.
Bottom line
Grid-search is the right mental model for picking a RAG configuration. The classical implementation isn’t. What Yoke Agent ships is the same technique with the assumptions updated for the actual job: components instead of scalars, budgeted instead of free, judge-aware instead of deterministic, and reportable instead of just rankable.
If you want to see the workflow against your own corpus, clone Yoke Agent and the first sweep is about ten minutes away. The end-to-end evaluation guide walks through every stage; the chunking benchmark post shows what a real grid-search output looks like with numbers and findings.