Why notebooks fail for RAG evaluation (and what to do instead)
Every RAG project starts in a Jupyter notebook. Most of them get stuck there. Here are the five failure modes to watch for — and the migration path when you notice them.
Notebooks are amazing for the first two weeks of a project. They collapse iteration time to zero, they are the right-sized tool for exploratory data work, and every serious data scientist ships some of their best thinking in cells. The problem is that RAG evaluation is not exploratory data work after week two. It is a discipline that benefits from reproducibility, review, and versioning — three things a notebook actively resists.
This is not a rant against notebooks. It is a checklist for when yours has silently stopped being the right tool.
Failure 1: silent state
The canonical notebook bug: you ran cells out of order, cell 12 depends on a variable redefined in cell 7, and the result at the bottom of the notebook is not what would happen if you ran it top-to-bottom. For a one-off experiment this is an annoyance. For evaluation this is catastrophic — it means the score you quoted in Slack does not reproduce.
The test is simple: Kernel → Restart & Run All. If your notebook does not produce the same scores, you have been making decisions based on fiction. This happens more often than people admit.
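If you want that test to run without a human clicking through the menu, one option is to execute the notebook top-to-bottom in a fresh kernel on every change, for example with papermill. A minimal sketch, assuming papermill is installed; the notebook filenames are placeholders:

```python
# Execute the eval notebook top-to-bottom in a fresh kernel and fail loudly
# if any cell errors. Run this in CI or before quoting a score anywhere.
# Assumes the papermill package; the notebook filenames are hypothetical.
import papermill as pm

pm.execute_notebook(
    "eval.ipynb",        # the notebook the quoted scores came from
    "eval_rerun.ipynb",  # the executed copy, safe to diff against the original
    kernel_name="python3",
)
```

If the rerun's scores differ from the ones in the committed notebook, you have found the fiction before Slack did.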
Failure 2: no real reproducibility
Even if your notebook runs clean top-to-bottom today, the question is whether it will run clean in six months. The chain of dependencies is: Python version, library versions, LLM provider behaviour, embedding model version, corpus state, test dataset version. A notebook pins approximately zero of those.
In practice this shows up when someone asks “can you reproduce the 0.87 faithfulness number from Q3?” and the answer is “I’d have to re-tune the environment.” That answer is a failure.
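One low-effort mitigation, short of a full migration, is to write a small manifest next to every eval run so the environment and data behind a number can at least be named later. A sketch, with assumed package names, paths, and fields:

```python
# Record the environment and data that produced a score, alongside the score.
# The package list, file paths, and field names are illustrative placeholders.
import hashlib
import json
import platform
import sys
from importlib import metadata

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

manifest = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {pkg: metadata.version(pkg) for pkg in ("openai", "ragas")},
    "dataset_sha256": sha256_of("eval/qa_testset_v3.jsonl"),
    "embedding_model": "text-embedding-3-small",
}

with open("runs/manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

It cannot pin provider behaviour, but it turns "I'd have to re-tune the environment" into a diff you can read.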
Failure 3: hidden cost
Evaluation runs an LLM call for every metric on every test case. A notebook does not log tokens or cost by default; it just prints a score. You find out what the sweep cost when the provider bill arrives, by which point the sweep can no longer be cancelled.
The notebook-author's fix is usually to instrument manually: a wrapper around the API client, a CSV of token counts, a lunchtime spent reconciling with the provider dashboard. Every notebook does this independently, nobody reuses the wrapper, and the wrapper drifts, so cost tracking ends up simultaneously present and unreliable.
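A shared, boring version of that wrapper is most of the fix. A minimal sketch assuming the openai 1.x Python client; the model prices and log path are placeholders, not real rates:

```python
# One reusable cost-logging wrapper instead of per-notebook instrumentation.
# Prices are assumed placeholders; substitute your provider's actual rates.
import csv
import time
from openai import OpenAI

PRICE_PER_1K_TOKENS = {"gpt-4o-mini": {"in": 0.00015, "out": 0.0006}}  # assumed
client = OpenAI()

def logged_completion(model: str, messages: list[dict], log_path: str = "eval_costs.csv"):
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    cost = (usage.prompt_tokens * PRICE_PER_1K_TOKENS[model]["in"]
            + usage.completion_tokens * PRICE_PER_1K_TOKENS[model]["out"]) / 1000
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [time.time(), model, usage.prompt_tokens, usage.completion_tokens, cost]
        )
    return response
```

The point is not the fifteen lines; it is that they live in one module, get reviewed once, and every notebook imports them.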
Failure 4: no review loop
When was the last time your team code-reviewed a notebook? Most teams don’t, because notebook diffs are user-hostile — base64 outputs, metadata noise, cell ordering changes. Which means eval changes land without scrutiny. New metric thresholds, new judge models, new chunking parameters slip in because nobody reviewed the PR that contained them.
Evaluation is the place you need more review, not less. It is the rubric by which you decide to ship. A workflow that skips review on the rubric is shipping by implication.
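One partial mitigation is to strip outputs and execution counts before committing, so the diff is at least about code. A sketch using nbformat (the filename is a placeholder); dedicated tools exist for this, but the idea fits in a few lines:

```python
# Strip cell outputs and execution counts so notebook diffs show code changes,
# not base64 blobs and metadata churn. The notebook path is a placeholder.
import nbformat

nb = nbformat.read("eval.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "eval.ipynb")
```

That makes the PR reviewable; it does not make anyone review it, which is the deeper problem.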
Failure 5: single-author knowledge
Every team has one person who knows how the eval notebook works. When they are on vacation, nothing runs. When they change jobs, a quarter’s worth of institutional knowledge walks out with them. This is not a notebook problem per se — it is a documentation problem that notebooks make worse, because they conflate code, commentary and configuration into one artifact that only makes sense to the author.
What “instead” looks like
The good news: you can migrate to a reproducible evaluation workflow without giving up the fast iteration of the notebook. The pieces are:
Version the dataset, not the code
Your Q&A test set is the actual artifact. It should live in the repo with a version, a review trail, and an approval gate. Code that uses the dataset can still be exploratory; the dataset itself cannot.
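A sketch of what that approval gate can look like in code, assuming a JSONL test set and a hash pinned at review time; the path, the constant, and the record shape are all placeholders:

```python
# Load the versioned Q&A test set and refuse to evaluate against anything that
# differs from the reviewed copy. PINNED_SHA256 is updated only via a PR.
import hashlib
import json

PINNED_SHA256 = "replace-with-the-hash-approved-in-review"

def load_testset(path: str = "eval/qa_testset_v3.jsonl") -> list[dict]:
    raw = open(path, "rb").read()
    if hashlib.sha256(raw).hexdigest() != PINNED_SHA256:
        raise RuntimeError("Test set does not match the reviewed, approved version")
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]
```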
Use a grid definition, not a loop
Grid definitions should be YAML or JSON, checked into version control, and diff-able. What embeddings, what chunking, what retriever, what metrics — declarative, not imperative. Loops hidden in cells evade review.
```yaml
grid:
  chunking: [fixed_256, fixed_512, recursive_512_overlap_64]
  embedding: [text-embedding-3-small, text-embedding-3-large]
  retriever: [dense, hybrid]
  strategy: [none, hyde]
metrics:
  core: [faithfulness, answer_relevancy, context_precision, context_recall]
  custom: [citation_coverage]
budget_usd: 100
```
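The declarative file also makes the sweep trivially enumerable. A minimal sketch, assuming PyYAML is installed and the file above is saved as grid.yaml:

```python
# Expand the declarative grid into concrete run configurations.
# grid.yaml is the file shown above; the name is an assumption.
import itertools
import yaml

with open("grid.yaml") as f:
    spec = yaml.safe_load(f)

axes = spec["grid"]
runs = [dict(zip(axes, combo)) for combo in itertools.product(*axes.values())]
print(f"{len(runs)} configurations to evaluate")  # 3 * 2 * 2 * 2 = 24 here
```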
Instrument cost from day one
Token and USD cost is logged for every call, automatically. Not a nice-to-have. The sweep refuses to run if the estimated cost exceeds your budget. This single rule prevents the “surprise bill at end of month” failure mode.
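The gate itself does not need to be sophisticated. A sketch, where the per-call cost estimate is an assumption you would calibrate from a small pilot run:

```python
# Refuse to start the sweep if the up-front estimate exceeds the budget from
# the grid file. All numbers here are illustrative placeholders.
EST_COST_PER_METRIC_CALL_USD = 0.002  # assumed; calibrate from a pilot run
n_test_cases, n_configurations, n_metrics = 200, 24, 5

estimated = n_test_cases * n_configurations * n_metrics * EST_COST_PER_METRIC_CALL_USD
budget = 100  # budget_usd from the grid definition
if estimated > budget:
    raise SystemExit(f"Estimated ${estimated:.2f} exceeds budget ${budget}; refusing to run")
```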
Produce an improvement report, not just scores
The output of a sweep should be a document, not a table of numbers. Top-three winners, common traits, explicit recommendation with rationale. A report is something a product manager can read and agree with; a table of floats is not.
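A sketch of the last mile from scores to a readable report, assuming a results CSV with the columns named below; the columns, the equal-weight average, and the paths are placeholders:

```python
# Turn raw sweep scores into a short ranked report with a recommendation,
# rather than handing around a bare table of floats.
import pandas as pd

results = pd.read_csv("runs/sweep_results.csv")  # assumed columns below
metric_cols = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
results["overall"] = results[metric_cols].mean(axis=1)
top3 = results.nlargest(3, "overall")

lines = ["Sweep improvement report", "", "Top three configurations:"]
for _, row in top3.iterrows():
    lines.append(
        f"- {row['chunking']} / {row['embedding']} / {row['retriever']}: "
        f"overall {row['overall']:.3f} at ${row['cost_usd']:.2f}"
    )
lines += ["", "Recommendation: adopt the first configuration unless cost is the binding constraint."]

with open("runs/improvement_report.md", "w") as f:
    f.write("\n".join(lines))
```

The common-traits paragraph and the rationale still have to be written by a person; the sketch just stops that person from starting at a wall of floats.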
Lock the production config
Once you pick a winner, the configuration lives in a config file, not in a notebook cell. Deployment reads from the file; regression tests run against it. The notebook can experiment with alternatives, but changing production requires changing the file, which requires a PR, which requires review.
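A sketch of the lock, assuming the deployed service reads config/production.yaml and this test runs in CI; the field names and winning values are illustrative, not a recommendation:

```python
# Regression test pinning the production RAG configuration. Changing any of
# these values requires editing the file, which requires a reviewed PR.
import yaml

def test_production_config_matches_reviewed_winner():
    with open("config/production.yaml") as f:
        cfg = yaml.safe_load(f)
    assert cfg["chunking"] == "recursive_512_overlap_64"
    assert cfg["embedding"] == "text-embedding-3-large"
    assert cfg["retriever"] == "hybrid"
```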
When to actually migrate
Migrate when at least two of these are true:
- More than one person on the team needs to run the eval.
- You have run the same sweep twice with different results.
- The notebook kernel has crashed on a long-running sweep, losing state mid-run.
- Someone has asked “what did we change between Q3 and Q4” and the answer is not in git.
- The eval is now blocking deploys.
If none of them is true, your notebook is fine. Keep iterating. If two or more are true, the notebook has outlived its usefulness and the marginal hour you spend debugging it would be better spent on the migration.
Yoke Agent is built around these exact principles — versioned datasets, grid YAML, cost-capped sweeps, improvement reports, locked production configs. If you are currently staring at a notebook that has started to feel heavy, docker compose up is a good next step.