Essays from the jig.
Field notes on evaluating retrieval-augmented generation, stress-testing agents, and making “which configuration should we ship?” a question with a numeric answer.
Grid-search for RAG: an old technique, retrofitted for a new problem
Grid search is a 60-year-old hyperparameter-tuning technique. Applying it to RAG required rethinking what a hyperparameter even is — here is how Yoke Agent does it.
How to evaluate a RAG pipeline end-to-end in 2026
A pillar guide to the six stages of real-world RAG evaluation: datasets, RAGAS metrics, grid-search axes, improvement reports, and production monitoring.
DeepEval vs Yoke Agent: honest comparison
Where DeepEval wins, where Yoke Agent wins, and why most serious teams end up using both together.
The 14 agent evaluation metrics Yoke ships (and why)
Every G-Eval rubric metric Yoke Agent implements, with definitions, formulas, and when to reach for each one.
Benchmarking chunking strategies on a real corpus
Grid-searching four chunking strategies against a 500-document technical corpus: the numbers you need to pick one, plus the findings that surprised us.
Self-hosted LLM evaluation: a 2026 guide
Why self-hosted evaluation matters in 2026, what to demand from the tooling, and how to migrate off a SaaS platform without losing your history.
Why notebooks fail for RAG evaluation (and what to do instead)
Five failure modes of notebook-driven RAG evaluation, and a practical migration path to reproducible grid-search.