Essays from the jig.
Field notes on evaluating retrieval-augmented generation, stress-testing agents, and making “which configuration should we ship?” a question with a numeric answer.
Grid-search for RAG: an old technique, retrofitted for a new problem
Grid search is a 60-year-old hyperparameter-tuning technique. Applying it to RAG required rethinking what a hyperparameter even is — here is how Yoke Agent does it.
How to evaluate a RAG pipeline end-to-end in 2026
A pillar guide to the six stages of real-world RAG evaluation: datasets, RAGAS metrics, grid-search axes, improvement reports, and production monitoring.
DeepEval vs Yoke Agent: honest comparison
Where DeepEval wins, where Yoke Agent wins, and why most serious teams end up using both together.
The 14 agent evaluation metrics Yoke ships (and why)
Every G-Eval rubric metric Yoke Agent implements, with definitions, formulas, and when to reach for each one.
Benchmarking chunking strategies on a real corpus
Grid-searching four chunking strategies against a 500-document technical corpus: the numbers you need to pick one, plus the findings that surprised us.
Self-hosted LLM evaluation: a 2026 guide
Why self-hosted evaluation matters in 2026, what to demand from the tooling, and how to migrate off a SaaS platform without losing your history.
Why notebooks fail for RAG evaluation (and what to do instead)
Five failure modes of notebook-driven RAG evaluation, and a practical migration path to reproducible grid-search.