Benchmark · 31 Mar 2026 · 9 min read

Benchmarking chunking strategies on a real corpus.

Four strategies, three embedding models, one corpus of 500 technical documents, 100 evaluation questions. Here are the scores — and the three findings that surprised us.

Chunking is the knob everyone tries first and understands least. Recommendations range from “512 tokens, no overlap, ship it” to multi-hour blog posts arguing over semantic segmentation. The only way to know what works on your corpus is to measure. So we did, against a realistic technical corpus, with the four strategies that actually differ in practice.

This post is the numeric result plus the qualitative lessons. The dataset, config, and raw runs are in a companion GitHub repository you can clone and reproduce.

The setup

  • Corpus. 500 technical Markdown documents totalling ~3.2M tokens. Mixed formats: API references, conceptual guides, changelog entries, troubleshooting posts. Designed to resemble a realistic product documentation set.
  • Evaluation dataset. 100 Q&A pairs, LLM-generated against the corpus and pruned by two humans. 30 single-hop, 40 multi-hop, 30 requiring reasoning across sections.
  • Metrics. RAGAS core four — faithfulness, answer relevancy, context precision, context recall — reported as arithmetic means (see the scoring sketch after this list).
  • Judge model. gpt-4.1 on OpenAI flex tier, different from the generator to avoid same-model bias.
  • Generator. claude-sonnet-4.5 with temperature 0.2. Same generator across every cell in the grid.
  • Retriever. Dense only, top-k = 5. Held fixed to isolate the chunking effect.
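For concreteness, here is a minimal sketch of the scoring step, assuming the ragas library's evaluate entry point and a datasets-style table with question, answer, contexts, and ground-truth columns. Column names vary across ragas versions, so treat this as illustrative rather than our exact harness; the example row is hypothetical.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One row per evaluation question; contexts holds the top-k retrieved chunks.
rows = {
    "question": ["How do I rotate an API key?"],  # hypothetical example
    "answer": ["Call POST /v1/keys/rotate with the old key id."],
    "contexts": [["To rotate a key, send POST /v1/keys/rotate ..."]],
    "ground_truth": ["Use the POST /v1/keys/rotate endpoint."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric means for this configuration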

The four chunking strategies

  1. Fixed 256. Hard split every 256 tokens, no overlap. The baseline.
  2. Fixed 512. Same idea, larger window. The community default.
  3. Semantic. Split on semantic boundaries using embedding-based sentence clustering. Variable chunk size in the 200-600 token range.
  4. Recursive with overlap. LangChain’s recursive splitter at 512 tokens with 64-token overlap, respecting Markdown structure (headings preserved). A sketch of the splitter configurations follows this list.
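For reference, here is roughly how strategies 1, 2, and 4 can be instantiated with LangChain's text splitters. This is a sketch assuming the langchain-text-splitters package; the exact grid code lives in the repo, and the corpus file path below is hypothetical.

from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Strategies 1 and 2: hard token windows, no overlap, no boundary preference.
fixed_256 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=256, chunk_overlap=0,
    separators=[""],  # split anywhere: a pure hard cut
)
fixed_512 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=512, chunk_overlap=0,
    separators=[""],
)

# Strategy 4: 512-token chunks, 64-token overlap, preferring Markdown
# boundaries (headings, blank lines) before falling back to hard cuts.
recursive_512_64 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=512, chunk_overlap=64,
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
)

doc = Path("corpus/example-guide.md").read_text()  # hypothetical corpus file
chunks = recursive_512_64.split_text(doc)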

Three embedding models

  • text-embedding-3-small (OpenAI, 1536 dim, cheap)
  • text-embedding-3-large (OpenAI, 3072 dim, premium)
  • bge-large-en-v1.5 (local, 1024 dim)

That is 4 × 3 = 12 configurations; at 100 questions each, that makes 1,200 retrievals per sweep and ~6,000 judge calls across the four RAGAS metrics. Total cost for the sweep on flex-tier pricing: $47.
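The sweep's outer loop is just the cross product. A sketch — retrieve and eval_questions are hypothetical stand-ins for the pipeline in the repo:

from itertools import product

CHUNKERS = ["fixed-256", "fixed-512", "semantic", "recursive-512-overlap-64"]
EMBEDDERS = ["text-embedding-3-small", "text-embedding-3-large",
             "bge-large-en-v1.5"]

for chunker, embedder in product(CHUNKERS, EMBEDDERS):  # 12 configurations
    for question in eval_questions:  # 100 Q&A pairs -> 1,200 retrievals
        contexts = retrieve(chunker, embedder, question, top_k=5)
        # ... generate with claude-sonnet-4.5, then judge with gpt-4.1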

Results

Scores are the average across all four metrics; higher is better. See the repo for per-metric breakdowns.

Chunking                   embedding-3-small   embedding-3-large   bge-large
Fixed 256                  0.71                0.74                0.72
Fixed 512                  0.78                0.82                0.79
Semantic                   0.79                0.83                0.80
Recursive-512-overlap-64   0.82                0.86                0.83

Three findings that surprised us

Finding 1: chunking choice matters more than embedding choice

Across the grid, the chunking-strategy delta is 0.11 (0.71 worst to 0.82 best on small embeddings). The embedding-model delta, holding chunking fixed, is only 0.04 (0.82 to 0.86 on the winning chunker). For this corpus, you get more mileage from spending an afternoon on chunking than from upgrading your embedding model.

That flips the typical tuning instinct. Most teams upgrade embeddings first because it’s a one-line change. Fix chunking before you touch embeddings.

Finding 2: semantic chunking is not magic

Semantic chunking scored 0.79-0.83 depending on embedding — better than the fixed-256 and fixed-512 baselines, but worse than recursive-with-overlap. The theoretical argument for semantic chunking (“splits naturally where meaning changes”) is real, but for a Markdown-structured corpus, structural chunking that respects headings carries more signal.

Semantic chunking also costs you another embedding pass on the ingestion side. For this corpus that added ~$18 to the indexing bill. Not prohibitive, but not free.
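Our semantic splitter is custom sentence clustering, but the off-the-shelf equivalent illustrates where that extra cost comes from. A sketch using langchain_experimental's SemanticChunker — an approximation, since it uses embedding-distance breakpoints rather than our clustering:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Every sentence gets embedded just to decide the split points --
# that is the extra ingestion-side embedding pass (~$18 on this corpus).
chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
)
chunks = chunker.split_text(doc)  # doc: one corpus file's text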

Finding 3: overlap is the cheapest win

The delta between fixed-512 (0.82 on large embeddings) and recursive-512-overlap-64 (0.86) is 0.04 — the same magnitude as upgrading from small to large embeddings. A 64-token overlap is basically free at ingestion time (the stride drops from 512 to 448 tokens, so you embed roughly 14% more chunks) and materially improves context recall on multi-hop questions, because the retriever can grab adjacent chunks that share a claim.

Specifically, context recall on the 40 multi-hop questions went from 0.73 (fixed 512, no overlap) to 0.81 (recursive-512-overlap-64). That is the biggest single movement we saw in the sweep.

Per-metric breakdown on the winner

The full score card for recursive-512-overlap-64 with text-embedding-3-large:

Metric              Mean   Single-hop   Multi-hop   Reasoning
Faithfulness        0.88   0.91         0.87        0.86
Answer relevancy    0.90   0.94         0.89        0.86
Context precision   0.82   0.89         0.81        0.75
Context recall      0.85   0.93         0.81        0.78

As expected, reasoning-heavy questions score lowest across every metric. The fix is almost never chunking — it is usually an advanced retrieval strategy (HyDE or a multi-query variant). That will be the next benchmark in this series.

Reproducing this

The full artifact is at github.com/Empreiteiro/yoke-agent — the corpus references, the grid definition YAML, the raw JSONL runs, and a notebook that reproduces the table above. Typical reproduction on a laptop with flex-tier keys takes about 40 minutes and costs under $50.

git clone https://github.com/Empreiteiro/yoke-agent.git
cd yoke-agent
make dev
yoke benchmark chunking-2026-q1 --provider openai --flex
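If you only want the table without rerunning retrieval, the raw runs pivot straight into it. A sketch — the file name and field names here are hypothetical, so check the repo's notebook for the exact schema:

import json
import pandas as pd

# Load the raw per-question results (hypothetical path and schema).
rows = [json.loads(line) for line in open("runs/chunking-2026-q1.jsonl")]
df = pd.DataFrame(rows)

# Mean of the four RAGAS metrics, pivoted into the results table above.
metrics = ["faithfulness", "answer_relevancy",
           "context_precision", "context_recall"]
df["mean_score"] = df[metrics].mean(axis=1)
print(df.pivot_table(index="chunking", columns="embedding",
                     values="mean_score").round(2))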

What we would change next

  • Run the same grid on a code-heavy corpus (source files, inline comments). We expect chunking effects to invert — code is highly structured and overlap can introduce spurious matches.
  • Add a reranker axis. Intuition says reranking compensates for weaker chunking more than better chunking compensates for no reranker. We’ll find out.
  • Bring in HyDE and Multi-Query as strategy axes on the winning chunking config, to measure the reasoning-question gap explicitly.

If you have a corpus you want benchmarked with this methodology, open an issue on the repo — the pipeline runs end-to-end from a simple YAML, so it ports cleanly.