Mistake-proof your AI.
Ship only the configuration that won.

Yoke Agent turns the guesswork of “is this prompt, chunking strategy, retriever, or agent any good?” into a reproducible grid-search workflow — with RAGAS scores, G-Eval rubrics, and an improvement report at the end.

9 providers: OpenAI, Anthropic, Gemini, Cohere, Azure, Ollama, HF, Claude Code, custom.

7 RAG strategies: HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic.

14 agent metrics: G-Eval rubrics + deterministic TOOL_CALL accuracy parsing.

Why the name

Poka-Yoke, for AI systems.

Shigeo Shingo coined poka-yoke (ポカヨケ) at Toyota in the 1960s: a physical jig on the assembly line that stops a defective part from leaving. Yoke Agent is that jig, applied to AI — it makes it impossible to ship a worse configuration than the one you measured.

The promise

Don’t ship what you hope works. Ship what you measured.

01

Scientific, not anecdotal

Every claim about a configuration is backed by a reproducible grid-search and a numeric score. You stop arguing about which prompt feels better and start comparing 0.87 against 0.82.

02

Two disciplines, one studio

RAG tuning and agent evaluation normally live in separate tools and separate notebooks. Yoke Agent unifies them — shared providers, shared cost tracking, shared report format.

03

Human-in-the-loop where it matters

LLMs propose — datasets, grids, reports — and humans approve. The philosophy is “LLM as a fast junior analyst”, not “LLM as an unchecked oracle”.

Workbenches

One studio.
Two grid-search workbenches.

Each workbench is a guided pipeline: ingest → generate tests → grid-search → score → improvement report. No blank canvas, no notebook sprawl.
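
The grid-search stage is conceptually a Cartesian product over whatever axes you choose. A minimal sketch in Python (the axis names and the run_combo helper are illustrative, not Yoke Agent's actual config keys):

```python
import itertools

# Hypothetical grid axes -- real axis names come from your workbench config.
grid = {
    "chunk_size": [256, 512, 1024],
    "retriever": ["dense", "hybrid"],
    "strategy": ["baseline", "hyde"],
}

def run_combo(combo: dict) -> float:
    """Placeholder: run the pipeline for one combination, return its score."""
    return 0.0

# Every combination of every axis: 3 * 2 * 2 = 12 runs.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Score each combination; ship only the argmax.
scores = {tuple(c.items()): run_combo(c) for c in combos}
best_config = dict(max(scores, key=scores.get))
```

The real workbench wraps cost tracking, parallelism, and report generation around this loop; the core idea is just exhaustive, reproducible enumeration.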

RAG workbench

Tune your retrieval
until the numbers agree.

  1. Ingest

    PDF, DOCX, TXT, Markdown via LangChain loaders. Five vector stores: Chroma, Pinecone, Qdrant, pgvector, Astra DB.

  2. Generate & review dataset

    LLM-written Q&A pairs from your docs. Edit, add, remove, reorder. Or import a curated JSON set.

  3. Grid-search

    Chunking, embeddings, retrievers, rerankers. Advanced strategies wired in as first-class axes — not plugins.

    HyDE · CRAG · Self-RAG · RAPTOR · GraphRAG · Multi-Query · Agentic
  4. Score with RAGAS + custom hooks

    Faithfulness, answer relevancy, context precision/recall. Optional correctness, similarity, entity recall, noise sensitivity. Plus your own metrics via a hook.

  5. Read the improvement report

    Plain-language recommendation on which configuration to ship and why.
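
The custom-metrics hook mentioned in step 04 can be pictured as a plain callable that takes a question, the generated answer, and the retrieved contexts, and returns a number. The signature below is an assumption for illustration, not Yoke Agent's documented hook API:

```python
def keyword_coverage(question: str, answer: str, contexts: list[str]) -> float:
    """Toy metric: fraction of question keywords (>3 chars) found in the answer.

    The (question, answer, contexts) signature is a hypothetical hook shape,
    shown only to illustrate where a custom metric slots in beside RAGAS.
    """
    keywords = {w.lower() for w in question.split() if len(w) > 3}
    if not keywords:
        return 1.0
    hits = {w for w in keywords if w in answer.lower()}
    return len(hits) / len(keywords)

score = keyword_coverage(
    "Which vector stores does the studio support?",
    "It supports five vector stores, including Chroma and Qdrant.",
    ["retrieved context would go here"],
)
```

Deterministic metrics like this complement RAGAS's LLM-judged scores: they are cheap, repeatable, and immune to judge drift.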

Agent workbench

Stress-test agents
with simulated users.

  1. Define your agent

    System prompt, tools (native + MCP servers), guardrails. fastmcp on both sides: consume external MCP servers or expose your own.

  2. Pick personas & scenarios

    Synthetic users with distinct goals and styles. Four scenario types, each with its own rubric.

    tool_use · decision_path · response_quality · guardrail
  3. Simulate in parallel

    Every grid combination runs against every persona. Full transcripts and tool invocations captured.

  4. Score with G-Eval + deterministic overrides

    14 rubric metrics from tool-call accuracy to refusal accuracy and hallucination. Tool use is parsed from TOOL_CALL invocations, not guessed from prose.

  5. Drill into the leaderboard

    All-metrics ranking, collapsible improvement report, per-transcript drill-down with tool traces.
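
The deterministic tool-call scoring in step 04 can be sketched as a regex pass over the transcript. The "TOOL_CALL <name> <json-args>" line format below is a guess for illustration, not the tool's actual wire format:

```python
import json
import re

# Hypothetical transcript convention: one "TOOL_CALL <name> <json-args>" line
# per invocation. The real format is whatever your agent runtime emits.
TOOL_CALL_RE = re.compile(r"^TOOL_CALL (\w+) (\{.*\})$", re.MULTILINE)

def extract_tool_calls(transcript: str) -> list[tuple[str, dict]]:
    """Parse tool invocations mechanically instead of asking a judge LLM."""
    return [(name, json.loads(args)) for name, args in TOOL_CALL_RE.findall(transcript)]

transcript = (
    "User: what's the weather in Kyoto?\n"
    'TOOL_CALL get_weather {"city": "Kyoto"}\n'
    "Agent: It is sunny in Kyoto.\n"
)
calls = extract_tool_calls(transcript)
```

Tool-call accuracy can then be computed by comparing the parsed calls against the scenario's expected calls, with no judge LLM interpreting prose.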

Platform features

Governance is the default code path.

Cost tracking, observability and multi-tenancy aren’t bolted on — they’re how runs are instrumented by default.

Multi-provider by default

OpenAI, Anthropic, Google Gemini, Cohere, Azure, Ollama, HuggingFace, Claude Code CLI, and any OpenAI-compatible endpoint (LM Studio, vLLM). Keys live in the backend DB, never the browser.

Flex tier, detected automatically

Opt-in ~50% discount on OpenAI o3/o4-mini/gpt-4.1 and Gemini 2.5.x/3.x. Flex detection is logged per-call — no manual bookkeeping.

Token & cost tracking

Every LLM and embedding call is logged with estimated USD cost. Sweeps that look cheap up front stop being surprises at the end of the month.
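
Per-call cost logging is simple arithmetic once token counts are known. A sketch with placeholder prices (not current provider rates):

```python
# Placeholder per-million-token prices -- look up your provider's current rates.
PRICES_PER_M = {"input": 2.50, "output": 10.00}

def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one LLM call at the placeholder rates above."""
    return (
        input_tokens / 1_000_000 * PRICES_PER_M["input"]
        + output_tokens / 1_000_000 * PRICES_PER_M["output"]
    )

# Back-of-envelope for a sweep: 144 combinations x 20 questions,
# assuming ~1k input / ~300 output tokens per call (before judge calls).
sweep_usd = 144 * 20 * call_cost_usd(1_000, 300)
```

Multiplying the per-call figure through a whole sweep makes the month-end surprise visible before the sweep starts.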

OpenTelemetry & OTLP

Production-grade observability out of the box. Pipe runs into Grafana, Honeycomb or Datadog without writing glue.
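
If the stack honors the standard OpenTelemetry environment variables (the usual OTLP setup), pointing runs at your collector looks roughly like this; the endpoint and header values are placeholders:

```shell
# Standard OpenTelemetry env vars; endpoint and header values are placeholders.
export OTEL_SERVICE_NAME="yoke-agent"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
# Vendors that need auth take it via headers, e.g.:
# export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=<your-key>"
```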

Multi-tenancy & auth

Workspace isolation with first-class authentication. One installation, multiple teams, clean boundaries.

Failure categorization

Fine-grained breakdown of why a run failed — not just that it did. Patterns surface across configurations.

Self-hosted, no data egress

Your documents, questions and transcripts stay on your infrastructure. Only LLM calls leave, to providers you explicitly configure.

REST + CLI + MCP

Fully documented REST API (/docs, /redoc), yoke-agent/cli for scripted workflows, MCP both consumed and exposed.

Docker Compose, one command

make dev or docker compose up and you have the full stack: FastAPI backend, Next.js dashboard, workers and Chroma ready to go.

Why teams pick it

Ten reasons it beats a Jupyter notebook.

Most “evaluation tools” stop at chunk-size and top-k. Yoke Agent goes a lot further — and keeps the receipts.

  1. Scientific, not anecdotal

    Every configuration claim is backed by a reproducible grid search and a numeric score. 0.87 vs 0.82, not vibes.

  2. RAG and agents, one dashboard

    Shared providers, shared cost tracking, shared improvement-report format. Two disciplines that finally share a home.

  3. Provider-agnostic by design

    Nine providers, including fully local (Ollama, LM Studio) and CLI-based (Claude Code) — no API key required to start.

  4. Built-in cost governance

    Token + cost logging is the default path. Flex-tier detection auto-logs the discount when providers return it.

  5. State-of-the-art retrieval

    HyDE, CRAG, Self-RAG, RAPTOR, GraphRAG, Multi-Query, Agentic — first-class grid axes, not add-ons.

  6. Human-in-the-loop by default

    LLMs propose, humans approve. Datasets, grids and reports all go through a review step before anything locks.

  7. Objective tool-use scoring

    Tool calls parsed deterministically from TOOL_CALL invocations. Accuracy never depends on a judge’s reading of prose.

  8. Self-hosted, zero egress

    Your docs, questions and transcripts never leave your infra. Only LLM calls go out — to endpoints you explicitly configure.

  9. Production observability

    OpenTelemetry + OTLP means runs fit into the stack you already operate. Grafana, Honeycomb, Datadog — no glue needed.

  10. Open source & extensible

    MIT-licensed. Pydantic schemas, clean FastAPI routers, pluggable retrievers, custom-metrics hook. Meant to be read and forked.

Honest trade-offs

A product page without trade-offs is marketing, not a product page.

These are the real constraints. Pin them up before you plan your rollout.

Self-hosted only

No managed SaaS, no hosted SSO tier, no hands-off upgrades. You run Docker Compose or make dev yourself.

Judge-bound quality

RAGAS and G-Eval rely on an LLM-as-judge. Relative rankings across a grid are trustworthy; absolute scores aren’t ground truth.

Grid cost scales fast

4×3×4×3 = 144 runs; with 20 questions that’s 2,880 LLM calls before the judge. No built-in spend cap — scope your sweeps.

Vector-store breadth = setup

Five options is great until you pick one. Chroma is the sensible default; Pinecone/Qdrant/pgvector/Astra you operate yourself.

GraphRAG & RAPTOR are heavy

Powerful, but computationally expensive on large corpora. Expect long first-run indexing and real memory pressure.

Personas are synthetic

LLM-driven simulated users surface many real failure modes — but they don’t replace actual users, and under-represent long-tail oddness.

Young, fast-moving surface

Features land quickly (multi-tenancy, OTel, custom metrics, failure categorization). Pin versions and read the release notes before upgrading.

At a glance

The whole stack on one page.

Category: AI Quality Studio — RAG + Agent evaluation
License: Open source · see LICENSE
Deployment: Self-hosted · Docker Compose or make dev
Backend: Python 3.11+ · FastAPI · SQLAlchemy · Pydantic v2
Frontend: Next.js 14 · React 18 · TypeScript · Tailwind
Storage: SQLite default · Postgres via DATABASE_URL
Providers: OpenAI · Anthropic · Gemini · Cohere · Azure · Ollama · HF · Claude Code CLI · custom OpenAI-compatible
Vector stores: ChromaDB · Pinecone · Qdrant · pgvector · Astra DB
RAG eval: RAGAS — 4 fixed + 4 optional metrics + custom-metrics hook
Agent eval: G-Eval — 14 rubric metrics + deterministic tool-call parsing
Observability: OpenTelemetry · OTLP export
Multi-tenancy: Yes · with auth

Get started

Stop vibe-checking prompts.
Start measuring agents.

Clone the repo, run make dev, and you’ll be grid-searching your first RAG pipeline in about ten minutes.

Terminal
$ git clone https://github.com/yoke-agent/yoke-agent.git
$ cd yoke-agent
$ make dev
# → dashboard at http://localhost:3000
# → API docs at http://localhost:4040/docs