Self-hosted LLM evaluation: a 2026 guide.
The SaaS evaluation platforms are good. Self-hosting is better when specific things are true. Here is when to choose it, what to demand from the tool, and what the migration actually looks like.
Until recently, self-hosting an LLM evaluation platform was a hair-shirt exercise. The open-source options were libraries, not products; the hosted SaaS had all the good UIs. That gap closed in 2025. There are now at least four self-hostable evaluation stacks that give you dashboards, experiment tracking and production observability without sending a byte of your data to someone else’s cloud.
This guide is about when to pick self-hosted, what the non-negotiable criteria are, how the landscape compares, and what a migration from SaaS actually looks like.
Why self-hosted in 2026
Self-hosting is not a lifestyle choice. It is the right answer when at least one of these is true:
- Your documents cannot leave your infrastructure. Legal, medical, financial, defence — regulated corpora. SaaS eval platforms upload your retrieved context and LLM responses. That is a non-starter for anyone with a BAA, SOC 2 customer commitment, or sovereignty requirement.
- Your prompts are IP. For many teams the system prompt is the product. Shipping it to a third-party observability service is the same mistake as putting your source code there.
- Your transcripts contain PII by design. Agent logs for a customer support product are full of customer data you cannot anonymise without destroying signal.
- You already have observability. If you run Grafana, Datadog or Honeycomb today and can pipe OTel spans into them, a second dashboard with the same data elsewhere is duplicated cost with worse ergonomics.
- You need cost predictability. SaaS evaluation charges per span or per evaluation, so the bill scales with usage and becomes hardest to forecast exactly when traffic is growing fastest.
If none of those is true, self-hosted is probably the wrong choice. The hosted SaaS options (LangSmith, Confident AI, Maxim AI, Braintrust) are genuinely more turnkey for teams without data-residency pressure.
Seven criteria to demand
When evaluating a self-hosted eval stack in 2026, these are the questions worth asking. If the tool fails on any of these, it is a library dressed up as a product.
1. Docker Compose up-and-running in one command
Non-negotiable. The tool should ship with a Docker Compose deployment that brings up backend, frontend, worker, and vector store in under ten minutes. If setup requires a Kubernetes operator and a service mesh, you will never actually install it.
2. Multi-provider support, keys in the backend
Your evaluation needs to call some LLM — OpenAI, Anthropic, a local model via Ollama, whatever. API keys live in the backend database, never in the browser, and the tool supports at least five providers. Single-vendor lock-in here is bizarre; you want the option to evaluate against multiple judges.
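What that looks like in practice, as a minimal sketch (the provider table and resolve_key helper below are illustrative, not any particular tool's API):

```python
import os

# Illustrative provider registry. Keys are resolved server-side from
# environment variables (or an encrypted settings table) at call time;
# nothing key-shaped is ever serialized into a response to the browser.
PROVIDERS = {
    "openai":    {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "anthropic": {"base_url": "https://api.anthropic.com", "key_env": "ANTHROPIC_API_KEY"},
    "ollama":    {"base_url": "http://localhost:11434",    "key_env": None},  # local, no key
}

def resolve_key(provider: str) -> str | None:
    """Look up the API key on the backend. Raises if a required key is absent."""
    cfg = PROVIDERS[provider]
    if cfg["key_env"] is None:
        return None
    key = os.environ.get(cfg["key_env"])
    if not key:
        raise RuntimeError(f"{provider}: set {cfg['key_env']} on the backend")
    return key
```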
3. OpenTelemetry, not bespoke tracing
Every run should emit OTel spans exportable over OTLP. This is how it integrates with whatever observability stack you already run. If the tool has its own proprietary trace format, you are paying the integration cost twice.
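Concretely, with the standard opentelemetry-sdk and OTLP exporter packages, emitting a span per evaluation call is a few lines; the span and attribute names below are illustrative, not a fixed schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point OTLP at whatever collector you already run (otel-collector,
# Grafana Alloy, Datadog agent). 4317 is the default gRPC port.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("eval")

# One span per evaluation call; attribute names are illustrative.
with tracer.start_as_current_span("eval.judge_call") as span:
    span.set_attribute("eval.metric", "faithfulness")
    span.set_attribute("llm.model", "gpt-4o-mini")
    span.set_attribute("llm.tokens.total", 812)
```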
4. Cost tracking as the default path
Token and USD cost logged per call, not as a feature you enable. Evaluation sweeps are expensive; you need to know before the month closes.
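The arithmetic is trivial, which is why it should be the default path rather than an opt-in. A sketch with illustrative per-million-token prices (real prices change; check your provider):

```python
# Illustrative USD prices per million tokens; real numbers change often.
PRICE_PER_MTOK = {
    "gpt-4o-mini":       {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 1,000-case sweep with one judge call per case adds up quickly:
per_case = call_cost_usd("claude-sonnet-4-5", input_tokens=2_000, output_tokens=300)
print(f"~${per_case * 1_000:.2f} for one sweep")  # ~$10.50 at these example prices
```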
5. Multi-tenancy with real auth
Workspace isolation, first-class authentication, and the ability to run multiple teams on one install. Otherwise you end up running a copy per team.
6. Extensible metrics via code, not UI
Domain-specific quality rules matter more than any built-in. The tool should let you register a custom metric as a Python function (or equivalent) and have it show up in the leaderboard alongside the built-ins. Locking metrics behind a UI builder is a long-term mistake.
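As a sketch of the shape to demand (the metric decorator and EvalCase type here are hypothetical stand-ins for whatever the tool actually exposes):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    # Hypothetical shape; a real tool passes something equivalent.
    question: str
    answer: str
    retrieved_context: list[str]

def metric(name: str):
    """Stand-in for the tool's registration decorator."""
    def register(fn):
        fn.metric_name = name  # a real tool would add fn to its metric registry
        return fn
    return register

@metric("cites_retrieved_context")
def cites_retrieved_context(case: EvalCase) -> float:
    """Domain rule: 1.0 if the answer contains at least one retrieved
    chunk verbatim, else 0.0. No LLM judge needed for this one."""
    return 1.0 if any(chunk in case.answer for chunk in case.retrieved_context) else 0.0
```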
7. No data egress by default
When you run a sweep, which bytes leave your infrastructure? Only the LLM calls to the providers you explicitly configured. Not telemetry to the vendor. Not “anonymised usage data”. Not error reports. If the tool phones home, it fails the self-hosted test.
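One cheap way to verify this is an egress allowlist: run a sweep in a sandbox, capture outbound hostnames from your proxy or firewall logs, and fail if anything beyond your configured providers was contacted. A rough sketch, assuming you have that host list:

```python
# Hosts eval traffic is allowed to reach: only the providers you configured.
ALLOWED_HOSTS = {"api.openai.com", "api.anthropic.com", "localhost"}

def check_egress(observed_hosts: set[str]) -> None:
    """Fail loudly if the tool contacted anything outside the allowlist
    (vendor telemetry, error reporting, 'anonymised usage data', ...)."""
    unexpected = observed_hosts - ALLOWED_HOSTS
    if unexpected:
        raise SystemExit(f"Unexpected egress to: {sorted(unexpected)}")

# observed_hosts would come from your proxy or firewall logs.
check_egress({"api.openai.com", "localhost"})
```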
The open-source landscape in 2026
Briefly, the four serious contenders worth self-hosting:
- Yoke Agent. Full studio — grid-search for RAG, agent simulation, 14 G-Eval rubrics, deterministic tool-call scoring, cost tracking, OTel. Apache 2.0. Self-hosted only — no SaaS option, by design.
- Arize Phoenix. Strong on observability and embedding visualization. OpenTelemetry-native. Ships with RAG evals and experiment tracking. Complementary to Yoke if your pain is production drift.
- DeepEval (open-source core). Pytest-style library, Apache 2.0, strong metric breadth. Self-hosted in the sense that you import and run it locally; Confident AI is the hosted dashboard layer (which you do not have to use).
- MLflow GenAI. If you already run MLflow for classical ML, its GenAI evaluation surface is reasonable. Includes a ConversationSimulator for agent testing. Weaker on RAG-specific metrics than the others.
Full matrix, with the trade-offs and decision tree, on the Alternatives page.
Migrating off a SaaS platform
If you are moving from a hosted platform (LangSmith, Braintrust, Confident AI) to self-hosted, the practical migration path is:
Week 1: stand it up, in parallel
Install the self-hosted tool next to your existing SaaS. Do not cut anything over. Pick one project, re-run its last evaluation on the new stack, and compare scores. This surfaces metric-implementation differences early — they exist.
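A small sketch of that comparison, assuming both platforms can export per-case scores as JSON keyed by case id (the file shapes here are assumptions; adapt to each platform's actual export):

```python
import json

def load_scores(path: str) -> dict[str, float]:
    """Expects {"case_id": score, ...}; adapt to each platform's export shape."""
    with open(path) as f:
        return json.load(f)

old = load_scores("saas_run.json")
new = load_scores("selfhosted_run.json")

# Flag cases where two implementations of "the same" metric disagree by
# more than a tolerance; these are the differences worth investigating.
for case_id in sorted(old.keys() & new.keys()):
    delta = new[case_id] - old[case_id]
    if abs(delta) > 0.1:
        print(f"{case_id}: {old[case_id]:.2f} -> {new[case_id]:.2f} ({delta:+.2f})")
```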
Week 2: export your history
Export datasets, test cases and historical run data from the SaaS. Most serious SaaS tools have an export API; the worst just have a CSV download. Either way, get it out and into the new tool so you can do year-over-year comparisons.
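Whatever the export format, normalise it once into a neutral shape the new tool can import. A sketch with hypothetical field names; your vendor's export will differ:

```python
import csv
import json

def normalise_row(row: dict) -> dict:
    """Map one exported test case into a neutral schema. The source field
    names here are hypothetical; check your vendor's export docs."""
    return {
        "input": row["input"],
        "expected_output": row.get("expected_output", ""),
        "tags": [t for t in row.get("tags", "").split(";") if t],
    }

with open("saas_export.csv", newline="") as src, open("dataset.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(normalise_row(row)) + "\n")
```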
Week 3: re-wire CI
Your regression-test pipeline probably imports a SaaS client. Swap the import to the self-hosted equivalent. This is usually the smallest change and the most psychologically significant: once CI runs against self-hosted, the migration is effectively irreversible.
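The swap itself tends to be a few lines, and if the client APIs differ, a one-file adapter keeps the rest of CI untouched. Module names and the ingest endpoint below are hypothetical:

```python
# Before: the regression suite logged results to the SaaS.
# from saas_eval_client import log_run          # hypothetical SaaS client

# After: same call shape, pointed at your own install.
# from selfhosted_eval_client import log_run    # hypothetical self-hosted client

# If the APIs differ, a thin adapter keeps the rest of CI unchanged:
import requests  # assumes the self-hosted tool exposes an HTTP ingest endpoint

def log_run(run_name: str, scores: dict[str, float]) -> None:
    """Adapter matching the old client's signature; forwards to the new backend."""
    requests.post(
        "http://eval.internal:8000/api/runs",  # hypothetical endpoint
        json={"name": run_name, "scores": scores},
        timeout=10,
    )
```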
Week 4: cancel the contract, keep the dashboards
You will feel the loss of one or two specific views (inevitably the view you didn’t realise you depended on). Rebuild those in the self-hosted tool’s dashboards or as a small custom view over the OTel data. Budget two days for this; it always takes more than expected and less than feared.
The honest downsides
Self-hosted has real costs:
- Upgrades are on you. New releases land weekly. If you don’t have someone pinning versions and running upgrade cycles, drift compounds fast.
- No SLA. If the tool breaks on a Friday night, you fix it. There is no support ticket queue.
- Feature lag. SaaS products ship features that self-hosted open-source catches up to a few months later. You are perpetually 3-6 months behind the bleeding edge of UX.
- SSO / enterprise identity. Usually requires the paid tier (or not available at all on open-source-only projects). Budget for it.
Decision recap
Self-hosted wins when data residency is real, your eval spend is large, and you already run observability. SaaS wins when you need SSO, SLAs and fast time-to-value for a small team. There is no single right answer — there is only your answer, which depends on constraints you know and we don’t.
If you have decided self-hosted is the right direction and want a studio that covers RAG grid-search, agent simulation and production observability in one install, Yoke Agent is purpose-built for it. One make dev and the full stack is running.