The stack, indexed

Every capability, with a receipt.

23 capabilities across 4 areas. Each one links to its proof: a reproducer command, a benchmark artifact, the GitHub issue where it shipped, or the docs page that runs it. If a claim here ever drifts from the code, that is a bug: file it.

The failures are indexed too, on the track record: the retraction, the failed determinacy gate, and the benchmark bug our own tooling caught.

Core evaluation

Scores you can act on, with the uncertainty attached.

44 evaluators across 8 categories

Deterministic checks, LLM-judge metrics, RAG, agent-trace, conversation, consistency, compliance, and multimodal — one API across all of them.

Evaluator docs → · /eval →

Calibrated judge thresholds

LLM-judge thresholds are fit per judge model against human-labeled data, and the calibration provenance ships inside the wheel — a judge swap warns instead of silently drifting.

Calibration docs → · Head-to-head benchmark →

Wilson + bootstrap CIs, power warnings

Every pass rate carries a Wilson 95% CI, every average score a bootstrap CI, and the framework refuses to call a winner when intervals overlap. Underpowered runs warn before spending, not after the wrong release.

Statistical rigor docs → · Benchmark methodology →
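
For readers who want the arithmetic behind the card above, here is a minimal standard-library sketch of a Wilson interval, a percentile bootstrap interval, and the overlap check. The function names are hypothetical and this is not the multivon-eval API; the package's own resampling scheme and defaults may differ.

```python
import random
from statistics import NormalDist, mean


def wilson_interval(passes: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * ((p * (1 - p) / total + z * z / (4 * total * total)) ** 0.5) / denom
    return max(0.0, center - half), min(1.0, center + half)


def bootstrap_mean_ci(scores, confidence=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for a mean score, seeded for repeatability."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(resamples))
    lo_idx = int((1 - confidence) / 2 * resamples)
    hi_idx = int((1 + confidence) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


def intervals_overlap(a, b) -> bool:
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


baseline = wilson_interval(80, 100)
candidate = wilson_interval(85, 100)
# Overlapping intervals: a sketch of "refuse to call a winner".
verdict = "no winner" if intervals_overlap(baseline, candidate) else "winner"
```

With 100 cases per arm, 80% and 85% pass rates produce overlapping intervals, so the sketch declines to name a winner.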

Experiment tracking & paired A/B comparison

exp.compare runs paired McNemar tests with Benjamini–Hochberg correction across evaluators, reports Cohen's h, and tells you how many more cases you need at 80% power.

Experiments docs →
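
The statistics named in the card above can be sketched with the standard library alone. This is an illustration under assumptions, not the body of exp.compare: the function names are hypothetical, the McNemar variant shown is the exact binomial form, and the power calculation is left out.

```python
from math import asin, comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b = cases that passed in run A but failed in run B; c = the reverse.
    Cases where both runs agree carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# One p-value per evaluator, then a single correction across all of them.
raw = [mcnemar_exact(1, 9), mcnemar_exact(4, 6), mcnemar_exact(0, 7)]
corrected = benjamini_hochberg(raw)
```

Pairing matters here: each case is scored by both runs, so only the cases where the two verdicts disagree enter the test.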

Local report browser (view --dir)

0.15.0

Point view at a folder of runs for a sortable index, open any report rendered verbatim, or diff two runs. The diff shows pass-rate and McNemar deltas, and each regressed case stacks both runs' judge reasons so you can read why a verdict flipped. Read-only, local, no new dependencies.

view --dir docs → · Design + ship record (#15) →

Bootstrap: cold-start a suite from your product

A product description plus sample traces becomes a runnable EvalSuite — evaluators selected for your shape, thresholds calibrated from your traces, adversarial seed cases. ~$0.12 on default settings, $0 with an Ollama judge.

Bootstrap docs →

Scaled, gated case generation

0.12.0

--n-seed-cases scales to 500 behind structural, duplicate, and hardness gates, and the report names every reject: “generated 500, accepted 431 — dropped 38 duplicates, 12 malformed.”

Design + ship record (#11) → · Bootstrap docs →
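
A rough sketch of how a gated pipeline can account for every reject, so that accepted plus dropped always equals generated. Gate names, the case schema, and the helper functions are all hypothetical; only structural and duplicate gates are shown, and the hardness gate is omitted because its definition is not given here.

```python
from collections import Counter


def is_well_formed(case, accepted):
    """Structural gate: the case needs a non-empty string input."""
    value = case.get("input")
    return isinstance(value, str) and bool(value.strip())


def _key(case):
    # Case-insensitive, whitespace-insensitive comparison key.
    return " ".join(case["input"].lower().split())


def is_not_duplicate(case, accepted):
    """Duplicate gate: reject a case whose key matches an accepted case."""
    return all(_key(case) != _key(prior) for prior in accepted)


def run_gates(cases, gates):
    """Apply gates in order; the first failing gate names the reject reason."""
    accepted, rejects = [], Counter()
    for case in cases:
        for reason, passes in gates:
            if not passes(case, accepted):
                rejects[reason] += 1
                break
        else:
            accepted.append(case)
    return accepted, rejects


def summarize(total, accepted, rejects):
    dropped = ", ".join(f"{n} {reason}" for reason, n in rejects.items())
    return f"generated {total}, accepted {len(accepted)}; dropped {dropped or 'nothing'}"


GATES = [("malformed", is_well_formed), ("duplicates", is_not_duplicate)]
cases = [
    {"input": "What is 2+2?"},
    {"input": "what is   2+2?"},
    {"input": ""},
    {"input": "What is your refund policy?"},
]
accepted, rejects = run_gates(cases, GATES)
line = summarize(len(cases), accepted, rejects)
```

Because each case either passes every gate or is charged to exactly one reason, the tally cannot silently lose a case.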

Input-quality gate — vet input before you spend

0.14.0

A free, deterministic preflight checks trace count, field completeness, near-duplicate ratio, and PII density before generation runs. It warns honestly (never silently blocks, never a vanity score) when inputs can't support a trustworthy eval set.

Design + ship record (#14) → · Synthetic-data docs →
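
To make the four preflight checks concrete, here is a deterministic sketch that costs nothing to run. Everything in it is an assumption for illustration: the trace schema, the thresholds, word-shingle Jaccard similarity as the near-duplicate measure, and email addresses as a stand-in for PII. The shipped gate may measure each of these differently.

```python
import re

# Assumption: email addresses stand in for PII; a real check would cover more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def shingles(text, k=3):
    """Set of k-word windows used to compare two inputs."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def preflight(traces, required=("input", "output"), min_traces=20, dup_threshold=0.8):
    """Return metrics plus warnings. It reports; it never blocks."""
    n = len(traces)
    complete = sum(all(str(t.get(f) or "").strip() for f in required) for t in traces)
    sets = [shingles(str(t.get("input") or "")) for t in traces]
    # Quadratic comparison: fine for a sketch, too slow for large trace sets.
    near_dupes = sum(
        any(jaccard(sets[i], sets[j]) >= dup_threshold for j in range(i))
        for i in range(n)
    )
    with_pii = sum(
        bool(EMAIL_RE.search(" ".join(str(v) for v in t.values()))) for t in traces
    )
    report = {
        "trace_count": n,
        "field_completeness": complete / n if n else 0.0,
        "near_duplicate_ratio": near_dupes / n if n else 0.0,
        "pii_density": with_pii / n if n else 0.0,
        "warnings": [],
    }
    if n < min_traces:
        report["warnings"].append(f"only {n} traces; at least {min_traces} suggested")
    if report["field_completeness"] < 1.0:
        report["warnings"].append("some traces are missing required fields")
    if report["near_duplicate_ratio"] > 0.0:
        report["warnings"].append("near-duplicate inputs found")
    if report["pii_density"] > 0.0:
        report["warnings"].append("possible PII found")
    return report


traces = [
    {"input": "how do I reset my password", "output": "Use the reset link."},
    {"input": "how do I reset my password", "output": "Use the reset link."},
    {"input": "email me at jo@example.com please", "output": "Done."},
    {"input": "what is your refund policy", "output": ""},
]
report = preflight(traces)
```

The report carries raw ratios and plain warnings rather than a single blended score, which keeps each problem visible on its own.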

Generation toolkit: five ways to make eval data

0.13.0

Deterministic mutations (invariant/flip expectations) and template grids cost nothing; judge-verified contrast pairs, span-grounded doc-QA with refusal-bait unanswerables, and simulate-transcript export round it out. Every generator stamps provenance and reports its rejects.

Synthetic-data docs → · Design + ship record (#13) →

Persona simulator — adaptive multi-turn eval

0.12.0

A persona LLM with a goal and behavior traits converses live with your model, adapting each turn; transcripts are scored by the conversation evaluators. Every output is labeled “simulated — not real traffic,” under a hard budget ceiling.

Simulate docs → · Design + ship record (#10) →

Compliance presets

EU-AI-Act-shaped high-risk suites (jurisdiction-aware), PII/toxicity/bias guardrails, and procurement-grade audit packs with hash-chained NDJSON.

Compliance docs → · Audit-pack format →
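
Hash-chained NDJSON is simple to illustrate: each line stores the hash of the line before it, so editing or deleting any record breaks every later link. The sketch below assumes a record layout (prev, payload, hash) and SHA-256; the actual audit-pack format is defined in the linked docs and may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def _digest(prev_hash, payload):
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(lines, payload):
    """Return a new list of NDJSON lines with one more chained record."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    return lines + [json.dumps(record, sort_keys=True)]


def verify_chain(lines):
    """True only if every record links to its predecessor and hashes correctly."""
    prev_hash = GENESIS
    for line in lines:
        record = json.loads(line)
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


log = []
for verdict in ("pass", "fail", "pass"):
    log = append_record(log, {"evaluator": "toxicity", "verdict": verdict})

# Flip the failing verdict after the fact; verification should now fail.
edited = json.loads(log[1])
edited["payload"]["verdict"] = "pass"
tampered = log[:1] + [json.dumps(edited, sort_keys=True)] + log[2:]
```

A chain like this shows that a log was altered, not who altered it; publishing the final hash somewhere independent is what makes that check useful to an auditor.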

Agent-trace & multimodal evaluation

Tool-call accuracy and necessity, plan quality, task completion over agent traces; VQA faithfulness and document grounding over images and PDFs, with local rasterisation for Ollama VLMs.

Agent-trace docs → · Multimodal evaluators →

Drift & provenance

Code changes; evals rot. This layer notices.

Prompt-drift staleness detection

0.10.0

A committed prompt_baseline.json is diffed against a live scan of every prompt call site, and each site is classified as CHANGED, REMOVED (with the three-way caveat), ADDED, or UNKNOWN. --fail-on turns selected categories into a CI gate.

Staleness docs → · Epic + gate results (#4) →
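
The classification step can be pictured as a dictionary diff. The sketch below assumes a much simpler shape than the real baseline file: a mapping from call-site id to fingerprint, with None meaning the scanner could not resolve the prompt statically. The function names and the gate helper are hypothetical, and the caveat attached to REMOVED is not modelled.

```python
def diff_baseline(baseline, live):
    """Classify call sites by comparing two {site_id: fingerprint or None} maps."""
    result = {"CHANGED": [], "REMOVED": [], "ADDED": [], "UNKNOWN": []}
    for site in sorted(set(baseline) | set(live)):
        if site not in live:
            result["REMOVED"].append(site)
        elif site not in baseline:
            result["ADDED"].append(site)
        elif baseline[site] is None or live[site] is None:
            # One side is unresolved, so no honest comparison is possible.
            result["UNKNOWN"].append(site)
        elif baseline[site] != live[site]:
            result["CHANGED"].append(site)
    return result


def gate_fails(result, fail_on):
    """Stand-in for a CI gate: fail if any selected category is non-empty."""
    return any(result[category] for category in fail_on)


baseline = {"app.py:12": "aaa", "app.py:40": "bbb", "rag.py:7": "ccc", "rag.py:90": "ddd"}
live = {"app.py:12": "aaa", "app.py:40": "bbx", "rag.py:7": None, "tools.py:3": "eee"}
result = diff_baseline(baseline, live)
```

Keeping UNKNOWN as its own bucket is the point: an unresolved prompt is reported as unresolved rather than guessed into CHANGED or left out.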

Honest-unknown by construction

0.10.1–0.11.1

Every report opens with a determinacy headline: your repo's exact static-resolvability ratio. We measured 20.9% across five real repos, below the 50% gate we set and published. Files the scanner can't parse surface as UNSCANNABLE, never as false REMOVED.

Published gate failure (#4) → · 0.11.1 changelog →

Runtime prompt recorder

0.11.0

pytest --record-prompts captures rendered prompt fingerprints at the SDK boundary during eval runs. Static, runtime, and template prompts keep separate trust tiers: the scan proves text, recordings prove only the k-of-N renderings observed, and templates stay deferred.

Recorder docs → · Design + ship record (#9) →

Attribution scan & structured prompt diff

AST-level scan of every prompt call site with NFC-normalized fingerprints (scanner v4); attribution diff renders a PR-ready markdown of exactly which prompts changed between two refs.

Attribution docs →
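
Why NFC normalization matters for fingerprints: the same visible text can be encoded as different code point sequences, and a raw hash would report a change where a reader sees none. A minimal sketch, assuming SHA-256 over UTF-8 bytes; the function name is hypothetical and the scanner's actual hashing scheme is not specified here.

```python
import hashlib
import unicodedata


def prompt_fingerprint(text: str) -> str:
    """Hash the NFC-normalized form so equivalent encodings match."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


composed = "Summarize the caf\u00e9 review."     # single code point for the accented e
decomposed = "Summarize the cafe\u0301 review."  # plain e plus a combining accent

same_after_nfc = prompt_fingerprint(composed) == prompt_fingerprint(decomposed)
raw_bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")
```

Both strings render identically, their raw bytes differ, and their fingerprints agree once normalized.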

Document AI — PDF Hell

Adversarial PDFs with code-based ground truth.

17 trap families, byte-identical reproducibility

Every case is generated from code, so the answer key is exact — no LLM judging an LLM — and the same seed produces the same PDF, byte for byte.

/pdfhell → · Repo + methodology →
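
The idea of code-based ground truth can be shown with a toy generator: one seeded routine produces both the document bytes and the answer key, so nothing has to judge the answer. The sketch emits a tab-separated text table as a stand-in for a PDF, and every name in it is hypothetical. Real PDF output would also need fixed timestamps and document IDs to stay byte-identical.

```python
import hashlib
import random


def make_case(seed: int):
    """Build a document and its exact answer from a single seed."""
    rng = random.Random(seed)  # private generator, no shared global state
    rows = [(f"item-{i}", rng.randint(1, 999)) for i in range(5)]
    document = "\n".join(f"{name}\t{amount}" for name, amount in rows).encode("utf-8")
    # The answer comes from the same rows that produced the document.
    answer = sum(amount for _, amount in rows)
    return document, answer


doc_a, answer_a = make_case(7)
doc_b, answer_b = make_case(7)
digest_a = hashlib.sha256(doc_a).hexdigest()
digest_b = hashlib.sha256(doc_b).hexdigest()

# Re-derive the total from the emitted bytes as an independent check.
parsed_total = sum(int(line.split("\t")[1]) for line in doc_a.decode("utf-8").splitlines())
```

Comparing digests is enough to confirm that two runs of the same seed produced the same bytes.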

Cross-modality testing (--pixels)

0.6.0

Runs the same suite against the PDF and against locally-rasterised pixels. The comparison separated model vision from provider ingestion and nearly inverted the leaderboard. Per-model deltas published.

Leaderboard with deltas → · Findings thread (pdfhell#1) →

Autoresearch trap discovery, six validation gates

Frontier models propose new traps; six gates (parseable, glyph-clean, deterministic, answerable, forbidden-clean, lint-clean) plus fresh-seed replication decide what survives. Full audit trail — every candidate, every cent — in the repo.

Methodology → · /pdfhell →

CI & agent surfaces

The same engine, wherever the work happens.

eval-action — statistical PR gate

v1.2.0

Runs your suite on every PR and posts a comment with per-evaluator Wilson CIs, McNemar p-values, dollar cost, and an opinionated gate verdict. Baselines from a git ref or a committed baseline_report.json.

multivon-ai/eval-action → · CI/CD docs →

multivon-mcp — 22 tools for AI agents

The evaluator surface as MCP tools, so coding agents can score outputs, ingest traces, compare runs, and generate cases mid-session.

multivon-ai/multivon-mcp →

Claude Code skills

/eval-bootstrap, /eval-audit, and /eval-explain ship inside the wheel — one install-skills command and an agent can cold-start, pre-flight, and explain eval suites.

Skills docs →

Local-first, keyless by default

The demo runs with no API key (deterministic tier, or it finds your Ollama and picks an installed model). Judges run against Ollama, LM Studio, vLLM, or any OpenAI-compatible server. No account, no telemetry.

Quickstart →

Sixty seconds to the first number, no API key required:

pip install multivon-eval && python -m multivon_eval demo