The stack, indexed
Every capability, with a receipt.
23 capabilities across 4 capability areas. Each one links to its proof — a reproducer command, a benchmark artifact, the GitHub issue where it shipped, or the docs page that runs it. If a claim here ever drifts from the code, that is a bug: file it.
The failures are indexed too — the retraction, the failed determinacy gate, the benchmark bug our own tooling caught — on the track record.
Core evaluation
Scores you can act on, with the uncertainty attached.
44 evaluators across 7 categories
Deterministic checks, LLM-judge metrics, RAG, agent-trace, conversation, consistency, compliance, and multimodal — one API across all of them.
Calibrated judge thresholds
LLM-judge thresholds are fit per judge model against human-labeled data, and the calibration provenance ships inside the wheel — a judge swap warns instead of silently drifting.
Wilson + bootstrap CIs, power warnings
Every pass rate carries a Wilson 95% CI, every average score a bootstrap CI, and the framework refuses to call a winner when intervals overlap. Underpowered runs warn before spending, not after the wrong release.
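A minimal sketch of the Wilson interval and the overlap rule described above, using only the standard library. The function names `wilson_interval` and `intervals_overlap` and the pass counts are illustrative assumptions, not the framework's API.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z of about 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z * z / total
    centre = (p_hat + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total)
    )
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts for two runs of the same 100-case suite.
baseline = wilson_interval(82, 100)
candidate = wilson_interval(88, 100)

if intervals_overlap(baseline, candidate):
    print("no winner called: intervals overlap", baseline, candidate)
else:
    print("intervals are separate", baseline, candidate)
```

Unlike the normal-approximation interval, Wilson stays inside [0, 1] and behaves sensibly at small sample sizes and at pass rates near 0% or 100%, which is where eval suites tend to sit.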
Experiment tracking & paired A/B comparison
exp.compare runs paired McNemar tests with Benjamini–Hochberg correction across evaluators, reports Cohen's h, and tells you how many more cases you need at 80% power.
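The statistics named above can be sketched with the standard library alone. This is an illustration of the techniques, not the internals of `exp.compare`: the function names, the evaluator names, and the discordant-pair counts are all hypothetical, and the 80%-power sample-size estimate is left out.

```python
import math


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs:
    cases that passed only in run A versus only in run B."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical discordant counts per evaluator: (passed only in A, passed only in B).
discordant = {"faithfulness": (2, 12), "toxicity": (5, 5), "relevance": (1, 9)}
raw = [mcnemar_exact(a, b) for a, b in discordant.values()]
for name, p, q in zip(discordant, raw, benjamini_hochberg(raw)):
    print(f"{name}: p={p:.4f} q={q:.4f}")
print(f"Cohen's h for 88% vs 82%: {cohens_h(0.88, 0.82):.3f}")
```

McNemar only looks at cases where the two runs disagree, which is why pairing the runs on the same cases buys more power than comparing two independent pass rates.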
Local report browser (view --dir)
0.15.0
Point view at a folder of runs for a sortable index, open any report rendered verbatim, or diff two — pass-rate and McNemar deltas with the regressed cases stacking both runs' judge reasons so you read why a verdict flipped. Read-only, local, no new dependencies.
Bootstrap: cold-start a suite from your product
A product description plus sample traces becomes a runnable EvalSuite — evaluators selected for your shape, thresholds calibrated from your traces, adversarial seed cases. ~$0.12 on default settings, $0 with an Ollama judge.
Scaled, gated case generation
0.12.0
--n-seed-cases scales to 500 behind structural, duplicate, and hardness gates, and the report names every reject: “generated 500, accepted 431 — dropped 38 duplicates, 12 malformed.”
Input-quality gate — vet input before you spend
0.14.0
A free, deterministic preflight checks trace count, field completeness, near-duplicate ratio, and PII density before generation runs. It warns honestly (never silently blocks, never a vanity score) when inputs can't support a trustworthy eval set.
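A rough sketch of what a deterministic preflight of this shape could look like. Everything here is an assumption made for illustration: the `preflight` function, the trace field names, the thresholds, the word-shingle Jaccard duplicate test, and the email-only PII check are stand-ins, not the shipped gate.

```python
import re
from itertools import combinations

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude stand-in for PII detection


def shingles(text: str, k: int = 3) -> set[str]:
    """Set of k-word shingles; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def preflight(traces, required=("input", "output"), min_traces=20, dup_threshold=0.8):
    """Return measured ratios plus warnings. It never raises and never blocks."""
    n = len(traces)
    complete = sum(all(t.get(f) for f in required) for t in traces)
    sets = [shingles(str(t.get("input", ""))) for t in traces]
    dups = {j for i, j in combinations(range(n), 2)
            if jaccard(sets[i], sets[j]) >= dup_threshold}
    pii = sum(bool(EMAIL_RE.search(f"{t.get('input', '')} {t.get('output', '')}"))
              for t in traces)
    report = {
        "trace_count": n,
        "completeness": complete / n if n else 0.0,
        "near_duplicate_ratio": len(dups) / n if n else 0.0,
        "pii_ratio": pii / n if n else 0.0,
        "warnings": [],
    }
    if n < min_traces:
        report["warnings"].append(f"only {n} traces; at least {min_traces} suggested")
    if report["completeness"] < 0.9:
        report["warnings"].append("some traces are missing required fields")
    if report["near_duplicate_ratio"] > 0.2:
        report["warnings"].append("many inputs are near-duplicates")
    if report["pii_ratio"] > 0.05:
        report["warnings"].append("inputs appear to contain PII")
    return report
```

The pairwise comparison is quadratic in the number of traces, which is acceptable for a few hundred samples; a larger input would want MinHash or a similar approximation.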
Generation toolkit: five ways to make eval data
0.13.0
Deterministic mutations (invariant/flip expectations) and template grids cost nothing; judge-verified contrast pairs, span-grounded doc-QA with refusal-bait unanswerables, and simulate-transcript export round it out. Every generator stamps provenance and reports its rejects.
Persona simulator — adaptive multi-turn eval
0.12.0
A persona LLM with a goal and behavior traits converses live with your model, adapting each turn; transcripts are scored by the conversation evaluators. Every output is labeled “simulated — not real traffic,” under a hard budget ceiling.
Compliance presets
EU-AI-Act-shaped high-risk suites (jurisdiction-aware), PII/toxicity/bias guardrails, and procurement-grade audit packs with hash-chained NDJSON.
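For readers unfamiliar with hash-chained NDJSON, the idea is that each line stores the hash of the line before it, so editing any record invalidates every later one. The sketch below assumes a record layout of `prev`, `payload`, and `hash` with SHA-256 over canonical JSON; the actual audit-pack schema may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the "previous hash" of the first record


def _digest(prev: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the previous hash and the payload."""
    body = json.dumps({"prev": prev, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(lines: list[str], payload: dict) -> None:
    """Append one NDJSON line whose hash covers the previous line's hash."""
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    lines.append(json.dumps(record, sort_keys=True, separators=(",", ":")))


def verify_chain(lines: list[str]) -> bool:
    """Recompute every hash and check each record links to its predecessor."""
    prev = GENESIS
    for line in lines:
        record = json.loads(line)
        if record["prev"] != prev:
            return False
        if record["hash"] != _digest(record["prev"], record["payload"]):
            return False
        prev = record["hash"]
    return True


log: list[str] = []
for case_id, verdict in [("c1", "pass"), ("c2", "pass"), ("c3", "fail")]:
    append_record(log, {"case": case_id, "verdict": verdict})
print("chain intact:", verify_chain(log))
```

An auditor needs only the file itself to run the verification, with no access to the system that produced it.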
Agent-trace & multimodal evaluation
Tool-call accuracy and necessity, plan quality, task completion over agent traces; VQA faithfulness and document grounding over images and PDFs, with local rasterisation for Ollama VLMs.
Drift & provenance
Code changes; evals rot. This layer notices.
Prompt-drift staleness detection
0.10.0
A committed prompt_baseline.json diffs against a live scan of every prompt call site: CHANGED, REMOVED (with the three-way caveat), ADDED, UNKNOWN — and --fail-on turns selected categories into a CI gate.
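The four categories can be pictured as a set comparison between two maps of call site to fingerprint. The sketch below assumes that shape, with `None` standing for a prompt the scan could not resolve. The real prompt_baseline.json schema is not shown here, the three-way caveat on REMOVED is not modelled, and `diff_prompts` and `gate` are hypothetical names.

```python
from typing import Optional

Fingerprints = dict[str, Optional[str]]  # call site -> fingerprint, None if unresolved


def diff_prompts(baseline: Fingerprints, live: Fingerprints) -> dict[str, list[str]]:
    """Classify every call site seen in either the baseline or the live scan."""
    result: dict[str, list[str]] = {"CHANGED": [], "REMOVED": [], "ADDED": [], "UNKNOWN": []}
    for site in sorted(baseline.keys() | live.keys()):
        if site not in live:
            result["REMOVED"].append(site)
        elif site not in baseline:
            result["ADDED"].append(site)
        elif baseline[site] is None or live[site] is None:
            # Cannot compare what could not be resolved: say so, do not guess.
            result["UNKNOWN"].append(site)
        elif baseline[site] != live[site]:
            result["CHANGED"].append(site)
    return result


def gate(diff: dict[str, list[str]], fail_on: list[str]) -> int:
    """Exit code for CI: 1 if any selected category is non-empty, else 0."""
    return 1 if any(diff[category] for category in fail_on) else 0


# Hypothetical call sites and fingerprints.
baseline = {"app/chat.py:42": "aa11", "app/rag.py:10": "bb22", "app/old.py:5": "cc33"}
live = {"app/chat.py:42": "aa11", "app/rag.py:10": "ff99", "app/new.py:7": None}
result = diff_prompts(baseline, live)
print(result, "exit code:", gate(result, ["CHANGED", "REMOVED"]))
```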
Honest-unknown by construction
0.10.1–0.11.1
Every report opens with a determinacy headline (your repo's exact static-resolvability ratio — measured at 20.9% across five real repos, below the 50% gate we set and published), and files the scanner can't parse surface as UNSCANNABLE, never as false REMOVED.
Runtime prompt recorder
0.11.0
pytest --record-prompts captures rendered prompt fingerprints at the SDK boundary during eval runs. Static, runtime, and template prompts keep separate trust tiers: the scan proves text, recordings prove only the k-of-N renderings observed, and templates stay deferred.
Attribution scan & structured prompt diff
AST-level scan of every prompt call site with NFC-normalized fingerprints (scanner v4); attribution diff renders a PR-ready markdown of exactly which prompts changed between two refs.
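Why NFC normalization matters for a fingerprint: the same visible text can be encoded as different code point sequences, and a raw hash would report a prompt change where a reader sees none. A small illustration follows, with SHA-256 assumed as the hash; the scanner's actual fingerprint function is not shown here.

```python
import hashlib
import unicodedata


def fingerprint(prompt: str) -> str:
    """Hash of the NFC-normalized prompt text (SHA-256 is an assumption)."""
    normalized = unicodedata.normalize("NFC", prompt)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


composed = "Summarise the caf\u00e9 menu."     # e-acute as a single code point
decomposed = "Summarise the cafe\u0301 menu."  # 'e' followed by a combining accent

print("raw strings equal: ", composed == decomposed)
print("fingerprints equal:", fingerprint(composed) == fingerprint(decomposed))
```

Editors, operating systems, and copy-paste paths disagree about which form they emit, so without normalization a file re-saved on another machine could register as drift.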
Document AI — PDF Hell
Adversarial PDFs with code-based ground truth.
17 trap families, byte-identical reproducibility
Every case is generated from code, so the answer key is exact — no LLM judging an LLM — and the same seed produces the same PDF, byte for byte.
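The principle can be shown without a PDF library: when the document and its answer key are both derived from one seeded generator, the key is exact and the output is reproducible. The toy generator below emits plain text rather than a PDF, and `generate_case` is a hypothetical name, not one of the 17 trap families.

```python
import hashlib
import random


def generate_case(seed: int) -> tuple[bytes, int]:
    """Build a toy 'invoice' and its answer key from one seeded RNG.

    The ground truth (the total) is computed by the same code that writes
    the document, so no model is needed to grade an extraction.
    """
    rng = random.Random(seed)  # private generator: no shared global state
    amounts = [rng.randint(10, 999) for _ in range(5)]
    lines = [f"Item {i + 1}: {amount}" for i, amount in enumerate(amounts)]
    document = "\n".join(lines).encode("utf-8")
    return document, sum(amounts)


doc_a, answer_a = generate_case(seed=7)
doc_b, answer_b = generate_case(seed=7)

print("same bytes: ", hashlib.sha256(doc_a).hexdigest() == hashlib.sha256(doc_b).hexdigest())
print("same answer:", answer_a == answer_b)
```

A real PDF writer adds further sources of nondeterminism, such as creation timestamps and document IDs, which have to be pinned before byte-for-byte equality holds.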
Cross-modality testing (--pixels)
0.6.0
The same suite, PDF vs locally-rasterised pixels — which separated model vision from provider ingestion and nearly inverted the leaderboard. Per-model deltas published.
Autoresearch trap discovery, six validation gates
Frontier models propose new traps; six gates (parseable, glyph-clean, deterministic, answerable, forbidden-clean, lint-clean) plus fresh-seed replication decide what survives. Full audit trail — every candidate, every cent — in the repo.
CI & agent surfaces
The same engine, wherever the work happens.
eval-action — statistical PR gate
v1.2.0
Runs your suite on every PR and posts a comment with per-evaluator Wilson CIs, McNemar p-values, dollar cost, and an opinionated gate verdict. Baselines from a git ref or a committed baseline_report.json.
multivon-mcp — 22 tools for AI agents
The evaluator surface as MCP tools, so coding agents can score outputs, ingest traces, compare runs, and generate cases mid-session.
Claude Code skills
/eval-bootstrap, /eval-audit, and /eval-explain ship inside the wheel — one install-skills command and an agent can cold-start, pre-flight, and explain eval suites.
Local-first, keyless by default
The demo runs with no API key (deterministic tier, or it finds your Ollama and picks an installed model). Judges run against Ollama, LM Studio, vLLM, or any OpenAI-compatible server. No account, no telemetry.
Sixty seconds to the first number, no API key required:
pip install multivon-eval && python -m multivon_eval demo