What's new across the stack
Recent ships
The view --dir report browser (0.15.0), the input-quality gate (0.14.0), the generation toolkit (0.13.0), and the staleness and recorder layers underneath. See the full roadmap →
0.15.0
view --dir — browse a folder of runs
multivon-eval view --dir opens a local report browser: a sortable index of every run in a directory, each report rendered in full, and a diff between any two that stacks both runs' judge reasons on the cases that flipped. Read-only, local-first, no new dependencies — it extends the existing view command rather than adding a service.
Read more →0.14.0
Input-quality gate — vet inputs before you spend
assess_input and multivon-eval assess run a free, deterministic preflight over your traces — trace count, field completeness, near-duplicate ratio, PII density — before any generation spend. There's no 0-100 score (the vanity metric the gate exists to prevent); it warns rather than blocks, so a WARN can't break a CI. Runs automatically inside bootstrap and generate.
Read more →0.13.0
Generation toolkit — five ways to make eval data
mutate_cases applies deterministic robustness mutations (free); cases_from_template expands parametric grids (free); generate_contrast_pairs writes judge-verified unfaithful twins; span-grounded doc-QA records source offsets and can mix in refusal-bait unanswerables; simulate --export-cases turns transcripts into cases. Every generator stamps provenance, routes through the dedupe gates, and reports its rejects.
Read more →0.12.0
Persona simulator — adaptive multi-turn eval
multivon-eval simulate drives your bot through persona-driven, adaptive multi-turn conversations: a simulator LLM plays the user, reacts to each reply, and stops at a hard budget ceiling. Every transcript carries provenance and is labeled simulated, not real traffic, so the results can't sneak into production metrics.
Read more →0.12.0
Scaled, gated case generation
bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates. The generation report states exactly what survived ("generated 500, accepted 431") and counts every reject, so a thin suite can't masquerade as a big one.
Read more →0.11.0
Runtime prompt recorder
Opt-in pytest --record-prompts captures rendered prompt fingerprints at call time, the way past the 20.9% static-resolvability ceiling. Static scans prove prompt text; recordings prove only the k-of-N renderings observed; templates stay out of scope. Case-to-site bindings are propose-only.
Read more →0.10.0
Evals drift as code changes — now you can see it
multivon-eval staleness diffs a live scan of your repo's prompt call sites against a committed prompt_baseline.json: CHANGED (with before/after fingerprints and the cases bound to that prompt), REMOVED, ADDED, and UNKNOWN for dynamic prompts static analysis can't read. staleness stamp binds cases to the prompts they exercise; bootstrap writes the baseline automatically. Every report opens with a determinacy headline and closes with a blind-spots footer. --fail-on gates CI.
Read more →bootstrap
Cold-start eval generator
multivon-eval bootstrap --product PRODUCT.md --traces TRACES.jsonl emits a tuned EvalSuite + adversarial seed cases + calibrated thresholds + a forwardable DISCOVERY_REPORT.md. Single LLM call, ~$0.12 per bootstrap. The fastest path from "I don't know what to eval" to a runnable suite — and 0.12.0's --n-seed-cases scales it to 500 gated cases.
Read more →intelligent-eval
The intelligent-eval layer
The 0.10-era substrate under bootstrap, in one card: auto_evaluators(case) ranks evaluators heuristically in microseconds (zero LLM cost); generate_adversarial_cases targets 10 failure modes with stress_test routing metadata; an N-shot judge-noise filter keeps only validated-hard cases (+0.80 mean failure-rate separation); a local PII/secret scan redacts before any trace leaves your machine; thresholds calibrate from your own traces (p25 of baseline scores).
install-skills
multivon-eval install-skills (Claude Code)
One-command installer for the three bundled Claude Code skills (eval-bootstrap, eval-audit, eval-explain). Symlinks SKILL.md files from the wheel into ~/.claude/skills/ so pip install -U multivon-eval propagates skill edits without re-running the installer.
Read more →research
pdfhell.research — autoresearch loop for trap discovery
An autoresearch loop (Karpathy pattern) where a rotation of Opus 4-7, GPT-5, and Gemini 2.5 Pro propose adversarial PDF traps against an 8-model eval panel. Six validation gates filter candidates before any vision-eval spend (glyph_clean added in 0.6.1 after the tofu-box bug). $88 total across two overnight runs produced 11 surviving trap families — now in mini-v4 (510 cases). The agent does not merge its own work; a human curator promotes from keep/.
Read more →mini-v4
mini-v4 leaderboard ships
The full mini-v4 suite is 510 cases across 17 trap families; the public leaderboard runs the 170-case mini-v4-sample (first 10 seeds/family). The sample surfaces three real per-trap blind spots: GPT-4o on hidden-OCR invoices (0/10), Anthropic's premium tier on a 3.5pt-footnote trap (0/10), and three models on text-run fragmentation (0/10, measured pre-0.6.1-redesign). The 2026-06-12 pixels runs traced all three to PDF ingestion, not vision.
Read more →