view --dir — local report browser
Shipped
Point multivon-eval view --dir at a folder of report JSONs to get a sortable index of every run, any single report rendered verbatim, and a diff of any two runs. The diff shows pass-rate and McNemar deltas, and each regressed case stacks both runs' judge reasons so you can read why the verdict flipped. Read-only, local, and no new dependencies: it reuses the stdlib server and renderer that view already had. Single-file view <report.json> is unchanged.
What you’d see
Run multivon-eval view --dir runs/ and browse, open, and diff a directory of reports in your browser, fully offline. Shipped in 0.15.0 (2026-06-16).
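To make the diff numbers concrete, here is a minimal sketch of a pass-rate delta and an exact McNemar test over two runs of the same cases. It is an illustration under assumptions, not the renderer's code: the function name mcnemar_exact and the list-of-booleans input are hypothetical, and the real report schema is not shown here.

```python
from math import comb


def mcnemar_exact(base_pass, head_pass):
    """Paired comparison of two runs over the same cases (hypothetical helper).

    base_pass / head_pass: lists of booleans, one verdict per case, same order.
    """
    if len(base_pass) != len(head_pass) or not base_pass:
        raise ValueError("runs must cover the same, non-empty set of cases")

    # Only cases whose verdict flipped carry information for McNemar's test.
    regressed = sum(1 for b, h in zip(base_pass, head_pass) if b and not h)
    fixed = sum(1 for b, h in zip(base_pass, head_pass) if h and not b)
    n = regressed + fixed

    if n == 0:
        p_value = 1.0
    else:
        # Exact two-sided binomial test: under H0 a flip is equally likely
        # to be a regression or a fix.
        k = min(regressed, fixed)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)

    return {
        "pass_rate_delta": (sum(head_pass) - sum(base_pass)) / len(base_pass),
        "regressed": regressed,
        "fixed": fixed,
        "p_value": p_value,
    }
```

The test looks only at discordant pairs, which is why a diff view that lists the regressed cases next to the statistic is a natural fit: those cases are the whole evidence.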
Input-quality gate — assess inputs before you spend
Shipped
assess_input and multivon-eval assess run a free, deterministic preflight over your traces or source docs across four signals: trace count, per-field completeness, near-duplicate ratio, and PII/secret density. The checks reuse machinery the rest of the framework already trusts, so there are no new dependencies and no LLM call. There is deliberately no 0-100 score. The gate warns rather than blocks, so a WARN can't break a CI run, and it runs as a preflight inside bootstrap and generate before the first paid call. --skip-input-gate turns it off but still leaves one line on stderr.
What you’d see
multivon-eval assess traces.jsonl flags thin, duplicative, or PII-heavy inputs before any generation spend. Shipped in 0.14.0 (2026-06-16).
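One of the four signals, the near-duplicate ratio, can be sketched with word shingles and Jaccard similarity. This is a generic technique chosen for illustration; whether assess uses shingling, the 0.8 threshold, and the helper names below are all assumptions.

```python
def shingles(text, k=3):
    """Set of k-word shingles, lowercased and whitespace-normalized."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_ratio(texts, threshold=0.8):
    """Fraction of inputs that closely match an earlier, already-kept input.

    threshold is an assumed cut-off, not a value taken from the tool.
    """
    kept = []
    duplicates = 0
    for text in texts:
        signature = shingles(text)
        if any(jaccard(signature, seen) >= threshold for seen in kept):
            duplicates += 1
        else:
            kept.append(signature)
    return duplicates / len(texts) if texts else 0.0
```

Because the check is pure set arithmetic, it is deterministic and free to run, which matches the "no LLM call" property described above.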
Generation toolkit — five ways to make eval data
Shipped
mutate_cases applies deterministic robustness mutations (typo/whitespace/case noise, unicode confusables, punctuation strip, conservative negation flip) with invariant/flip expectations; cases_from_template expands parametric grids over named axes, full product or greedy pairwise; generate_contrast_pairs writes judge-verified unfaithful twins; span-grounded doc-QA records each case's source offsets and can mix in refusal-bait unanswerables; simulate --export-cases turns persona transcripts into conversation cases. Mutations and template grids cost nothing. Every generator stamps provenance, routes through the dedupe gates, and reports its rejects. The generate CLI gained --mutate, --template/--axes/--sample, and --contrast/--no-verify.
What you’d see
multivon-eval generate --mutate / --template / --contrast produce robustness, coverage, and contrast cases from what you already have. Shipped in 0.13.0 (2026-06-13) with 73 new tests.
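The full-product mode of a template grid is easy to picture with itertools.product. The sketch below is not cases_from_template itself: the function name expand_template, the case dictionary shape, and the provenance fields are hypothetical stand-ins.

```python
from itertools import product


def expand_template(template, axes):
    """Full-product expansion of a str.format template over named axes.

    axes maps an axis name to its list of values (hypothetical shape).
    """
    names = sorted(axes)  # fixed axis order keeps the output deterministic
    cases = []
    for values in product(*(axes[name] for name in names)):
        binding = dict(zip(names, values))
        cases.append({
            "input": template.format(**binding),
            # Record how the case was made so it can be traced back later.
            "provenance": {"generator": "template", "axes": binding},
        })
    return cases


grid = expand_template(
    "A {tone} customer asks about a {product}.",
    {"tone": ["polite", "angry"], "product": ["laptop", "phone", "tablet"]},
)
```

A full product grows multiplicatively with each axis (2 x 3 = 6 cases here), which is the usual motivation for a pairwise mode: it covers every pair of axis values with far fewer cases.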
Persona simulator — multivon-eval simulate
Shipped
Persona-driven adaptive multi-turn eval: a simulator LLM plays a configured user persona against your bot, adapting each turn to the previous reply, under a hard budget ceiling. Every transcript carries provenance and is labeled simulated, not real traffic, so the results stay separate from production metrics.
What you’d see
multivon-eval simulate runs a persona suite against your endpoint and emits scored multi-turn transcripts with provenance. Shipped in 0.12.0 (2026-06-12).
Scaled + gated case generation
Shipped
bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates, with a generation report that states exactly what survived: "generated 500, accepted 431." Rejected candidates are counted and named, not silently dropped.
What you’d see
bootstrap --n-seed-cases 500 emits the gated suite plus the acceptance report. Shipped in 0.12.0 (2026-06-12).
Scanner v4 + UNSCANNABLE tier
Shipped
Hardening on the staleness scanner: call sites the scanner cannot read now surface as an explicit UNSCANNABLE tier instead of disappearing from the report.
What you’d see
Staleness reports name what the scanner couldn't read instead of silently omitting it. Shipped in 0.11.1.
Prompt-drift staleness detection
Shipped
multivon-eval staleness diffs a live scan of every prompt call site against a committed prompt_baseline.json: CHANGED (with before/after fingerprints, bound cases, and a git diff pointer), REMOVED (always with the three-way caveat: feature removed / renamed+edited / moved beyond static reach), ADDED (new prompts with no covering cases), UNKNOWN (dynamic prompts, never guessed at). staleness baseline writes the snapshot; staleness stamp binds hand-written cases to the prompt call sites they exercise; bootstrap stamps generated cases and writes the baseline automatically. Matching is content-first: line numbers and git SHAs are display-only, so a whitespace refactor or rebase produces zero false staleness. Every report opens with a determinacy headline and closes with a blind-spots footer. Deliberately deferred: propose-and-review case refresh (sync) and the eval-action CI gate, tracked below on the epic.
What you’d see
Run multivon-eval staleness . in a bootstrapped repo and see exactly which prompts changed since your cases were authored. --fail-on changed,removed turns it into a CI gate; --format markdown drops into $GITHUB_STEP_SUMMARY. Shipped in 0.10.0 with 51 new tests.
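Content-first matching can be illustrated with a whitespace-insensitive fingerprint. This is a sketch of the general idea only: the normalization rule, the SHA-256 choice, the 16-character truncation, and the name prompt_fingerprint are assumptions, not the scanner's actual scheme.

```python
import hashlib
import re


def prompt_fingerprint(prompt_text):
    """Hash of the prompt content with whitespace runs collapsed.

    Line numbers and commit SHAs are deliberately not inputs, so moving or
    re-indenting a prompt leaves the fingerprint unchanged.
    """
    normalized = re.sub(r"\s+", " ", prompt_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


original = "You are a helpful assistant.\nAnswer briefly."
reindented = "You are a helpful assistant.\n        Answer   briefly."
edited = "You are a helpful assistant.\nAnswer at length."

unchanged = prompt_fingerprint(original) == prompt_fingerprint(reindented)
changed = prompt_fingerprint(original) != prompt_fingerprint(edited)
```

Keying the comparison on content rather than location is what lets a rebase or a formatter run pass without reporting drift, while a one-word prompt edit still registers.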
Scanner v3 + the determinacy gate — run, failed, published
Shipped
Before claiming drift coverage, we measured how much real-world prompt traffic static analysis can actually read. Scanner v2's first pass reported zero call sites on 4 of 5 real repos: it was blind to aliased litellm imports (pr-agent), **kwargs-unpacked calls (aider), and messages=<variable>. v3 detects all three; what it still can't read surfaces as honest UNKNOWN records instead of vanishing. Re-measured: 278 call sites across aider, gpt-researcher, open-interpreter, letta, and pr-agent, of which 20.9% are statically resolvable, below the 50% gate. The gate failed, the result is published with the per-repo table on the epic, and the runtime recorder was promoted to priority.
What you’d see
Every staleness report opens with the determinacy headline — your repo's exact static-resolvability ratio. Baselines written by v2 print a rescan warning instead of fake drift. Shipped in 0.10.1.
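The gate arithmetic is simple enough to sketch. The function below is hypothetical, and the headline format is invented for illustration. The count of 58 resolvable sites is inferred rather than quoted: it is the only whole number that rounds to 20.9% of 278.

```python
def determinacy(resolvable, total, gate=0.50):
    """Static-resolvability ratio checked against a pass threshold."""
    if total <= 0:
        raise ValueError("total call sites must be positive")
    ratio = resolvable / total
    return {
        "ratio": ratio,
        # Illustrative headline text, not the tool's actual wording.
        "headline": f"{ratio:.1%} statically resolvable ({resolvable}/{total})",
        "passed": ratio >= gate,
    }


# 58 is an inferred count (see the lead-in); 278 and the 50% gate are from the text.
result = determinacy(58, 278)
```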
install-skills CLI
Shipped
One-command installer for the three bundled Claude Code skills. Symlinks eval-bootstrap, eval-audit, and eval-explain from the wheel into ~/.claude/skills/, so pip install -U multivon-eval propagates SKILL.md edits without re-running install.
What you’d see
Run multivon-eval install-skills once. The three skills become callable in any Claude Code session as /eval-bootstrap, /eval-audit, /eval-explain. Shipped in 0.9.8.
Cross-distribution held-out F1 (Benchmark 4)
Shipped
Hallucination evaluator calibrated on HaluEval-QA (threshold 0.55, explicit JudgeConfig), tested without re-tuning on HaluEval-Sum. Calibration set strictly disjoint from test set. Result: F1 0.830 [0.70–0.92] on n=60.
What you’d see
/eval shows F1 0.830 [0.70–0.92] with the calibration provenance disclosed. Reproducer is benchmarks/run_truly_held_out.py. Shipped in 0.9.5 → 0.9.7.
Wilson + bootstrap CIs on every published number
Shipped
benchmarks/_add_cis.py walks every results JSON and writes Wilson CIs on precision/recall plus bootstrap CIs (1000 resamples, seed 20260603) on F1. Idempotent. Closes the "framework preaches CIs but doesn't ship them on its own numbers" dogfood violation.
What you’d see
Every F1 on the leaderboard, /eval tile, and benchmarks/README carries its 95% bootstrap CI. Shipped in 0.9.4.
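For readers who want to see what the two interval types compute, here is a minimal stdlib sketch of a Wilson score interval and a percentile bootstrap on F1. It is not benchmarks/_add_cis.py: the function names and the (label, prediction) pair format are hypothetical, and only the 1000 resamples and the seed 20260603 are taken from the text above.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def f1_score(pairs):
    """F1 over (label, prediction) boolean pairs."""
    tp = sum(1 for y, pred in pairs if y and pred)
    fp = sum(1 for y, pred in pairs if not y and pred)
    fn = sum(1 for y, pred in pairs if y and not pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(pairs, resamples=1000, seed=20260603, alpha=0.05):
    """Percentile bootstrap CI on F1; a fixed seed makes reruns identical."""
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )
    lower = scores[round(alpha / 2 * resamples)]
    upper = scores[round((1 - alpha / 2) * resamples) - 1]
    return (lower, upper)
```

Wilson suits precision and recall because each is a simple proportion; F1 is a ratio of counts with no closed-form interval, so resampling is the usual choice. Seeding the generator is what makes a rewrite of the results files reproducible from run to run.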
Calibration provenance — zero null F1
Shipped
18 calibration entries across 6 judges × 3 evaluators in _calibration_data/v2.json. Six previously-null F1 cells (opus, gpt-4o, gpt-5.5 across faithfulness/hallucination/relevance) filled via a $15–20 sweep on real held-out data.
What you’d see
No more silent 0.7 fallback. calibrated_threshold(evaluator, judge) returns a calibrated value with a recorded F1 for every shipped (judge × evaluator) pair. Shipped in 0.9.4.
Three Claude Code skills bundled in the wheel
Shipped
eval-bootstrap (cold-start eval generator), eval-audit (suite review), and eval-explain (judge-output interpreter). SKILL.md files ship inside the wheel under multivon_eval/_skills/. No separate marketplace install needed.
What you’d see
After pip install multivon-eval + multivon-eval install-skills, three new slash commands work in Claude Code immediately. Shipped in 0.9.4.
Self-correction audit trail (0.9.4 → 0.9.7)
Shipped
Four same-day PyPI releases responding to public peer review: contamination fix on the headline held-out claim, runtime bug in the generated bootstrap template, threshold-vs-default mismatch in the held-out reproducer. All four releases left published as the audit trail.
What you’d see
pypi.org/project/multivon-eval/ shows the release sequence. CHANGELOG documents what each release fixed and which reviewer flagged it.
Phase 1 prompt attribution — descriptive diff
Shipped
AST-aware scan of prompt sources in a repo, structured diff between two refs, markdown rendering. Public API: multivon_eval.attribution.scan(repo_root), diff_records(base, head), render_markdown(diffs). Descriptive only: causal attribution is intentionally deferred to Phase 2. Scanner v2 (0.10.0) added one-hop module-level constant resolution and loose fingerprints; this scan is now the substrate the staleness drift report runs on.
What you’d see
multivon-eval attribution scan <repo> and multivon-eval attribution diff <base> <head> work today. JSON, text, and markdown output. Commit b43b98c on multivon-eval main.