Latest stable: v0.15.1Roadmap →

Changelog

Recent multivon-eval releases. Full notes on GitHub.

v0.15.1·June 15, 2026

multivon-eval 0.15.1 — view --dir Python 3.10/3.11 fix

Fixed

view --dir's index renderer used f-strings with quotes/backslashes inside the {} expression — valid on Python 3.12+ (PEP 701) but a SyntaxError on 3.10/3.11, so import multivon_eval.dirview (and therefore the view command) broke on the lower half of the supported range. The package's own minimum is 3.10. Caught by a fresh-install verification pass: it slipped through local testing on 3.14 and turned the CI matrix red on 3.10/3.11. The nested markup is now a module constant, keeping the f-strings expression-only. The rest of the package was already clean on 3.11 (verified across all modules).

Full release notes →

v0.15.0·June 15, 2026

multivon-eval 0.15.0 — view --dir report browser

multivon-eval view --dir ([#15](https://github.com/multivon-ai/multivon-eval/issues/15)) — designed by a deliberation panel that rejected the framing ("studio") and landed on extending the existing view command instead. A local report browser for a folder of runs, read-only and stateless, on the same stdlib http.server harness view already uses — zero new dependencies, fully offline.

Added

INDEX (view --dir runs/) — a sortable table of every eval-report JSON in a directory: suite, model, when, n, pass rate with a Wilson CI bar, error/flaky badges, cost. A positive structural validator decides what's a report (requires the real {summary.pass_rate, cases[]} shape) — junk JSON collapses into one "k files skipped" line rather than being parsed as an empty report. error-rate >= 10% is flagged. --recursive opt-in (off by default).

OPEN (/r/<idx>) — the existing EvalReport.to_html() served verbatim with a breadcrumb; no renderer fork.

DIFF (/diff?a=&b=) — wraps report_a.compare(report_b): pass-rate / avg-score deltas, McNemar p with a significance label, and four buckets (Regressed open + colored, Fixed, Still failing, Unchanged). Regressed rows stack both runs' judge reasons (looked up by case_input) as prose, so you see exactly why a verdict flipped.

Single-file view <report.json> is unchanged; view <dir> and view --dir <dir> both enter directory mode.

Full release notes →

v0.14.0·June 15, 2026

multivon-eval 0.14.0 — input-quality gate

The input-quality gate ([#14](https://github.com/multivon-ai/multivon-eval/issues/14)) — designed by a multi-LLM deliberation panel (5 proposers, 3 critics, a synthesizer; the panel killed an over-engineered 12-signal version and every weak-NLP proxy). "Honest UNKNOWN over confident wrong" applied to the input side: make garbage-in loud and auditable before any generation spend, instead of silently producing a confident-looking suite from inputs that can't support one.

Added

assess_input(source, kind) + multivon-eval assess — a free, deterministic preflight with four signals, every one reusing already-trusted machinery (zero new dependencies): trace count, per-field completeness, near-duplicate ratio (token-Jaccard, reservoir-capped so big dumps can't hang), and PII/secret density. No scalar 0-100 score — the vanity metric the gate exists to prevent.

WARN-by-default, never a silent pass and never a silent block. PROCEED is silent (invisible when input is fine); WARN prints a determinacy headline whose denominator counts every defined signal (2 of 4 signals flagged), one line per flag, and a blind-spots footer naming what was not checked. There is no hard REFUSE in this version: a WARN can't break a CI. The standalone assess exits 1 on WARN so scripts can detect it; the inline preflight never changes bootstrap's exit code.

Runs as a preflight inside bootstrap (on the traces it already loaded, before the first paid call) and the generate CLI path. --skip-input-gate disables it but still prints one stderr line, so suppression is never truly silent.

The field-completeness signal surfaces a previously-silent bug: zero-output traces make calibration early-return uncalibrated thresholds — the gate now says so out loud.

Changed

The n<20 calibration threshold is now a single shared constant (discover.CALIBRATION_MIN_TRACES) imported by both the calibration warning and the gate, so the two can't drift.

Full release notes →

v0.13.0·June 12, 2026

multivon-eval 0.13.0 — generation toolkit

The generation toolkit ([#13](https://github.com/multivon-ai/multivon-eval/issues/13)): five ways to generate eval data, two of them free. Every generator stamps provenance, routes through the dedupe gates, and reports its rejects.

Added

mutate_cases — deterministic, zero-LLM robustness mutations (typo/whitespace/case noise, unicode confusables, punctuation strip, conservative negation flip). Each mutant records its transformation and expectation: invariant, or flip with the old label cleared rather than silently kept. Byte-deterministic per seed.

cases_from_template — parametric grids over named axes; full product (capped 2000) or greedy pairwise covering array. Rows without an expected output are valid for judge evaluators; no label is invented.

generate_contrast_pairs — a minimally-edited unfaithful twin per case, judge-verified to actually flip before acceptance; rejected twins counted. Twins share a pair_id for genuinely paired comparisons.

Span-grounded doc-QA — generate_from_text/from_file record each case's source span (offsets + chunk hash); unanswerable_fraction generates refusal-bait questions whose expected behavior is refusal. Unlocatable spans recorded as None and counted.

simulate --export-cases / results_to_cases — persona transcripts become conversation EvalCases; empty transcripts skipped and counted.

CLI: multivon-eval generate gains --mutate, --template/--axes/--sample, --contrast/--no-verify, --unanswerable-fraction, --per-case, --seed.

Notes

73 new tests. Mutation batches dedupe on exact input identity rather than the Jaccard gate (mutants are near-duplicates by construction — the gate would reject the suite it exists to create).

Full release notes →

v0.12.3·June 12, 2026

multivon-eval 0.12.3 — release-readiness campaign fixes

Fixed

phone_intl detects real-world spaced international numbers. The

regex allowed only one separator after the country code, so the docs' own

example ("+44 7911 123456") scored "No PII detected" — caught by the

release-readiness re-verification pass against the published 0.12.2 wheel.

Now: 6-14 digits after the country code, each optionally separated;

arithmetic/version strings still not flagged (test-pinned both ways).

_Also includes 0.12.1–0.12.2 — see CHANGELOG.md._

Full release notes →

v0.12.0·June 12, 2026

multivon-eval 0.12.0 — persona simulator + scaled gated generation

Two features adapted from the synthetic-eval-data space (issues #10/#11) — with the

part vendors leave out: validation, provenance, and labeled uncertainty.

Added

multivon-eval simulate — persona-driven adaptive multi-turn evaluation (#10).

Static scripts assume a fixed conversation path; the simulator drives one live: a

persona LLM (profile, goal, success criteria, behavior traits) converses with your

model_fn, adapting each turn, stopping on goal-reached / refusal / max_turns /

budget. Transcripts become conversation-shaped EvalCases scored by the existing

conversation evaluators plus a goal-completion judge. Personas come from a JSONL or

propose_personas() (one LLM call, always includes an adversarial persona).

Honesty contract, test-pinned: every output carries "simulated personas — measures

behavior under synthetic users, not real traffic"; hard budget_usd ceiling with

pre-spend estimate (personas cut off carry stop_reason="budget_exceeded", partials

never lost); judge model/temperature recorded, NO determinism claim. Recorder

synergy: each conversation binds its case_uid, so --record-prompts during

simulation yields observed case→site bindings — simulation with provenance.

Scaled + gated case generation (#11). bootstrap --n-seed-cases now works to

500 (batched ≤30/call, later batches steered away from already-accepted inputs),

and every generated case passes gates: well-formed (structural), duplicate

(NFC-loose-normalize identity OR token-Jaccard ≥ 0.85, cross-batch), and — behind

--validate-cases --baseline-model-file — the 0.8.0 hardness band via

auto.validate_adversarial_cases. No silent caps: BootstrapResult.generation_report

and a DISCOVERY_REPORT "Case generation" section print "generated N, accepted M —

dropped k duplicates, j malformed[, i outside hardness band]"; a skipped hardness

gate says so. Per-case metadata["generation"] carries batch, gates passed, and

hardness. New --budget-usd pre-spend ceiling (estimate checked before any LLM call).

Notes

54 new tests (25 simulator, 29 generation/gates); 1092 green on the tracked suite.

Both features verified live before release: a haiku-driven persona reached its goal

and scored 1.00/0.67/1.00 on the conversation evaluators; a 60-case bootstrap

produced the batch-accounting report end-to-end.

Full release notes →

v0.11.1·June 12, 2026

multivon-eval 0.11.1 — robustness hardening: honest UNKNOWN over confident wrong

Robustness hardening from an adversarial audit that ran the staleness /

provenance / scanner / bootstrap surface against malformed inputs, symlink

tricks, unicode edge cases, and concurrent writers. The theme: every failure

the audit found was a place where the tool either crashed with a raw

traceback or — worse — reported something *false*. Both violate the same

contract: honest UNKNOWN over confident wrong.

Fixed

A syntax-broken file no longer reads as REMOVED. The scanner silently

returned zero records for files it couldn't parse (syntax errors, non-UTF8

encodings), so staleness reported every baselined site in them as REMOVED —

and --fail-on removed failed CI with a misleading verdict. Unscannable

files now surface as a distinct UNSCANNABLE tier ("file exists but could

not be parsed — verdict unknown, NOT removed"), a warning line names each

file with its reason in all three renderers, JSON gains skipped_files,

and --fail-on removed no longer trips. Skipped files are a report-time

concept — never written into baselines.

Symlinks resolving outside the repo root are skipped, not recorded —

previously they wrote machine-specific absolute paths into the baseline,

producing false REMOVED+ADDED churn on every other checkout.

Fingerprints are NFC-normalized (SCANNER_VERSION 3 → 4) — composed

vs decomposed unicode ("é" as one codepoint vs e+combining-accent) is an

editor/OS artifact, not a prompt change; it previously fingerprinted as

drift. Old baselines print the standing "rescan recommended" warning.

match-statement capture patterns disqualify module constants —

case PROMPT: rebinds via a str field the scanner didn't see, letting a

rebound constant read as static (a false "static" poisons every verdict).

Clean errors instead of tracebacks: staleness stamp on malformed

JSONL (file:line in the message), staleness baseline on a nonexistent

path or missing --out dir, bootstrap on a malformed traces file, and

--site …#xx with a non-integer position — all exit 2 with actionable

messages. multivon-eval … | head no longer dumps a BrokenPipeError.

attribution scan /typo/path exits 2 instead of a green "No SDK

prompt call sites found" — a typo'd CI path looked permanently passing.

The documented 10K trace cap is now enforced with a loud truncation

warning, and a malformed *final* trace line (the normal shape of an

interrupted streamed dump) skips with a warning while malformed interior

lines stay a hard error.

Bootstrap artifacts are emitted atomically (temp dir + rename) — a

Ctrl-C mid-emission can no longer leave a half-written eval_suite.py

that looks complete. schema_version: true no longer passes the int

check (bool ⊂ int).

34 new tests across the touched surface; 1038 green.

Full release notes →

v0.11.0·June 12, 2026

multivon-eval 0.11.0 — runtime prompt recorder

The answer to the 20.9% ceiling. The determinacy gate (0.10.1) measured scanner v3 against five real repos: 20.9% of call sites are statically resolvable — the rest build prompts dynamically and are statically unbridgeable *by construction*. The runtime prompt recorder (designed in [#9](https://github.com/multivon-ai/multivon-eval/issues/9), promoted from the 0.10.0 deferred list by the gate result) is the honest path past it: during an eval run, an opt-in interceptor records the rendered prompt text per call site, fingerprinted with the same fingerprint_text the static scanner uses. A **kwargs unpack the scanner can only report as UNKNOWN is, at call time, real kwargs with real text.

The honesty discipline survives the new power — three labeled trust tiers, never collapsed:

1. static — the scan proves the prompt *text*;

2. runtime — recordings prove only the renderings *observed*, not all renderings (variable renderings per site are a fingerprint SET, and every verdict speaks in "current recordings matched k of N previously observed renderings" — a site is never called fresh because one rendering matched);

3. templates / external prompts — deferred, unverifiable.

Added

multivon_eval.recorder — opt-in runtime prompt recorder. record_prompts() context manager (non-pytest) or pytest --record-prompts (plus --record-prompts-out, --record-text). Method-level wrapping of exactly the three SDK surfaces the static scanner knows — anthropic Messages.create, openai chat.completions.create, litellm.completion/acompletion — save original, wrap, restore byte-identical on exit (inherited attributes restored by delattr, so __dict__s end exactly as found). Zero overhead when off: importing multivon_eval performs NO patching, pinned by a fresh-interpreter subprocess test. Recordings stay local in prompt_recordings.jsonl; fingerprints only by default, rendered text only behind explicit --record-text. Append-safe storage: duplicate (anchor, role, fingerprint) keys merge counts/case_uids on write, atomic rewrite.

Case binding by observation — a contextvar carries the active case_uid; EvalSuite binds it per case from _provenance.case_uid (one None-check when recording is off) and the pytest plugin binds the test nodeid per test. Recordings carry the case_uids observed per (anchor, role, fingerprint) — the run *knows* which sites fired for which case.

multivon-eval staleness baseline --merge-recordings [FILE] — merges recordings into prompt_baseline.json as source:"runtime" records with fingerprint SETS, stored under a separate runtime_records key. Merge-only: never rescans, NEVER touches static records; a static rescan never discards the runtime tier; re-merging the same recordings file is idempotent.

OBSERVED report tier — runtime-sourced sites render distinctly in text/json/markdown: compared recordings-vs-recordings (runtime-only sites *cannot* be compared against a static scan, and the report says so), always in the k/N language. The determinacy headline gains a third clause: "K sites observed at runtime." The standing footer now states all three trust tiers verbatim, next to the blind-spots list.

multivon-eval staleness stamp --from-recordings [FILE] — prints observed case→site bindings as proposals (case_uid → anchor + fingerprint with observation counts); writes only with explicit --apply --cases F.jsonl (targets land as source:"runtime", bound:"observed"). Observation removes the fabrication objection that blocked auto-binding in the 0.10.0 adversarial review — the human confirmation stays. Runtime-bound targets are verified against recordings, never against the static scan (reported unverifiable [runtime] there, by rule), and never enter the static coverage denominator.

Fixed

add_check QAG question generation no longer invents stricter sub-requirements. "Response should mention the return policy" generated questions about return *procedures* and *eligibility* the criterion never asked for, scoring a plainly-correct answer 0.33 FAIL (reproduced 3/3 trials). The generation prompt now requires every question be answerable "Yes" by any response satisfying the criterion as stated; the same answer now scores 1.0 (and the evasive control still fails). Found by a fresh-user E2E audit on the quickstart's own example.

Keyless demo picks an Ollama model you actually have — python -m multivon_eval used hardcoded llama3 and reported "detected but unreachable" when the *server* was fine and the *model* wasn't pulled. It now asks /api/tags for an installed model (text models preferred), honors DEMO_MODEL, and the failure message distinguishes "model not available" from "server unreachable".

Bootstrap creates the output dir before any paid LLM call — a typo'd or read-only --output previously failed *after* ~$0.12 of judge spend. Progress lines now print to stderr as eac

Full release notes →

v0.10.1·June 12, 2026

multivon-eval 0.10.1 — scanner v3 + the published 20.9% gate failure

Scanner v3 — the determinacy gate (spec test-plan #14) run against five real repos (aider, gpt-researcher, open-interpreter, letta, pr-agent) found that 4 of 5 reported zero call sites: not because they have no prompts, but because the scanner was silently blind to how real code calls LLMs. v3 fixes detection and honestly reports what it still cannot read.

Fixed

Aliased litellm imports detected — from litellm import acompletion then bare acompletion(...) (pr-agent's shape) now matches. Star imports and function-local imports stay out of scope.

kwargs-unpacked calls surface as UNKNOWN — litellm.completion(kwargs) (aider's shape) now emits an honest <dynamic:KwargsUnpack> record instead of vanishing.

messages=<variable> surfaces as UNKNOWN — the most common real-world shape (messages list built elsewhere) now emits one dynamic record per call site instead of nothing. A literal empty messages=[] correctly emits nothing (statically known empty).

SCANNER_VERSION bumped to 3; baselines written by v2 print a "rescan recommended" warning instead of fake drift.

Measured (the determinacy gate, public on the epic)

Honest detection changed the denominator: 73 → 278 sites across the five repos, and static resolvability is 20.9% — below the 50% gate. Conclusion recorded publicly: real-world prompt traffic is mostly dynamic construction; static analysis tracks call-site add/remove for all of it but can verify text drift only for prompts-as-constants codebases (letta-style: 58 static sites). The runtime recorder (epic) is now the priority path for the rest. The staleness report's determinacy headline makes this exact ratio visible per-repo — by design.

Full release notes →

v0.10.0·June 12, 2026

multivon-eval 0.10.0 — prompt-drift staleness + case provenance

Evals drift as code changes — this release ships the detection layer. Prompts evolve, eval suites go stale, and nobody notices until a regression sails through. 0.10.0 adds prompt-drift staleness detection: a committed baseline snapshot of every prompt call site in your repo, a read-only report that tells you exactly which prompts changed since your cases were authored, and an opt-in provenance layer binding cases to the prompts they exercise. The design went through a 3-design × 2-adversarial-critic review before a line was written; the design rule that survived every round: the tool never overclaims what static analysis can know. Every report opens with a determinacy headline ("N of M call sites statically resolvable") and closes with a standing blind-spots footer.

Added

multivon-eval staleness [PATH] — read-only drift report. Diffs a live attribution scan against the committed prompt_baseline.json: CHANGED (prompt text differs — with before/after fingerprints, bound cases, and a git diff pointer), REMOVED (always with the three-way caveat: feature removed / renamed+edited / moved beyond static reach), ADDED (new prompts with no covering cases), UNKNOWN (dynamic prompts — never guessed at). --format text|json|markdown, --fail-on changed,removed,added for CI (exit 0 report-only by default; markdown format drops straight into $GITHUB_STEP_SUMMARY).

multivon-eval staleness baseline [PATH] — writes/refreshes the baseline snapshot, printing the diff before writing. Bootstrap writes one automatically.

multivon-eval staleness stamp — binds hand-written JSONL cases to the prompt call sites they exercise (--site 'file.py::qualname.role'). Raw-line-preserving rewrite (never round-trips through load_jsonl, which would drop expected_tool_calls); idempotent restamps are byte-identical; refuses ambiguous sites instead of guessing.

multivon_eval.provenance — metadata["_provenance"] schema (case_uid, authored_at/stamped_at, git context, prompt-fingerprint targets) + a stamp() helper for Python-inline cases. Stamping never perturbs suite.lock (cases_hash excludes metadata by design) — pinned by a regression test.

Attribution scanner v2 — one-hop module-level constant resolution (SYSTEM_PROMPT = "..." then system=SYSTEM_PROMPT now resolves to real text instead of a dynamic placeholder; conditional/cross-module names honestly stay dynamic) + loose_fingerprint (whitespace-collapsed) so formatting-only prompt changes are labeled as such — flagged, never suppressed.

Bootstrap integration — --repo flag; generated cases are stamped authored_by="bootstrap" with the repo SHA (honest "authored against this state" provenance — bindings are never fabricated), and prompt_baseline.json is written alongside the suite.

Notes

Matching is content-first: line numbers and git SHAs are display-only, never matching inputs — a whitespace refactor or rebase produces zero false staleness.

Dynamic prompts gate FIRST: a prompt the scanner can't statically read is UNKNOWN forever rather than fake-fresh. The runtime recorder that closes this gap is tracked as future work.

51 new tests (staleness 27, provenance 24) + 26 extended attribution tests. 178 green across the touched surface; zero new failures elsewhere.

Full release notes →

v0.9.1·May 24, 2026

multivon-eval 0.9.1 — Anthropic reasoning-tier fix + vision module + ollama provider

Patch release driven by the pdfhell mini-v4 eval-pipeline post-mortem ([CORRECTION_NOTICE.md](https://github.com/multivon-ai/pdfhell/blob/main/pdfhell/research/CORRECTION_NOTICE.md)). The same Anthropic API change that silently broke every Opus 4-7 call in the pdfhell leaderboard would have broken any multivon-eval consumer using Opus 4-7 as a judge — fixing the underlying issue upstream closes the gap for everyone using these adapters.

Fixed

AnthropicAdapter omits temperature for reasoning-tier models. Anthropic's claude-opus-4-7 and the claude-opus-5+ family reject the parameter with a 400. The adapter detects them by name prefix and drops the field; older models still receive temperature unchanged. Same fix applied to multivon_eval.discover._call_judge. New helper AnthropicAdapter._supports_temperature() exposes the decision. 19 new unit tests cover the matrix.

Added

multivon_eval.vision module. Single call_vision(prompt, sources, judge, max_tokens) function. Providers: anthropic, openai, google, ollama. Per-provider content-block conversion (Anthropic document/image, OpenAI file/image_url, Google Part.from_bytes, Ollama images field). PDFs rasterise via pypdfium2 for ollama. Previously lived in pdfhell.vision; promoted so any multivon-eval consumer can grade images/PDFs without re-implementing per-provider plumbing.

ollama: is now a first-class JudgeConfig provider. JudgeConfig(provider="ollama", model="llama3.2") resolves to litellm's ollama/<model> driver internally. Matches the colon convention used by the rest of the SDK (anthropic:, openai:, google:). Both sync and async judge paths. OLLAMA_HOST env var sets the base URL (default http://127.0.0.1:11434).

Compatibility

No breaking changes. AnthropicAdapter constructor signature is unchanged — temperature is still accepted, just silently dropped for the reasoning tier. Existing pinned dependents (incl. pdfhell>=0.5.0) work without modification.

Tests: 19 new + 864 pre-existing passing. The two failing test_beginner_friendly.py tests are pre-existing model-name issues independent of this release.

Usage

from multivon_eval import JudgeConfig, call_vision

# Vision grading via any supported provider
judge = JudgeConfig(provider="anthropic", model="claude-opus-4-7")  # temperature omitted automatically
answer = call_vision("What is the total?", ["invoice.pdf"], judge)

# Local model judges via ollama
judge = JudgeConfig(provider="ollama", model="qwen2.5:7b")  # first-class, no provider="litellm" needed

Full changelog: [CHANGELOG.md](CHANGELOG.md)

Full release notes →

v0.9.0·May 22, 2026

v0.9.0 — field-report release

[0.9.0] — 2026-05-23

Field-report release. A fresh-user dogfood pass + a 5-app SDK deepdive surfaced a cluster of UX gaps and one parallel-execution regression. Most of the work is making the SDK behave the way its docs already promised.

Added

multivon-eval doctor — new pre-flight CLI that verifies Python version, API keys (Anthropic / OpenAI / Google), live provider pings, optional deps (presidio, opentelemetry, datasets), and ~/.multivon writability. Rich-rendered by default; --json for CI consumption. Exits 0 / 1 / 2 for ok / error / warn so CI gates can branch on the outcome.

multivon-eval bootstrap --validate — optional N-shot judge-noise filter on the generated seed cases via the existing validate_adversarial_cases primitive. Drops cases outside the (0.5, 1.0) hardness band against a stub-refusal baseline. Adds a hardness_report.jsonl alongside the filtered seed_cases.jsonl so a reviewer can audit what was thrown out.

Evaluator._skipped(reason) helper — base-class method that returns EvalResult(score=1.0, passed=True, reason="[skipped] ...", metadata={"skipped": True}). Used by every evaluator that previously returned score=0.0 when the case shape didn't fit (see Fixed below).

Skip propagation across the evaluator catalog. When the case shape doesn't fit an evaluator — no context for a RAG metric, no expected_output for an exact-match metric, no agent_trace for a tool metric — the evaluator now returns a skipped pass instead of a 0.0-score failure. Applies to Faithfulness, Hallucination, ContextPrecision, AnswerAccuracy, summarization checks, ToolCallAccuracy, ToolCallNecessity, ToolArgumentAccuracy, TrajectoryEfficiency, PlanQuality, StepFaithfulness, AgentMemoryEval, ExactMatch + 3 other deterministic, 2 text-metric evaluators, and all 4 conversation evaluators.

Refusal detection on Faithfulness + Hallucination. Short responses (<240 chars) starting with one of the known refusal prefixes ("I don't know", "I cannot", "Sorry", etc.) now skip these metrics — a correct refusal doesn't make substantive claims, so faithfulness is N/A.

ToolCallAccuracy three-shape semantics. expected_tool_calls=None skips. expected_tool_calls=[] + no calls = PASS ("Correctly called no tools"); expected_tool_calls=[] + any tool = FAIL. expected_tool_calls=[...] requires agent_trace to evaluate (skips if absent).

Parallel execution by default. EvalSuite.run(workers=...) now auto-picks min(8, len(cases)) when no tracer is supplied (was 1). A 10-case × 6-evaluator RAG suite drops from ~167s serial to ~17s with workers=8.

PIIEvaluator rewritten for full standards coverage with checksum validation and per-pattern citations to source standards:

HIPAA Safe Harbor (45 CFR § 164.514(b)(2)) — all 18 identifier categories where regex is feasible (13/18). MRN widened to 4–15 digits. PERSON_NAME via high-precision context-led pattern catches "Patient John Smith" / "Mr. Doe" / "Dr. Wilson". age_over_89 detection. Stricter admission/discharge/death-date patterns.

GDPR (Reg EU 2016/679, Art.4(1)) — UK NI Number, NHS Number (Mod-11 validated), Spain DNI/NIE, Italy Codice Fiscale, France NIR, Germany Steuer-IdNr, Netherlands BSN, Poland PESEL, Sweden Personnummer, Denmark CPR, Ireland PPSN, Finland HETU, IBAN (Mod-97 validated per ISO 13616).

DPDP India (Act 22 of 2023) — Aadhaar with Verhoeff checksum, PAN with structural validation, GSTIN, IFSC, Voter ID (EPIC), +91 mobile, Indian Driving License, Indian Passport, Vehicle Registration, Ration Card.

CCPA (Cal. Civ. Code § 1798.140(o)) — context-anchored bank account, California Driver's License.

New strict=True (default) runs Luhn / Verhoeff / Mod-97 / Mod-11 / structural validators to drop false positives on transaction IDs and order numbers.

New use_ner=True lazy-imports presidio_analyzer for partial coverage of HIPAA categories regex can't reach (unprefixed names, free-form addresses). Silent no-op when Presidio isn't installed.

Bootstrap discovery report deterministic from final evaluator list. "Why this mix" prose is now generated from the committed EvaluatorRecommendation[] list, not a separate LLM prose pass. Eliminates the previous drift where the report said "we skip Hallucination" while the suite included it, or "add PIIEvaluator" while the suite omitted it. The LLM proposer's notes are moved to a clearly-labeled "Proposer notes (advisory)" footer. When traces contain PII but the suite omits PIIEvaluator, the report surfaces a ⚠ PII gap callout with the exact suite.add_evaluators(PIIEvaluator(...)) snippet to add.

Calibration N-warning. Bootstrap writes a stderr warning when n_traces < 20 explaining that p25 over a small sample has wide CIs and the resulting thresholds shouldn't be treated as authoritative.

EvalReport API reference docs page (docs/reference/eval-report.mdx). Every public attribute and method documented with type + one-line description, plus a "common gotchas" section for the cases vs case_results / summary JSON-vs-attr / passed_by_evaluator method-vs-attr drifts.

Fixed

CRITICAL: _run_parallel silently dropped all but the first case. Regression introduced when parallel-by-default landed: a single parent_ctx = contextvars.copy_context() was reused across every ThreadPoolExecutor submission. Per Python docs, a single Context cannot be entered concurrently — parent_ctx.run() raises RuntimeError when another thread is already inside it. Threads 1..N silently captured the error into CaseResult.actual_output ("[ERROR: cannot enter context: ... is already entered]") and the user saw all-but-the-first case appear to fail with empty results. Fixed by per-submission contextvars.copy_context().run() — each thread gets its own Context snapshot.

EvalGateFailure now inherits from Exception AND SystemExit. Previously a SystemExit-only base meant library users couldn't catch it with except Exception: — a common pattern in notebooks, test harnesses, and Jupyter. Dual-inherit keeps CI exit semantics (uncaught instances still exit non-zero cleanly without traceback noise) while making except Exception as e: work for budget gate handling.

Loud stderr warning when most cases hit a judge error. When judge_error >= max(2, total/2) after a run, suite.run() writes a block at end-of-run naming the first error verbatim. Catches the "I forgot pip install multivon-eval[google]" footgun — previously the user saw pass_rate=0%, cost=$0, calls=0 and assumed the model failed; now they see the JudgeUnavailable message at the top of the block.

Bootstrap eval_suite.py no longer truncates rationale into inline comments. Full rationale lives in the module docstring; inline comments carry only the tier tag.

Notes

Existing 0.8.x users with custom JudgeRetry policies, custom adapters, or downstream code calling report.assert_budget() are unaffected by the EvalGateFailure base-class change.

The --validate flag on bootstrap is opt-in; default behavior is unchanged.

The skip-propagation change affects aggregate pass-rate numbers on existing suites — cases that previously failed at 0.0 because the data shape didn't fit will now appear as a passing-but-skipped result. Use cr.results[i].metadata.get("skipped") to filter when analyzing.

App 1–5 of the SDK deepdive (~$0.05 of real API spend across Anthropic / OpenAI / Gemini judges) serve as the integration smoke tests for the changes above.

Full release notes →

v0.8.2·May 19, 2026

v0.8.2 — ContextRecall skip semantics

Second dogfood-driven patch. Fixes a UX footgun in EvalSuite.for_rag().

🐛 Fixed

ContextRecall now skips cleanly when expected_output is missing instead of returning a confusing 0.0 quality failure.

# Before (0.8.1 and earlier):
suite = EvalSuite.for_rag()  # auto-includes ContextRecall
suite.add_cases([EvalCase(input="Q", context="...")])  # no expected_output
report = suite.run(model)
# context_recall: 0.0  reason: "Requires both case.context and case.expected_output"
# pass_rate dragged down by a metric that couldn't actually evaluate

# After (0.8.2):
# context_recall: 1.0 (passed=True)  reason: "[skipped] Requires both ..."
# metadata.skipped = True so users can filter
# pass_rate reflects actual quality, not missing ground truth

Known issue (will be fixed in 0.9.0)

A similar "returns 0.0 when input shape doesn't match" pattern exists in ~20 other evaluators (AnswerAccuracy, ExactMatch, Contains, BLEU, ROUGE, agent evaluators when no agent_trace, conversation evaluators when no conversation). These will all get the same skip-semantics treatment in 0.9.0. For now, only ContextRecall is fixed because it's the only metric auto-included by a factory suite and was therefore the most-visible footgun in the headline RAG workflow.

🧪 Tests

3 new tests in tests/test_context_recall_skip.py

Full suite: 835 passed, 13 skipped (was 832/13 at 0.8.1).

Upgrade

pip install --upgrade multivon-eval

Full release notes →

v0.8.1·May 19, 2026

v0.8.1 — Context injection fix for RAG one-liners

Critical fix for the 0.8.0 RAG one-liner UX. Surfaced by a comprehensive dogfood pass and verified end-to-end before HN launch.

🐛 Fixed

run_with_anthropic / run_with_openai / run_with_litellm now auto-inject EvalCase.context into the system prompt.

Pre-0.8.1, every RAG case run via these one-line helpers silently dropped its context — Claude/GPT got the question with no grounding, faithfulness/hallucination evaluators scored 0/N against the empty-context reality, and users had no signal the helper wasn't doing what its name implied.

# This now works as you'd expect — Claude gets the context.
suite = EvalSuite.for_rag()
suite.add_cases([
    EvalCase(
        input="What is the refund window?",
        context="Refunds within 30 days of purchase.",
    ),
])
report = suite.run_with_anthropic("claude-haiku-4-5-20251001")
# faithfulness now scores 1.0 — pre-0.8.1 it scored 0.0

How the fix works

The adapter contract is extended with an optional _call_with_case(case) method. The suite uses it when present, falls back to the existing model_fn(case.input) path for plain callables — so existing custom adapters are unaffected. Built-in AnthropicAdapter / OpenAIAdapter / LiteLLMAdapter now implement _call_with_case and auto-inject case.context into the system prompt along with a Use ONLY this context to answer RAG prefix.

List-valued context (multiple retrieved chunks) gets [chunk i] markers so the model can see the boundaries between sources.

Same fix applied to the async path via _acall_with_case (used by run_async).

🧪 Tests

14 new tests in tests/test_adapter_context_injection.py cover: _format_context_block helper, AnthropicAdapter / OpenAIAdapter context injection, system-prompt composition with both user-supplied and RAG prefixes, list-valued context, suite routing to _call_with_case when available + fallback for plain callables.

Full suite: 832 passed, 13 skipped (was 818/13 at 0.8.0).

Upgrade

pip install --upgrade multivon-eval

0.8.0 users with RAG cases using run_with_anthropic / run_with_openai should upgrade — the score difference is dramatic and the API is unchanged.

Full release notes →

v0.8.0·May 19, 2026

v0.8.0 — Intelligent-eval bootstrap + auto module

The intelligent-eval release. Two new public surfaces address the "I don't know what to eval" cold-start bottleneck for teams shipping LLM products.

✨ multivon-eval bootstrap CLI

Takes a product description + sample traces, emits a tuned EvalSuite in <60 seconds.

multivon-eval bootstrap \
  --product product.md \
  --traces traces.jsonl \
  --output ./eval-bootstrap/

Four artifacts land in the output dir:

eval_suite.py — runnable suite with 4-6 evaluators picked for your product shape

seed_cases.jsonl — 30 adversarial seed cases targeting the primary failure mode

thresholds.yaml — calibrated from your traces at p25 of baseline scores

DISCOVERY_REPORT.md — a forwardable eval design review for your team

Single Claude Haiku call for metric proposal; deterministic auto_evaluators heuristic runs as the safety net; threshold calibration on a capped trace sample. Cost ≈$0.12 per bootstrap, hard ceiling $0.15.

PII redaction (high-confidence local scan for AWS / Anthropic / OpenAI / GitHub keys, SSN, Luhn-valid credit cards, email) runs before any LLM call. Three policies via --pii-policy: redact (default), strict (abort on detection), allow (raw, with explicit confirmation).

🧠 multivon_eval.auto module

auto_evaluators(case) — pure-heuristic evaluator suggester. Pass an EvalCase, get back a ranked evaluator list with primary / secondary / guardrail tiers and confidence. Zero-cost.

generate_adversarial_cases — LLM-generates cases for 10 failure modes: ungrounded_claim, jailbreak, prompt_injection_direct/indirect, tool_injection, pii_leakage_invitation, tool_misuse, numeric_edge, off_topic, format_violation.

generate_unicode_obfuscation_cases — deterministic homoglyph / zero-width / RTL-override transforms. No LLM call.

validate_adversarial_cases — N-shot judge-noise filter. Validated +0.80 mean failure-rate separation between weak (always-confabulate) and strong (always-refuse) baselines on ungrounded_claim cases with real Claude Haiku judge — judge noise correctly filtered out at the per-shot level.

🧪 Tests

33 new tests for the bootstrap pipeline (test_discover.py)

19 tests for N-shot validation (test_auto_validate_adversarial.py)

19 tests for auto_evaluators heuristic (test_auto_evaluators.py)

7 tests for Unicode obfuscation (test_auto_unicode_obfuscation.py)

Full suite: 818 passed, 13 skipped (was 745/12 at 0.7.3)

🐛 Fixes

rag_eval.ipynb — corrected stale Experiment.add_run → exp.record(report, run_id=...) and suite.prepare() → per-evaluator ev.prepare(). Switched private accessors to public to match the quickstart notebook style.

📚 Full changelog

See [CHANGELOG.md](https://github.com/multivon-ai/multivon-eval/blob/main/CHANGELOG.md) for the complete entry.

Full release notes →

v0.7.8·May 19, 2026

v0.7.8 — Critical fix: 0.7.7 CLI was broken

Fixed

Orphan sys.exit(1) at end of cli.py — broke 0.7.7 CLI entirely. A leftover top-level sys.exit(1) from a refactor caused every CLI invocation in 0.7.7 to exit 1 after the actual command's success. 0.7.7 users should upgrade immediately.

Full release notes →

v0.7.7·May 19, 2026

v0.7.7 — multivon-eval discover (capability catalog)

Added

multivon-eval discover — emit a machine-readable capability catalog (evaluators, jurisdictions, judges, factory suites) as JSON to stdout. Same shape as multivon-mcp's eval_discover tool, surfaced for agents that don't speak MCP (or shell scripts, or CI gates).

multivon-eval discover | jq '.evaluators[] | select(.category == "rag")'

Pair with --compact to flatten to a single line for piping into jq -c.

Full release notes →

v0.7.6·May 19, 2026

v0.7.6 — RAG starter judge-noise tolerance

Fixed

init -t rag starter passes its happy path with the calibration warnings introduced in 0.7.3. The default thresholds were tuned slightly to absorb the small amount of judge noise on the canned cases.

Full release notes →

v0.7.5·May 19, 2026

v0.7.5 — save_json fix + scaffolder dedup

Fixed

EvalReport.save_json() auto-creates the parent directory. Previously failed with FileNotFoundError if the target directory didn't exist; now mkdir(parents=True, exist_ok=True) is called first.

Drop duplicate template writes in the init scaffolder — the same starter file was being written twice on some templates.

Full release notes →

v0.7.4·May 19, 2026

v0.7.4 — DPDP (India) compliance support

Added

DPDP (India) compliance support. The Digital Personal Data Protection Act (DPDP) jurisdiction is now first-class in PIIEvaluator(jurisdiction="dpdp"). Adds Aadhaar (12-digit + Verhoeff checksum), PAN (5+4+1 alpha-numeric), and India-specific patterns to the detection rules.

Full release notes →

Install with pip install multivon-eval or read the docs.