multivon-eval — the calibration-first eval framework.
Wilson + bootstrap CIs on every number, judge thresholds calibrated against human labels, and we refuse to claim wins where CIs overlap. 44 evaluators across seven categories — deterministic, LLM-judge (QAG), agent-trace, compliance, multimodal, conversation, and consistency — plus cost tracking, hash-chained audit logs, and a pytest plugin. Apache 2.0 on PyPI; no telemetry, no signup. Powers the public leaderboard at /leaderboard.
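The "no win where CIs overlap" rule can be pictured with a few lines of standard-library Python. This is a minimal sketch of a Wilson score interval plus an overlap check, not multivon-eval's own code; the function names `wilson_interval` and `intervals_overlap` and the pass counts are illustrative assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts: 86/100 for one system, 80/100 for another.
ours = wilson_interval(86, 100)
theirs = wilson_interval(80, 100)
verdict = "no claim" if intervals_overlap(ours, theirs) else "win"
print(ours, theirs, verdict)
```

A six-point gap on 100 cases looks decisive as a point estimate, but the two intervals overlap, so under this rule it would not be reported as a win.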
Bootstrap CLI
The fastest path: multivon-eval bootstrap
Hand over a one-paragraph product description and a few sample traces — the bootstrap CLI infers your product shape, picks 4-6 evaluators tuned to it, calibrates thresholds from your traces, and emits a runnable suite plus 30 adversarial seed cases. PII is redacted locally before any LLM call. Total cost: ~$0.12 per bootstrap (hard ceiling $0.15).
1. Install
2. Bootstrap from your product + traces
3. Review the discovery report, then run the generated suite
Outputs four files: eval_suite.py (runnable suite), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), and DISCOVERY_REPORT.md (a forwardable eval design review). Full bootstrap guide →
Or scaffold from a template
If you'd rather write your own cases from a runnable starter, the init scaffolder gives you a clean blueprint.
1. Scaffold a starter project — pick a template
2. Run it
Templates: quickstart, rag, agent, agent-langgraph, agent-openai-sdk, regulated, conversation. Each is a runnable starter — eval.py, requirements.txt, optional CI workflow.
Intelligent eval
Intelligent eval primitives: multivon_eval.auto
The bootstrap CLI is built on a programmatic API you can call directly when you want fine-grained control or want to compose pieces into your own pipeline. Each primitive is documented + tested independently.
01
auto_evaluators(case)
heuristic · 0 LLM cost · microseconds
Pass an EvalCase, get back a ranked list of recommended evaluators with primary / secondary / guardrail tiers and confidence. Pure pattern-match on case shape — RAG vs QA vs agent vs conversation. No LLM call.
02
generate_adversarial_cases(seed, mode, n)
LLM-generated · ~$0.02
Targets 10 named failure modes: ungrounded_claim, jailbreak, prompt_injection_direct / indirect, tool_injection, pii_leakage_invitation, tool_misuse, numeric_edge, off_topic, format_violation. Stress-test tags embedded in metadata so the right evaluator routes automatically.
03
validate_adversarial_cases(cases, baseline)
N-shot judge-noise filter
Runs each case N times against your baseline + the eval, computes per-case failure_rate, filters by hardness band. Catches generation noise; surfaces the cases that genuinely separate weak vs strong models. Validated +0.80 separation.
Full walkthrough in the bootstrap guide. The `multivon-eval bootstrap` CLI above composes all three primitives + PII redaction + threshold calibration into one command.
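The hardness-band idea behind validate_adversarial_cases can be sketched without any LLM: run each case N times, compute its failure rate, and keep only cases whose rate falls inside a band. The sketch below is an assumption-laden illustration, not the library's implementation: `failure_rate`, `filter_by_hardness`, the `run_once` callback, and the default band of 0.2 to 0.8 are all hypothetical.

```python
from typing import Callable


def failure_rate(run_once: Callable[[str], bool], case: str, n: int) -> float:
    """Run one case n times; run_once returns True when the case passes."""
    if n <= 0:
        raise ValueError("n must be positive")
    failures = sum(1 for _ in range(n) if not run_once(case))
    return failures / n


def filter_by_hardness(
    cases: list[str],
    run_once: Callable[[str], bool],
    n: int = 5,
    band: tuple[float, float] = (0.2, 0.8),  # assumed band, tune to taste
) -> list[tuple[str, float]]:
    """Keep cases whose failure rate lies inside the band (inclusive)."""
    lo, hi = band
    kept = []
    for case in cases:
        rate = failure_rate(run_once, case, n)
        if lo <= rate <= hi:
            kept.append((case, rate))
    return kept
```

Cases that never fail teach nothing, and cases that always fail are often generation noise; the band keeps the ones in between.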
New in 0.10.0
Evals drift as code changes: multivon-eval staleness
Prompts evolve, eval suites go stale, and nobody notices until a regression sails through. 0.10.0 ships the detection layer: a committed prompt_baseline.json snapshot of every prompt call site in your repo, a read-only report that tells you exactly which prompts changed since your cases were authored, and an opt-in provenance layer binding cases to the prompts they exercise. Matching is content-first: line numbers and git SHAs are display-only, never matching inputs, so a whitespace refactor or a rebase produces zero false staleness. Prompts the scanner can't statically read are reported UNKNOWN, never fake-fresh.
Write or refresh the committed baseline (bootstrap writes one automatically)
3. Bind hand-written cases to the prompt call sites they exercise
multivon-eval staleness · sample report
static scan only · no LLM call · no network
baseline: prompt_baseline.json (a1b2c3d, 9 days ago, scanner v2)
determinacy: 11 of 14 call sites statically resolvable; verdicts
below cover only those. 3 dynamic sites are unknown-by-construction.
CHANGED (2) — prompt text differs from baseline
extractors/invoice.py:42 anthropic.system in Extractor.extract
fp 3fa9c1d2… → 8be07a11…
bound cases: seed_cases.jsonl #4 #5 #9
what changed: git diff a1b2c3d..HEAD -- extractors/invoice.py
router/triage.py:17 anthropic.system in Router.route
[formatting-only — loose fingerprint unchanged]
REMOVED (1) — call site not found by static scan
summarize/digest.py openai.user in build_digest
note: feature removed, OR renamed+edited in one commit,
OR prompt moved beyond static reach.
ADDED since baseline (1)
rag/answer.py:61 anthropic.system in answer
→ no cases reference this prompt
UNKNOWN (3) — dynamic prompts; static scan cannot verify their text
blind spots: does not see OpenAI Responses API, positional message
args, prompts in files/templates/hub, non-Python services.
exit 0 (report-only — add --fail-on changed,removed in CI)
every report opens with the determinacy headline and closes with the blind-spots footer
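The `fp` lines and the "formatting-only" verdict in the sample report suggest two fingerprints per prompt: a strict one over the exact text and a loose one that ignores formatting. A minimal sketch of that idea follows, assuming SHA-256 and whitespace collapsing as the loose normalization. The scanner's real normalization rules are not described here, and `strict_fingerprint`, `loose_fingerprint`, and `classify` are hypothetical names.

```python
import hashlib
import re


def strict_fingerprint(prompt: str) -> str:
    """Hash of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def loose_fingerprint(prompt: str) -> str:
    """Hash after collapsing whitespace runs (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", prompt).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(baseline: str, current: str) -> str:
    """Compare two prompt texts by content only; no line numbers or SHAs."""
    if strict_fingerprint(baseline) == strict_fingerprint(current):
        return "unchanged"
    if loose_fingerprint(baseline) == loose_fingerprint(current):
        return "formatting-only"
    return "changed"


print(classify("Extract the  total.\n", "Extract the total."))
```

Because only the prompt text is hashed, moving a call site to another line or rebasing the branch leaves both fingerprints untouched.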
We measured this rather than assuming it. Scanner v3 (0.10.1) ran the determinacy gate against five real OSS repos (aider, gpt-researcher, open-interpreter, letta, pr-agent) and found 278 prompt call sites, of which 20.9% were statically resolvable. That failed our own 50% gate, and the result is published with the per-repo table on issue #4. The report's first line tells you your repo's exact ratio. The runtime recorder (issue #9) shipped in 0.11.0 as the path past the static ceiling: pytest --record-prompts captures runtime fingerprints that render as a separate OBSERVED tier. Recordings prove only the renderings that were observed, never all renderings.
--fail-on changed,removed turns selected categories into a CI gate (exit 0 report-only by default); --format markdown drops straight into $GITHUB_STEP_SUMMARY. REMOVED always carries the three-way caveat — it’s a prompt to investigate, never an auto-delete signal. Full design + deferred work on the staleness epic and the 0.10.0 changelog.
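The gating logic can be read as "exit non-zero only if a selected category is non-empty". The sketch below illustrates that reading; it is not the CLI's code. The `exit_code` helper is hypothetical, and returning 1 on failure is an assumption, since only the report-only exit 0 is stated above.

```python
def exit_code(counts: dict[str, int], fail_on: str = "") -> int:
    """Return 0 in report-only mode, or 1 if any gated category has findings.

    counts maps category name to number of findings; fail_on is a
    comma-separated list in the style of --fail-on changed,removed.
    """
    gated = {c.strip().lower() for c in fail_on.split(",") if c.strip()}
    return 1 if any(counts.get(cat, 0) > 0 for cat in gated) else 0


# Counts taken from the sample report above.
sample = {"changed": 2, "removed": 1, "added": 1, "unknown": 3}
print(exit_code(sample))                     # report-only
print(exit_code(sample, "changed,removed"))  # gated
```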
44 evaluators across 7 categories
Pick the category that fits your output. Deterministic for cheap, fast gates. QAG when you need an LLM judge but don't want vibes-based scoring. Agent for tool-use evaluation. Conversation for dialogue. Compliance + multimodal for regulated / document-AI workflows.
Deterministic
14 evaluators
Pure-Python checks that don't call an LLM. Cheap to run, cheap to gate CI on. Use as a first-pass filter or for outputs that have a single correct answer.
▸ ExactMatch, Contains, RegexMatch, StartsWith
▸ JSONSchemaEval (Pydantic validation)
▸ BLEU, ROUGE, BERTScore, Levenshtein, ChrfScore
▸ Latency, MaxLatency, WordCount, NotEmpty
LLM-judge (QAG)
13 evaluators
Question-Answer-Generation: instead of asking an LLM to rate 1–10 (noisy), generate yes/no questions about the output and score by fraction answered correctly. Calibrated thresholds per judge model.
▸ Faithfulness, Hallucination, Relevance
▸ AnswerAccuracy, Coherence
▸ ContextPrecision, ContextRecall
▸ Toxicity, Bias, Summarization
▸ CustomRubric, GEval, CheckEvaluator
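The QAG scoring step reduces to a fraction and a threshold. The sketch below covers only that arithmetic: question generation and answering, which an LLM judge would perform, are replaced by hand-written verdicts. `Verdict`, `qag_score`, `passes`, and the 0.7 threshold are illustrative assumptions, not the library's API or calibrated values.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    question: str   # yes/no question generated about the output
    expected: bool  # the answer a correct output would support
    answered: bool  # the answer the judge actually gave


def qag_score(verdicts: list[Verdict]) -> float:
    """Fraction of yes/no questions answered as expected."""
    if not verdicts:
        raise ValueError("need at least one question")
    correct = sum(1 for v in verdicts if v.expected == v.answered)
    return correct / len(verdicts)


def passes(score: float, threshold: float) -> bool:
    return score >= threshold


verdicts = [
    Verdict("Does the answer say the agreement renews annually?", True, True),
    Verdict("Does the answer mention termination?", True, True),
    Verdict("Does the answer cite a renewal fee?", False, True),
    Verdict("Does the answer stay within the context?", True, True),
]
score = qag_score(verdicts)
print(score, passes(score, threshold=0.7))  # threshold is a placeholder
```

Compared with a 1–10 rating, each question is a binary judgment, so judge noise shows up as individual flipped answers and not as a drifting scale.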
Agent-trace
8 evaluators
Evaluate the trajectory, not just the final answer. Pairs with LangGraph + OpenAI Agents SDK tracers. Did the agent call the right tools, in the right order, with the right arguments?
Conversation
4 evaluators
Multi-turn dialogue evaluators. Useful for support chatbots and any system where context accumulates across turns.
▸ ConversationRelevance, ConversationCompleteness
▸ KnowledgeRetention, TurnConsistency
Compliance
2 evaluators
Audit-grade evaluators that run locally — no LLM judge needed. PII detection with HIPAA, GDPR, CCPA, and DPDP India jurisdictions. JSON Schema validation for structured outputs.
▸ PIIEvaluator (regex, zero egress)
▸ SchemaEvaluator (Pydantic / JSON Schema)
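Regex-based PII detection of the kind PIIEvaluator is described as doing can be sketched with the standard library alone, so no text leaves the process. The patterns below are illustrative format checks written for this example (email, Indian PAN, Aadhaar, IFSC); they are not the library's rule set, they do no checksum validation, and `find_pii` and `redact` are hypothetical names.

```python
import re

# Illustrative patterns only: format checks, no checksum validation.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII kind, omitting kinds with no matches."""
    found = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found


def redact(text: str) -> str:
    """Replace each match with a [KIND] placeholder, entirely locally."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{kind}]", text)
    return text


sample = "Contact a.rao@example.com, PAN ABCDE1234F, Aadhaar 2345 6789 0123"
print(find_pii(sample))
print(redact(sample))
```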
Multimodal
2 evaluators
Image-grounded and document-grounded faithfulness scoring — the eval layer behind the PDF Hell benchmark. Pairs with any vision model that can accept image inputs.
▸ VQAFaithfulness (image-grounded QAG)
▸ DocumentGrounding (multi-page docs)
Consistency
1 evaluator
Run-to-run stability across N samples — flags non-determinism in stochastic models or pipelines.
▸ SelfConsistency
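One simple way to put a number on run-to-run stability is the share of samples that agree with the most common output. The sketch below assumes that definition and a light normalization (trim and lowercase); how SelfConsistency actually scores is not specified here, and `self_consistency` is a hypothetical helper.

```python
from collections import Counter


def self_consistency(outputs: list[str]) -> float:
    """Fraction of samples matching the modal output after normalization."""
    if not outputs:
        raise ValueError("need at least one sample")
    normalized = [o.strip().lower() for o in outputs]
    _, modal_count = Counter(normalized).most_common(1)[0]
    return modal_count / len(normalized)


# Four samples of the same prompt; one disagrees.
print(self_consistency(["Annual", "annual ", "Monthly", "annual"]))
```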
What people use it for
RAG faithfulness
Verify the model didn't invent claims not in the retrieved context.
multivon-eval · RAG faithfulness
from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag-faithfulness")
suite.add_evaluators(Faithfulness())
suite.add_cases([
    EvalCase(
        input="What is the renewal period?",
        context="The agreement renews annually unless terminated.",
    )
])

# my_rag_model is the system under test, defined elsewhere.
# fail_threshold exits 1 in CI if pass_rate < 0.95
report = suite.run(my_rag_model, fail_threshold=0.95)
Agent trajectory eval
Did the agent call the right tool, in the right order, with the right args?
multivon-eval · agent trajectory eval
from multivon_eval import EvalSuite, ToolCallAccuracy, ToolCallNecessity
from multivon_eval.integrations import LangGraphTracer

suite = EvalSuite("agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),
    ToolCallNecessity(),  # penalises redundant tool calls
)

report = suite.run(my_agent, tracer=LangGraphTracer())
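The three questions (right tool, right order, right args) can be made concrete with a plain comparison of an expected and an observed call sequence. This is a standalone sketch of the idea, not how ToolCallAccuracy or ToolCallNecessity are implemented; `check_trajectory` and the tool names are hypothetical.

```python
from collections import Counter

Call = tuple[str, dict]  # (tool name, arguments)


def check_trajectory(expected: list[Call], actual: list[Call]) -> dict[str, bool]:
    """Answer: same tools? same order? same arguments?"""
    exp_names = [name for name, _ in expected]
    act_names = [name for name, _ in actual]
    right_tools = Counter(exp_names) == Counter(act_names)
    right_order = exp_names == act_names
    # Arguments are only compared position by position once the order matches.
    right_args = right_order and all(
        e_args == a_args for (_, e_args), (_, a_args) in zip(expected, actual)
    )
    return {"right_tools": right_tools, "right_order": right_order, "right_args": right_args}


expected = [("search_contract", {"query": "renewal"}), ("fetch_clause", {"id": 7})]
swapped = [("fetch_clause", {"id": 7}), ("search_contract", {"query": "renewal"})]
print(check_trajectory(expected, swapped))
```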
Compliance + audit pack
Generate a hash-chained audit log for procurement / SOC2 / EU AI Act.
multivon-eval · compliance + audit pack
from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.compliance import ComplianceReporter

reporter = ComplianceReporter(output_dir="./audit-logs", framework="eu-ai-act")

suite = EvalSuite("regulated-eval")
suite.add_evaluators(Faithfulness())

report = suite.run(my_model)
reporter.record(report, tags={"system": "my-product"})

# Then bundle as audit ZIP via:
#   multivon-eval audit-package --logs audit-logs \
#       --suite regulated-eval --framework eu-ai-act --out audit-pack.zip
Numbers you can verify in the repo
Every figure below comes from a reproducer in the public benchmarks/ directory. Click any tile to see the raw JSON.
F1 figures show the 95% bootstrap confidence interval in brackets. Where a cell's CI overlaps a competitor's CI, we don't call it a win. The headline Hallucination F1 (0.804 in-distribution on HaluEval-QA) was calibrated on the same split it was tested on; see the Limitations note. The tile above shows the figure that was actually held out across distributions: the Hallucination evaluator, run with an explicit Haiku JudgeConfig so that the calibrated threshold (0.55, frozen from the HaluEval-QA-100 calibration) is applied to HaluEval-Sum, scores F1 = 0.830 [0.70–0.92] on n=60. That is a different task family, and the calibration set is not the test set. The earlier launch claim of "F1 0.78 on held-out HaluEval-Sum (Faithfulness)" was wrong: Faithfulness is itself calibrated on HaluEval-Sum, so that figure was in-distribution. A follow-up iteration that quoted F1 0.85 at threshold 0.7 conflated the runtime-default threshold with the calibrated one. Both errors were caught by round-2 peer review within hours and corrected; the full thread is preserved in the benchmarks README.
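A bracketed interval like the ones quoted here can be produced with a percentile bootstrap: resample the labelled cases with replacement, recompute F1 each time, and read off the 2.5th and 97.5th percentiles. The sketch below is a generic version on toy data, assuming the threshold is frozen before the test set is touched. It is not the benchmark reproducer, and `f1_score`, `bootstrap_f1_ci`, and all the data are made up for illustration.

```python
import random


def f1_score(labels: list[bool], preds: list[bool]) -> float:
    """F1 of the positive class; 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(labels, preds, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1, resampling cases with replacement."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy judge scores and human labels; 0.55 stands in for a frozen threshold.
judge_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7, 0.4, 0.95, 0.1, 0.58]
labels = [True, True, False, True, False, False, True, True, False, True]
preds = [s >= 0.55 for s in judge_scores]
print(f1_score(labels, preds), bootstrap_f1_ci(labels, preds))
```

With n this small the interval is wide, which is the point of printing it next to the point estimate.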
The compliance machinery is load-bearing for buyers in healthcare, insurance, legal, and financial services. Every paid PDF Hell engagement ships an audit pack generated by these primitives.
Hash-chained audit log
Every evaluation is appended to an NDJSON log, with a SHA-256 hash linking each entry to the prior one. The log is tamper-evident: changing any entry invalidates every subsequent hash. Auditors verify with `reporter.verify()`.
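The tamper-evidence property follows directly from how each hash is built. The sketch below is a generic hash chain over JSON payloads, assuming SHA-256 over the previous hash plus a canonical serialization. It is not the ComplianceReporter implementation, and `append_entry`, `verify_chain`, and the entry layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": _entry_hash(prev, payload)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def to_ndjson(log: list[dict]) -> str:
    """One JSON object per line."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)


log: list[dict] = []
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.97})
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.95})
print(verify_chain(log))
```

Editing an early payload changes its hash, which no longer matches the `prev` stored in the next entry, so verification fails from that point onward.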
Compliance framework mappings
EU AI Act Articles 9, 10, 14; NIST AI RMF Govern-2 / Map-5 / Measure-3; HIPAA Safe Harbor (PII redaction); DPDP India (Aadhaar / PAN / GSTIN / IFSC). Each evaluator carries the framework controls it satisfies.
No data egress (on-prem judges)
JudgeConfig(base_url=…) routes any judge call to an OpenAI-compatible endpoint (Ollama, vLLM, LM Studio, on-prem). Production data never leaves the VPC.
Suite lock + evaluator fingerprinting
SuiteLock + EvaluatorFingerprint detect suite-shape and judge-config changes between runs. A failing lockfile means an auditor can prove the eval was the eval that was contracted. Prompt drift in your application code is the staleness report's job (above) — the two drift detectors are deliberately orthogonal.
Where to next
The SDK docs cover every evaluator, every CLI command, every integration point. The benchmark repo has reproducers for every published number. PDF Hell is the benchmark-as-product layer on top.