multivon-eval — the calibration-first eval framework.

Wilson + bootstrap CIs on every number, judge thresholds calibrated against human labels, and we refuse to claim wins where CIs overlap. 44 evaluators across seven categories — deterministic, LLM-judge (QAG), agent-trace, compliance, multimodal, conversation, and consistency — plus cost tracking, hash-chained audit logs, and a pytest plugin. Apache 2.0 on PyPI; no telemetry, no signup. Powers the public leaderboard at /leaderboard.
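To make the interval language concrete, here is a minimal sketch of a Wilson score interval, a percentile bootstrap interval, and the "no win where CIs overlap" rule. It uses only the standard library; the function names are illustrative assumptions and are not multivon-eval's own API.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))


def bootstrap_interval(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (lo, hi)


def intervals_overlap(a, b) -> bool:
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def can_claim_win(ours, theirs) -> bool:
    """Claim a win only if our interval sits entirely above theirs."""
    return ours[0] > theirs[1]
```

For example, `wilson_interval(45, 50)` gives roughly 0.79 to 0.96, which is noticeably wider than the point estimate of 0.90 suggests.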

Bootstrap CLI

The fastest path: multivon-eval bootstrap

Hand over a one-paragraph product description and a few sample traces — the bootstrap CLI infers your product shape, picks 4-6 evaluators tuned to it, calibrates thresholds from your traces, and emits a runnable suite plus 30 adversarial seed cases. PII is redacted locally before any LLM call. Total cost: ~$0.12 per bootstrap (hard ceiling $0.15).

1. Install
2. Bootstrap from your product + traces
3. Review the discovery report, then run the generated suite

Outputs four files: eval_suite.py (runnable suite), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), and DISCOVERY_REPORT.md (a forwardable eval design review). See the full bootstrap guide.

Or scaffold from a template

If you'd rather write your own cases from a runnable starter, the init scaffolder gives you a clean blueprint.

1. Scaffold a starter project: pick a template
2. Run it

Templates: quickstart, rag, agent, agent-langgraph, agent-openai-sdk, regulated, conversation. Each is a runnable starter — eval.py, requirements.txt, optional CI workflow.

Intelligent eval

Intelligent eval primitives: multivon_eval.auto

The bootstrap CLI is built on a programmatic API you can call directly when you want fine-grained control or want to compose pieces into your own pipeline. Each primitive is documented + tested independently.

  1. auto_evaluators(case)

    heuristic · 0 LLM cost · microseconds

    Pass an EvalCase, get back a ranked list of recommended evaluators with primary / secondary / guardrail tiers and confidence. Pure pattern-match on case shape (RAG vs QA vs agent vs conversation). No LLM call.

  2. generate_adversarial_cases(seed, mode, n)

    LLM-generated · ~$0.02

    Targets 10 named failure modes: ungrounded_claim, jailbreak, prompt_injection_direct / indirect, tool_injection, pii_leakage_invitation, tool_misuse, numeric_edge, off_topic, format_violation. Stress-test tags embedded in metadata so the right evaluator routes automatically.

  3. validate_adversarial_cases(cases, baseline)

    N-shot judge-noise filter

    Runs each case N times against your baseline + the eval, computes per-case failure_rate, filters by hardness band. Catches generation noise; surfaces the cases that genuinely separate weak vs strong models. Validated +0.80 separation.
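The N-shot filter in the third primitive can be sketched as follows. This is an illustration of the idea only, not the library's implementation: `run_once`, the dict-shaped cases, and the default band of 0.2 to 0.8 are all assumptions made for the example.

```python
from typing import Callable


def failure_rate(case: dict, run_once: Callable[[dict], bool], n_shots: int = 5) -> float:
    """Run a case n_shots times; run_once returns True when the case passes."""
    failures = sum(0 if run_once(case) else 1 for _ in range(n_shots))
    return failures / n_shots


def filter_by_hardness(cases, run_once, n_shots=5, band=(0.2, 0.8)):
    """Keep cases whose failure rate falls inside the (assumed) hardness band.

    Cases that always pass are too easy to separate models, and cases that
    always fail may be generation noise, so both ends are dropped.
    """
    low, high = band
    kept = []
    for case in cases:
        rate = failure_rate(case, run_once, n_shots)
        if low <= rate <= high:
            kept.append({**case, "failure_rate": rate})
    return kept
```

Repeating each case is what separates judge noise from real difficulty: a single run cannot distinguish a flaky case from a hard one.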

Full walkthrough in the bootstrap guide. The `multivon-eval bootstrap` CLI above composes all three primitives + PII redaction + threshold calibration into one command.
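The shape-based routing described for auto_evaluators can be pictured with a small heuristic like the one below. It is a sketch under stated assumptions: cases are plain dicts, the field names `tool_calls`, `turns`, and `expected_output` are hypothetical, and the real primitive also returns a confidence value that this example omits. Only the evaluator names are taken from this page.

```python
def recommend_evaluators(case: dict) -> list[tuple[str, str]]:
    """Return (evaluator_name, tier) pairs based only on which fields are present."""
    recommendations = []
    if case.get("tool_calls"):          # agent-shaped case (hypothetical field)
        recommendations += [("ToolCallAccuracy", "primary"),
                            ("TaskCompletion", "secondary")]
    elif case.get("turns"):             # conversation-shaped case (hypothetical field)
        recommendations += [("ConversationRelevance", "primary"),
                            ("KnowledgeRetention", "secondary")]
    elif case.get("context"):           # RAG-shaped case
        recommendations += [("Faithfulness", "primary"),
                            ("ContextPrecision", "secondary")]
    elif case.get("expected_output"):   # plain QA case (hypothetical field)
        recommendations += [("AnswerAccuracy", "primary")]
    # A cheap deterministic check is always useful as a guardrail.
    recommendations.append(("NotEmpty", "guardrail"))
    return recommendations
```

Because the decision depends only on which fields exist, it needs no LLM call and runs in negligible time.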

New in 0.10.0

Evals drift as code changes: multivon-eval staleness

Prompts evolve, eval suites go stale, and nobody notices until a regression sails through. 0.10.0 ships the detection layer: a committed prompt_baseline.json snapshot of every prompt call site in your repo, a read-only report that tells you exactly which prompts changed since your cases were authored, and an opt-in provenance layer binding cases to the prompts they exercise. Matching is content-first: line numbers and git SHAs are display-only, never matching inputs, so a whitespace refactor or a rebase produces zero false staleness. Prompts the scanner can’t statically read are reported UNKNOWN, never fake-fresh.

1. Read-only drift report: CHANGED / REMOVED / ADDED / UNKNOWN, exit 0 by default
2. Write or refresh the committed baseline (bootstrap writes one automatically)
3. Bind hand-written cases to the prompt call sites they exercise

multivon-eval staleness · sample report
static scan only · no LLM call · no network
baseline: prompt_baseline.json (a1b2c3d, 9 days ago, scanner v2)
determinacy: 11 of 14 call sites statically resolvable; verdicts
  below cover only those. 3 dynamic sites are unknown-by-construction.

CHANGED (2) — prompt text differs from baseline
  extractors/invoice.py:42  anthropic.system  in Extractor.extract
    fp 3fa9c1d2… → 8be07a11…
    bound cases: seed_cases.jsonl #4 #5 #9
    what changed: git diff a1b2c3d..HEAD -- extractors/invoice.py
  router/triage.py:17  anthropic.system  in Router.route
    [formatting-only — loose fingerprint unchanged]

REMOVED (1) — call site not found by static scan
  summarize/digest.py  openai.user  in build_digest
    note: feature removed, OR renamed+edited in one commit,
    OR prompt moved beyond static reach.

ADDED since baseline (1)
  rag/answer.py:61  anthropic.system  in answer
    → no cases reference this prompt

UNKNOWN (3) — dynamic prompts; static scan cannot verify their text

blind spots: does not see OpenAI Responses API, positional message
  args, prompts in files/templates/hub, non-Python services.
exit 0 (report-only — add --fail-on changed,removed in CI)
every report opens with the determinacy headline and closes with the blind-spots footer
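The strict and loose fingerprints in the sample report can be approximated as below. This is a simplified sketch, not the scanner's actual algorithm: the call-site keys are hypothetical and deliberately carry no line numbers, and formatting-only changes get their own bucket here even though the report lists them under CHANGED with a note.

```python
import hashlib
import re


def strict_fingerprint(prompt: str) -> str:
    """Hash of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def loose_fingerprint(prompt: str) -> str:
    """Hash after collapsing whitespace, so reformatting alone does not change it."""
    normalized = re.sub(r"\s+", " ", prompt).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(baseline: dict, current: dict) -> dict:
    """Compare two {call_site_key: prompt_text} maps.

    A value of None stands for a prompt that could not be read statically.
    """
    names = ("CHANGED", "FORMATTING_ONLY", "REMOVED", "ADDED", "UNKNOWN")
    report = {name: [] for name in names}
    for key, old_text in baseline.items():
        if key not in current:
            report["REMOVED"].append(key)
            continue
        new_text = current[key]
        if old_text is None or new_text is None:
            report["UNKNOWN"].append(key)
        elif strict_fingerprint(old_text) == strict_fingerprint(new_text):
            continue  # byte-identical, nothing to report
        elif loose_fingerprint(old_text) == loose_fingerprint(new_text):
            report["FORMATTING_ONLY"].append(key)
        else:
            report["CHANGED"].append(key)
    for key, new_text in current.items():
        if key not in baseline:
            report["UNKNOWN" if new_text is None else "ADDED"].append(key)
    return report
```

Keying on content rather than position is what lets a rebase or a moved function leave the report empty.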

We measured this rather than assuming it: scanner v3 (0.10.1) ran the determinacy gate against five real OSS repos — aider, gpt-researcher, open-interpreter, letta, pr-agent — and found 278 prompt call sites, 20.9% of them statically resolvable. That failed our own 50% gate, and the result is published with the per-repo table on issue #4. The report’s first line tells you your repo’s exact ratio; the runtime recorder (issue #9) shipped in 0.11.0 as the path past the static ceiling: pytest --record-prompts captures runtime fingerprints that render as a separate OBSERVED tier — recordings prove the renderings observed, never all renderings.

--fail-on changed,removed turns selected categories into a CI gate (exit 0 report-only by default); --format markdown drops straight into $GITHUB_STEP_SUMMARY. REMOVED always carries the three-way caveat — it’s a prompt to investigate, never an auto-delete signal. Full design + deferred work on the staleness epic and the 0.10.0 changelog.

44 evaluators across 7 categories

Pick the category that fits your output. Deterministic for cheap, fast gates. QAG when you need an LLM judge but don't want vibes-based scoring. Agent for tool-use evaluation. Conversation for dialogue. Compliance + multimodal for regulated / document-AI workflows.

Deterministic

14 evaluators

Pure-Python checks that don't call an LLM. Cheap to run, cheap to gate CI on. Use as a first-pass filter or for outputs that have a single correct answer.

  • ExactMatch, Contains, RegexMatch, StartsWith
  • JSONSchemaEval (Pydantic validation)
  • BLEU, ROUGE, BERTScore, Levenshtein, ChrfScore
  • Latency, MaxLatency, WordCount, NotEmpty

LLM-judge (QAG)

13 evaluators

Question-Answer-Generation: instead of asking an LLM to rate 1–10 (noisy), generate yes/no questions about the output and score by fraction answered correctly. Calibrated thresholds per judge model.

  • Faithfulness, Hallucination, Relevance
  • AnswerAccuracy, Coherence
  • ContextPrecision, ContextRecall
  • Toxicity, Bias, Summarization
  • CustomRubric, GEval, CheckEvaluator
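The QAG scoring idea, plus one plausible way to calibrate a threshold against human labels, can be sketched like this. The threshold sweep that maximizes F1 is an assumption chosen for illustration; this page does not state which calibration procedure multivon-eval uses, and the function names are hypothetical.

```python
def qag_score(answers: list[bool]) -> float:
    """Fraction of yes/no verification questions answered 'yes'."""
    return sum(answers) / len(answers) if answers else 0.0


def f1_at(threshold: float, scores: list[float], human_bad: list[bool]) -> float:
    """F1 for detecting bad outputs, where score < threshold flags an output as bad."""
    tp = fp = fn = 0
    for score, is_bad in zip(scores, human_bad):
        flagged = score < threshold
        if flagged and is_bad:
            tp += 1
        elif flagged and not is_bad:
            fp += 1
        elif not flagged and is_bad:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(scores, human_bad, candidates=None) -> float:
    """Pick the candidate threshold with the best F1 (first one wins on ties)."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    return max(candidates, key=lambda t: f1_at(t, scores, human_bad))
```

A threshold chosen this way should then be frozen and evaluated on a different split, otherwise the reported F1 is in-distribution.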

Agent-trace

8 evaluators

Evaluate the trajectory, not just the final answer. Pairs with LangGraph + OpenAI Agents SDK tracers. Did the agent call the right tools, in the right order, with the right arguments?

  • ToolCallAccuracy, ToolArgumentAccuracy, ToolCallNecessity
  • PlanQuality, TaskCompletion, StepFaithfulness
  • TrajectoryEfficiency, AgentMemoryEval
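To show what "right tools, right order, right arguments" can mean in practice, here is a minimal position-wise comparison of an expected and an actual trajectory. The dict layout with `name` and `args` keys is a hypothetical trace format, and the scoring is a simplification rather than how the listed evaluators are implemented.

```python
def score_trajectory(expected: list[dict], actual: list[dict]) -> dict:
    """Compare tool calls position by position.

    Each call is a dict like {"name": "search", "args": {...}} (assumed shape).
    """
    pairs = list(zip(expected, actual))
    name_hits = sum(e["name"] == a["name"] for e, a in pairs)
    arg_hits = sum(
        e["name"] == a["name"] and e["args"] == a["args"] for e, a in pairs
    )
    n = len(expected)
    return {
        "name_accuracy": name_hits / n if n else 1.0,
        "argument_accuracy": arg_hits / n if n else 1.0,
        # Calls beyond the expected plan hint at redundant tool use.
        "extra_calls": max(0, len(actual) - n),
    }
```

Reporting names, arguments, and extra calls separately makes it easier to see whether an agent picked the wrong tool or used the right tool badly.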

Conversation

4 evaluators

Multi-turn dialogue evaluators. Useful for support chatbots and any system where context accumulates across turns.

  • ConversationRelevance, ConversationCompleteness
  • KnowledgeRetention, TurnConsistency

Compliance

2 evaluators

Audit-grade evaluators that run locally — no LLM judge needed. PII detection with HIPAA, GDPR, CCPA, and DPDP India jurisdictions. JSON Schema validation for structured outputs.

  • PIIEvaluator (regex, zero egress)
  • SchemaEvaluator (Pydantic / JSON Schema)
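Regex-based PII detection with no network calls can look roughly like the sketch below. The patterns are deliberately simplified versions of the public formats for email addresses, PAN, IFSC, and Aadhaar numbers; they are assumptions for illustration, not the patterns PIIEvaluator ships, and they will miss or over-match edge cases.

```python
import re

# Simplified, illustrative patterns (not exhaustive, no checksum validation).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "aadhaar": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return {kind: [matches]} for every pattern that hits."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[kind] = matches
    return hits


def redact(text: str) -> str:
    """Replace each match with a bracketed tag such as [EMAIL]."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{kind.upper()}]", text)
    return text
```

Everything here runs in-process, which is the property that matters for a zero-egress check.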

Multimodal

2 evaluators

Image-grounded and document-grounded faithfulness scoring — the eval layer behind the PDF Hell benchmark. Pairs with any vision model that can accept image inputs.

  • VQAFaithfulness (image-grounded QAG)
  • DocumentGrounding (multi-page docs)

Consistency

1 evaluator

Run-to-run stability across N samples — flags non-determinism in stochastic models or pipelines.

  • SelfConsistency
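Run-to-run stability can be measured in several ways; one simple option is the share of samples that agree with the most common output. The sketch below assumes exact match after lowercasing and whitespace normalization, which is an illustrative choice and not necessarily what SelfConsistency does.

```python
from collections import Counter


def self_consistency(outputs: list[str]) -> float:
    """Share of samples that match the most common normalized output."""
    if not outputs:
        return 0.0
    normalized = [" ".join(o.lower().split()) for o in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

A score of 1.0 means every sample agreed; lower values flag non-determinism worth investigating before trusting a single-run result.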

What people use it for

RAG faithfulness

Verify the model didn't invent claims not in the retrieved context.

from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag-faithfulness")
suite.add_evaluators(Faithfulness())
suite.add_cases([EvalCase(
    input="What is the renewal period?",
    context="The agreement renews annually unless terminated.",
)])
# my_rag_model is your own system under test; it is not defined in this snippet.
# fail_threshold exits 1 in CI if pass_rate < 0.95
report = suite.run(my_rag_model, fail_threshold=0.95)

Agent trajectory eval

Did the agent call the right tool, in the right order, with the right args?

from multivon_eval import EvalSuite, ToolCallAccuracy, ToolCallNecessity
from multivon_eval.integrations import LangGraphTracer

suite = EvalSuite("agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),
    ToolCallNecessity(),  # penalises redundant tool calls
)
# my_agent is your own agent under test; it is not defined in this snippet.
report = suite.run(my_agent, tracer=LangGraphTracer())

Compliance + audit pack

Generate a hash-chained audit log for procurement / SOC2 / EU AI Act.

from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.compliance import ComplianceReporter

reporter = ComplianceReporter(output_dir="./audit-logs", framework="eu-ai-act")
suite = EvalSuite("regulated-eval")
suite.add_evaluators(Faithfulness())
# my_model is your own system under test; it is not defined in this snippet.
report = suite.run(my_model)
reporter.record(report, tags={"system": "my-product"})
# Then bundle as audit ZIP via:
#   multivon-eval audit-package --logs audit-logs \
#     --suite regulated-eval --framework eu-ai-act --out audit-pack.zip

Numbers you can verify in the repo

Every figure below comes from a reproducer in the public benchmarks/ directory. Click any tile to see the raw JSON.

F1 figures show the 95% bootstrap confidence interval in brackets. Where a cell's CI overlaps a competitor's CI, we don't call it a win.

The headline Hallucination F1 (0.804 on HaluEval-QA) is in-distribution: it was calibrated on the same split it was tested on (see the Limitations note). The tile above shows the genuinely held-out, cross-distribution figure. There, the Hallucination evaluator runs with an explicit Haiku JudgeConfig so that the calibrated threshold (0.55, frozen from the HaluEval-QA-100 calibration) is applied to HaluEval-Sum: F1 = 0.830 [0.70–0.92] on n=60. The task family is different, and the calibration set is distinct from the test set.

Two earlier claims were wrong. The launch claim of "F1 0.78 on held-out HaluEval-Sum (Faithfulness)" was in-distribution, because Faithfulness is itself calibrated on HaluEval-Sum. A follow-up iteration that quoted F1 0.85 at threshold 0.7 conflated the runtime-default threshold with the calibrated one. Round-2 peer review caught both errors within hours and they were corrected; the full thread is preserved in the benchmarks README.

Built for regulated AI

The compliance machinery is load-bearing for buyers in healthcare, insurance, legal, and financial services. Every paid PDF Hell engagement ships an audit pack generated by these primitives.

Hash-chained audit log

Every evaluation appended to an NDJSON log with SHA-256 linking to the prior entry. Tamper-evident: changing any entry invalidates every subsequent hash. Auditors verify with `reporter.verify()`.

Compliance framework mappings

EU AI Act Articles 9, 10, 14; NIST AI RMF Govern-2 / Map-5 / Measure-3; HIPAA Safe Harbor (PII redaction); DPDP India (Aadhaar / PAN / GSTIN / IFSC). Each evaluator carries the framework controls it satisfies.

No data egress (on-prem judges)

JudgeConfig(base_url=…) routes any judge call to an OpenAI-compatible endpoint (Ollama, vLLM, LM Studio, on-prem). Production data never leaves the VPC.

Suite lock + evaluator fingerprinting

SuiteLock + EvaluatorFingerprint detect suite-shape and judge-config changes between runs. A failing lockfile means an auditor can prove the eval was the eval that was contracted. Prompt drift in your application code is the staleness report's job (above) — the two drift detectors are deliberately orthogonal.

Where to next

The SDK docs cover every evaluator, every CLI command, every integration point. The benchmark repo has reproducers for every published number. PDF Hell is the benchmark-as-product layer on top.