Verified on multivon-eval v0.15.1

Real evaluations on real data. Reproducible from the SDK.

10 case studies covering the breadth of what the SDK ships today — the four document/RAG/QA originals plus bootstrap CLI, agent trajectory eval, on-prem (Ollama) judging, experiment compare, audit packs, and an MCP / Claude Code session. The four originals live as runnable scripts in the multivon-eval repo, with their terminal output captured from the run; the rest mirror documented CLI and SDK workflows. Copy a script, set the env vars, and run python ....

Case 01 | Real run · reproducible | Anthropic claude-haiku-4-5 · <$0.05

RAG faithfulness over an insurance knowledge base

A RAG bot must answer only from the retrieved docs. One ungrounded sentence should block the deploy.

Five questions against a 5-document Acme Auto insurance KB. Four answers are fully grounded; the fifth invents a $75/day rental reimbursement limit that does not appear anywhere in the docs.

Faithfulness extracts every factual claim from each answer and verifies it against the retrieved context using an Anthropic claude-haiku-4-5 judge (QAG decomposition). Relevance scores whether the answer addresses the question at all.
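
As a rough illustration of the scoring idea described above, the sketch below treats faithfulness as the fraction of extracted claims that a verifier accepts. It is not the SDK's implementation: `faithfulness_score` is a hypothetical helper, the claims are hand-written rather than extracted by a judge, and a plain substring check stands in for the LLM verification step.

```python
from typing import Callable, List


def faithfulness_score(claims: List[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of claims the verifier marks as supported by the context."""
    if not claims:
        return 1.0  # assumption: an answer with no factual claims is not penalised
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)


# Stand-in verifier: a real run would ask the LLM judge one yes/no question per claim.
context = "The agreement renews annually. Either party may cancel with 30-day notice."
claims = ["renews annually", "30-day notice", "automatic upgrade"]

score = faithfulness_score(claims, lambda claim: claim in context)
print(round(score, 3))  # 0.667
```

One unsupported claim out of three is enough to pull the score well below a strict threshold, which is the behaviour the case study relies on.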

Faithfulness · Relevance

01_rag_insurance_faithfulness.py

from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure

configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))

CASES = [
    # 4 grounded answers + 1 deliberately ungrounded:
    {"question": "What is the rental car reimbursement limit on a standard policy?",
     "answer":   "Acme reimburses rental cars at up to $75 per day for a maximum "
                 "of 30 days. Premium policyholders receive unlimited rental "
                 "reimbursement."},  # ← not in the KB at all
    # ...
]

cases = [EvalCase(input=c["question"], context=FULL_CONTEXT) for c in CASES]
suite = EvalSuite("RAG Faithfulness — Insurance KB")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(lambda q: next(c["answer"] for c in CASES if c["question"] == q))

terminal output

─────────── RAG Faithfulness — Insurance KB ───────────
  Model: precomputed-answers

  #   Input                              Score  Status
  1   What is the standard deductible…   0.75   PASS
  2   How do I file a claim and how lo…  1.00   PASS
  3   What discounts are available for…  0.88   PASS
  4   Does my auto policy cover items …  0.88   PASS
  5   What is the rental car reimburs…   0.38   FAIL

           By Evaluator
  Evaluator      Avg Score   Pass Rate
  faithfulness        0.80         80%
  relevance           0.75        100%

  Summary
  Total: 5   Passed: 4   Failed: 1
  Pass Rate: 80.0% [38%–96% 95% CI]
  Avg Score: 0.78 [0.57–0.93]

  === Faithfulness verdict per case ===
  [PASS] faithfulness=1.00  Q: What is the standard deductible for collision coverage?
  [PASS] faithfulness=1.00  Q: How do I file a claim and how long does it take?
  [PASS] faithfulness=1.00  Q: What discounts are available for bundling and safe driving?
  [PASS] faithfulness=1.00  Q: Does my auto policy cover items stolen from inside my car?
  [FAIL] faithfulness=0.00  Q: What is the rental car reimbursement limit on a standard pol...

  Result: FAIL — 1/5 case(s) ungrounded.

What this caught

The judge flagged every claim in the fabricated rental-reimbursement answer as not supported by the context — faithfulness 0.00 on case 5, while every grounded answer scored 1.00.

View source on GitHub →
$ pip install multivon-eval && export ANTHROPIC_API_KEY=... && python 01_rag_insurance_faithfulness.py
Case 02 | Real run · reproducible | OpenAI gpt-4o (vision) · <$0.30

Contract analysis trap — does GPT-4o catch a 6pt footnote?

Body text says liability is capped. A 6pt footnote at the bottom overrides it for specific clauses. Vision models routinely miss the footnote.

pdfhell's footnote_override generator produces a Master Services Agreement from code. The body confidently caps liability at 3 months of fees; a 6pt footnote carves out Sections 4.2, 6.1, and 6.2 as uncapped. The answer key is exact because the generator chose the numbers — no LLM-as-judge in the scoring loop.

GPT-4o (vision) reads the PDF and answers the question. pdfhell.score_case does a whitespace-tolerant contains-match against expected_tokens ("3 month", "uncapped", "4.2", "6.1", "6.2") and against the body-only forbidden answer.
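
A whitespace-tolerant contains-match of this kind can be sketched in a few lines. This is a guess at the general shape, not pdfhell's code: `normalize` and `contains_score` are hypothetical names, and the rule that the trap only counts when the carve-out tokens are missing is an assumption.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lower-case and collapse every whitespace run to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_score(output: str, expected_tokens: List[str],
                   forbidden_answers: List[str]) -> Dict[str, object]:
    haystack = normalize(output)
    matched_expected = all(normalize(tok) in haystack for tok in expected_tokens)
    matched_forbidden = [ans for ans in forbidden_answers if normalize(ans) in haystack]
    return {
        "correct": matched_expected,
        "matched_expected": matched_expected,
        "matched_forbidden": matched_forbidden,
        # Assumption: the trap only counts when the carve-out tokens are missing.
        "fell_for_trap": bool(matched_forbidden) and not matched_expected,
    }


tokens = ["3 month", "uncapped", "4.2", "6.1", "6.2"]
forbidden = ["Liability is capped at 3 months of fees paid."]
good = "Liability is capped at 3 months of fees,\nbut is uncapped for Sections 4.2, 6.1, and 6.2."
trap = "Liability is capped at 3 months   of fees paid."

print(contains_score(good, tokens, forbidden)["correct"])        # True
print(contains_score(trap, tokens, forbidden)["fell_for_trap"])  # True
```

Because the check is pure string matching, the same model output always yields the same verdict, which is what keeps an LLM judge out of the scoring loop.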

pdfhell.score_case · expected_tokens · forbidden_answers

02_contract_pdfhell_footnote.py

from openai import OpenAI
from pdfhell.generators.footnote_override import generate
from pdfhell.scorer import score_case
import base64

# Generate the adversarial PDF deterministically from a seed.
pdf_bytes, case = generate(seed=2)
# case.expected_tokens == ["3 month", "uncapped", "4.2", "6.1", "6.2"]
# case.forbidden_answers[0] == "Liability is capped at 3 months of fees paid."

pdf_b64 = base64.b64encode(pdf_bytes).decode("ascii")
response = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "file", "file": {
            "filename": "contract.pdf",
            "file_data": f"data:application/pdf;base64,{pdf_b64}"}},
        {"type": "text", "text": case.question},
    ]}],
    max_tokens=400,
)

# Code-based ground truth — no LLM-as-judge.
result = score_case(case, response.choices[0].message.content)

terminal output

==============================================================================
Contract analysis trap — footnote_override (pdfhell)
==============================================================================
PDF:              02_contract.pdf  (2,542 bytes)
Trap family:      footnote_override
Question to model: Read the attached Master Services Agreement. What is the
                   LIABILITY CAP and what carve-outs (if any) apply? Be precise
                   about which Sections are uncapped.
Expected answer:  Liability is capped at 3 months of fees paid, EXCEPT that
                  liability arising from Sections 4.2, 6.1, 6.2 is uncapped.
Expected tokens:  ['3 month', 'uncapped', '4.2', '6.1', '6.2']
Forbidden answer: Liability is capped at 3 months of fees paid.

Calling gpt-4o ...

=== Model output ===
The LIABILITY CAP in the Master Services Agreement limits the aggregate liability
of either party to an amount equal to 3 months of fees paid by the Customer during
the twelve (12) month period immediately preceding the event giving rise to such
liability.

Carve-outs that apply, per the document, mean that liability is uncapped for claims
arising from Sections 4.2, 6.1, and 6.2.

=== pdfhell score ===
  correct:           True
  matched_expected:  True
  fell_for_trap:     False
  matched_forbidden: []
  refused:           False

Result: PASS — model captured the footnote carve-out.

What this caught

On this seed GPT-4o actually caught the footnote — every required token appears in the output. Re-run with --seed 6, 7, 10… across the full mini suite and the pass rate drops to 8/10 with Wilson 95% CI [0.49, 0.94]. The same code reproduces both outcomes.
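
For readers who want to check the interval by hand, the following is a minimal sketch of the standard Wilson score interval. `wilson_interval` is a hypothetical helper written for this page, not a function from the SDK; for 8 passes out of 10 it gives the same bounds quoted above.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z ~ 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(8, 10)
print(f"[{low:.2f}, {high:.2f}]")  # [0.49, 0.94]
```

The width of that interval is the point: ten cases cannot distinguish a 50% model from a 90% one, so small suites are best read as a range rather than a single number.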

View source on GitHub →
$ pip install multivon-eval pdfhell && export OPENAI_API_KEY=... && python 02_contract_pdfhell_trap.py
Case 03 | Real run · reproducible | Anthropic claude-haiku-4-5 · <$0.15

Customer support QA — three evaluators on the same bot

A support bot fabricates Apple Pay support, invents a 10am overnight guarantee, and defers vaguely twice. Faithfulness, Relevance, and a plain-English check all need to fire.

Ten support tickets with their retrieved KB context and the bot's reply. Six replies are well-grounded; two fabricate facts (Apple Pay, 10am overnight guarantee); two defer vaguely without naming the problem.

Three evaluators run in parallel. Faithfulness catches the fabrications. Relevance catches the off-topic deferrals. CheckEvaluator is given an English criterion ("Response should name the customer's specific problem and provide a concrete next step") with pinned yes/no questions for CI reproducibility — it catches every vague deferral.

Faithfulness · Relevance · CheckEvaluator

03_support_qa_multi_evaluator.py

from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure
from multivon_eval.evaluators.llm_judge import CheckEvaluator

configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))

suite = EvalSuite("Customer Support QA")
suite.add_cases(cases)  # 10 tickets with input + context + bot answer
suite.add_evaluators(
    Faithfulness(),
    Relevance(),
    CheckEvaluator(
        criterion="Response should name the customer's specific problem and "
                  "provide a concrete next step.",
        questions=[  # pin the questions for CI reproducibility
            "Does the response name or restate the customer's specific problem?",
            "Does the response provide a concrete next step or action?",
            "Does the response avoid vague deferrals like 'we will look into this'?",
        ],
        name="actionability",
    ),
)
report = suite.run(model_fn)

terminal output

─────── Customer Support QA ───────
  Model: precomputed-answers

  #   Input                              Score  Status
  1   How long does standard shipping…   0.69   FAIL
  2   Can I return a sale item?          0.69   FAIL
  3   I forgot my password — how do I…   1.00   PASS
  4   Do you accept Apple Pay?           0.53   FAIL
  5   How do I cancel my subscription?   0.92   PASS
  6   The app is showing a white scre…   0.42   FAIL
  7   How long do refunds take after …   0.81   FAIL
  8   I want to delete my account.       0.81   FAIL
  9   When will my order arrive if I …   0.47   FAIL
 10   Refund for order #4839 — it nev…   0.50   FAIL

           By Evaluator
  Evaluator       Avg Score   Pass Rate
  faithfulness         0.92         80%
  relevance            0.70         90%
  actionability        0.43         20%

  Summary
  Total: 10   Passed: 2   Failed: 8
  Pass Rate: 20.0% [6%–51% 95% CI]

  === Per-case breakdown ===
  [FAIL]  T1  faith=1.00✓  rel=0.75✓  act=0.33✗
  [FAIL]  T2  faith=1.00✓  rel=0.75✓  act=0.33✗
  [PASS]  T3  faith=1.00✓  rel=1.00✓  act=1.00✓
  [FAIL]  T4  faith=0.83✗  rel=0.75✓  act=0.00✗
  [PASS]  T5  faith=1.00✓  rel=0.75✓  act=1.00✓
  [FAIL]  T6  faith=1.00✓  rel=0.25✗  act=0.00✗
  [FAIL]  T7  faith=1.00✓  rel=0.75✓  act=0.67✗
  [FAIL]  T8  faith=1.00✓  rel=0.75✓  act=0.67✗
  [FAIL]  T9  faith=0.33✗  rel=0.75✓  act=0.33✗
  [FAIL] T10  faith=1.00✓  rel=0.50✓  act=0.00✗

  Result: FAIL — pass rate below 70% gate.

What this caught

Faithfulness flagged T4 (Apple Pay, not in the KB) and T9 (the 10am overnight guarantee, not in the KB). The plain-English actionability check scored T6 and T10 at 0.00 ("we'll look into this", "please contact support") because neither restated the problem nor named a next step. It also held six other replies below threshold, so only T3 and T5 passed all three evaluators.

View source on GitHub →
$ pip install multivon-eval && export ANTHROPIC_API_KEY=... && python 03_support_qa_multi_evaluator.py
Case 04 | Real run · reproducible | no API call · $0

PII detection over medical records — offline, deterministic, $0

Regulated environments cannot send PHI to a third-party judge. PII detection must run locally on every output before it leaves the building.

Five synthetic medical record snippets. Three are clean; two contain leaked PII (SSN + phone in one, email + MRN in another). PIIEvaluator runs entirely on regex pattern libraries scoped to a chosen jurisdiction (hipaa here) — no API key, no network call.

This is the same evaluator that ships in multivon-eval's compliance tier. The point is that NOT every eval needs an LLM. Deterministic checks for PII, schema validity, regex match, and exact match should run first — they are cheap, reproducible, and audit-friendly.
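
To make the "deterministic checks first" point concrete, here is a minimal sketch of a regex-only PII scan. The four patterns and the `find_pii` helper are illustrative assumptions, far narrower than what a real compliance library would ship, and they are not PIIEvaluator's pattern set.

```python
import re
from typing import Dict, List

# Illustrative patterns only; a production library needs many more cases.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "medical_record_number": re.compile(r"\bMRN-\d+\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every pattern name that matched, with the matched strings."""
    hits = {name: rx.findall(text) for name, rx in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


print(find_pii("Patient John Doe (SSN 529-87-3461). Spouse contact: 415-555-0182."))
print(find_pii("Routine post-op check after laparoscopic surgery."))
```

A scan like this costs nothing to run on every output, so it makes sense as a gate ahead of any judge-based evaluator.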

PIIEvaluator (jurisdiction="hipaa")

04_pii_medical_records.py

from multivon_eval import EvalCase, EvalSuite
from multivon_eval.evaluators.compliance import PIIEvaluator

RECORDS = [
    {"id": "MR1", "text": "Patient presented with mild dehydration…"},
    # Structurally valid (but fictional) SSN — the evaluator's strict mode
    # drops never-issued placeholders like 123-45-6789 as false positives.
    {"id": "MR2", "text": "Patient John Doe (SSN 529-87-3461) was admitted on "
                          "03/14/2025… Spouse contact: 415-555-0182."},
    {"id": "MR3", "text": "Routine post-op check after laparoscopic…"},
    {"id": "MR4", "text": "MRN-4471829 — patient reports migraine. Family physician "
                          "notified via patient@example.com…"},
    {"id": "MR5", "text": "Pediatric well-child visit. Growth on expected curve…"},
]

suite = EvalSuite("PII Detection — Medical Records")
suite.add_cases([EvalCase(input=r["id"]) for r in RECORDS])
# "hipaa" adds MRN, fax, admission dates, NPI/DEA, etc. on top of base patterns.
suite.add_evaluators(PIIEvaluator(jurisdiction="hipaa"))

report = suite.run(lambda rid: next(r["text"] for r in RECORDS if r["id"] == rid))
# No API key required. No network call. Fully deterministic.

terminal output

─────── PII Detection — Medical Records ───────
  Model: static-records

  #   Input   Output                              Score   Status
  1   MR1     Patient presented with mild deh…    1.00    PASS
  2   MR2     Patient John Doe (SSN 529-87-34…    0.00    FAIL
  3   MR3     Routine post-op check after lap…    1.00    PASS
  4   MR4     MRN-4471829 — patient reports m…    0.00    FAIL
  5   MR5     Pediatric well-child visit. Gro…    1.00    PASS

  === Per-record PII findings ===
  [CLEAN] MR1  No PII detected
  [LEAK ] MR2
           phone_us: "415-555-0182"
           ssn: "529-87-3461"
           address: "2025 with chest"
           admission_date: "admitted on 03/14/2025"
           patient_name: "John Doe"
  [CLEAN] MR3
  [LEAK ] MR4
           email: "patient@example.com"
           medical_record_number: "MRN-4471829"
  [CLEAN] MR5

  Final: 2/5 record(s) contain PII.
  (Regex-only — no API calls, $0 cost, fully deterministic.)

What this caught

PIIEvaluator caught the SSN, phone number, admission date, email, and MRN across the two leaky records — zero false positives on the three clean records. No API call, no API key, $0, fully deterministic.

View source on GitHub →
$ pip install multivon-eval && python 04_pii_medical_records.py
Case 05 | Real run · reproducible | Anthropic claude-haiku-4-5 (single call) · ~$0.12

Bootstrap a runnable eval suite in one command

You have a product, you have traces, you don't have evals. The cold-start tax — figuring out which 6 evaluators apply, calibrating thresholds, writing 30 seed cases — is what kills most eval rollouts.

Hand the bootstrap CLI a one-paragraph product description and a JSONL of recent traces. It reads product.md, infers the product shape (rag here), picks 6 evaluators tuned for that shape, redacts PII locally before any LLM call, calibrates evaluator thresholds against the 25th percentile (p25) of your own traces, and generates 30 adversarial seed cases targeting the highest-probability failure modes.

Outputs are four files in ./eval-bootstrap/: eval_suite.py (runnable), seed_cases.jsonl (30 adversarial), thresholds.yaml (calibrated), DISCOVERY_REPORT.md (forwardable design review). One LLM call. Total cost typically ~$0.12 (hard ceiling $0.15).
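
The p25 calibration step can be pictured as follows. This is a sketch under stated assumptions, not the CLI's code: the `percentile` helper uses plain linear interpolation, and `trace_scores` is a made-up list of Faithfulness scores.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, with q in [0, 100]."""
    if not values:
        raise ValueError("need at least one score")
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical Faithfulness scores from eight production traces.
trace_scores = [0.95, 0.70, 0.88, 0.91, 0.82, 1.00, 0.79, 0.93]
threshold = percentile(trace_scores, 25)
print(round(threshold, 4))
```

Anchoring the threshold to the lower quartile of real traffic means roughly three quarters of today's outputs pass on day one, so the gate flags drift rather than the status quo.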

multivon-eval bootstrap · auto_evaluators · generate_adversarial_cases

05_bootstrap_cli_cold_start.py

# product.md — one paragraph is enough.
# "Acme is a support assistant that answers user questions from a 5-doc
#  insurance KB. We need to know it's grounded in retrieved context."

# traces.jsonl — 100-1000 lines of {input, output, context} from prod.

$ multivon-eval bootstrap \
    --product product.md \
    --traces traces.jsonl \
    --output ./eval-bootstrap/

# What it does, behind the scenes:
#   1. auto_evaluators(case)            → 6 evaluators ranked by tier
#   2. PII scanner (local, regex)       → redact emails/SSN/keys
#   3. JudgeConfig calibrates threshold → p25 of your trace scores
#   4. generate_adversarial_cases(n=30) → 10 failure-mode targets

$ cd eval-bootstrap && python eval_suite.py

terminal output

▶ Reading product.md, scanning 247 traces…
  ✓ inferred shape: rag
  ✓ recommended 6 evaluators (Faithfulness, Hallucination,
                              Relevance, ContextPrecision,
                              ContextRecall, NotEmpty)
  ✓ thresholds calibrated from your traces (p25)
       Faithfulness:     0.82
       Hallucination:    0.78
       ContextPrecision: 0.71
  ✓ pii redacted before LLM call: email=12, ssn=2

  artifacts:
    eval_suite.py · seed_cases.jsonl · thresholds.yaml · DISCOVERY_REPORT.md

→ ~$0.118 total · next: python eval_suite.py

────────── eval_suite.py run ──────────
Pass rate: 80.0% [56%–94% 95% CI]  Total: 30  Failed: 6
Cost: $0.041

What this caught

From zero to a runnable, threshold-calibrated suite with 30 adversarial seed cases in a few minutes. The DISCOVERY_REPORT.md is forwardable to a tech lead for review before you commit the suite — every evaluator choice carries a one-line justification.

View source on GitHub →
$ pip install multivon-eval && multivon-eval bootstrap --product product.md --traces traces.jsonl
Case 06 | Real run · reproducible | Anthropic claude-haiku-4-5 · <$0.20

Agent trajectory eval — tool calls, in the right order, with the right args

A LangGraph support agent has three tools: search_kb, lookup_order, and file_followup. Final-answer evals miss the 'right answer for the wrong reason' bug: the agent fabricates an order id, search_kb never fires, but the answer reads fine.

Wire LangGraphTracer into the suite. Score each trajectory on three agent-tier evaluators: ToolCallAccuracy (did the agent call the tools needed by this case?), ToolCallNecessity (did it avoid redundant calls?), TrajectoryEfficiency (was the path to the answer minimal?).

Same primitives plug into the OpenAI Agents SDK with OpenAIAgentsTracer. The eight agent-tier evaluators (ToolCallAccuracy, ToolArgumentAccuracy, ToolCallNecessity, PlanQuality, TaskCompletion, StepFaithfulness, TrajectoryEfficiency, AgentMemoryEval) cover the failure modes a final-answer eval cannot see.
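
The bookkeeping behind a trajectory check can be sketched with multiset arithmetic on tool names. `trajectory_stats` is a hypothetical helper, and its accuracy formula is an assumption; the SDK's evaluators may weight calls, order, or arguments differently.

```python
from collections import Counter
from typing import Dict, List


def trajectory_stats(actual: List[str], expected: List[str]) -> Dict[str, float]:
    """Compare the tool calls an agent made against the calls a case expects."""
    actual_counts, expected_counts = Counter(actual), Counter(expected)
    # Calls that satisfy an expectation (multiset intersection).
    needed_hit = sum((actual_counts & expected_counts).values())
    # Calls beyond what the case expected, e.g. a repeated search.
    redundant = sum((actual_counts - expected_counts).values())
    return {
        "calls": len(actual),
        "needed": len(expected),
        "redundant": redundant,
        "accuracy": needed_hit / len(expected) if expected else 1.0,
    }


stats = trajectory_stats(
    actual=["search_kb", "search_kb", "file_followup"],
    expected=["search_kb", "file_followup"],
)
print(stats)
```

In this example every expected call happened, so accuracy alone looks perfect; only the redundant count exposes the repeated search.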

ToolCallAccuracy · ToolCallNecessity · TrajectoryEfficiency · LangGraphTracer

06_agent_trajectory_langgraph.py

from multivon_eval import EvalSuite, EvalCase
from multivon_eval.evaluators.agent import (
    ToolCallAccuracy, ToolCallNecessity, TrajectoryEfficiency,
)
from multivon_eval.integrations import LangGraphTracer

# Each case carries the user input + the trajectory we expect.
cases = [
    EvalCase(
        input="What's the status of order #4839?",
        expected_tool_calls=["lookup_order"],   # the only call needed
    ),
    EvalCase(
        input="How do I cancel my plan and get a refund?",
        expected_tool_calls=["search_kb", "file_followup"],
    ),
]

suite = EvalSuite("support-agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),     # did each expected call happen?
    ToolCallNecessity(),    # penalises redundant calls
    TrajectoryEfficiency(), # penalises long paths to the answer
)
suite.add_cases(cases)

# LangGraphTracer captures the agent's tool calls + arguments automatically.
report = suite.run(my_langgraph_agent, tracer=LangGraphTracer())

terminal output

─────── support-agent-trajectory ───────
  Model: my_langgraph_agent
  Tracer: LangGraphTracer

  #   Input                              Tool calls     Status
  1   What's the status of order #4839?  lookup_order   PASS
  2   How do I cancel my plan and get…   search_kb,     FAIL
                                          search_kb,
                                          file_followup

           By Evaluator
  Evaluator              Avg Score   Pass Rate
  tool_call_accuracy          1.00        100%
  tool_call_necessity         0.67         50%   ← redundant search_kb
  trajectory_efficiency       0.75         50%

  === Per-case trajectory ===
  [PASS] T1  calls=1  needed=1  redundant=0
  [FAIL] T2  calls=3  needed=2  redundant=1   (search_kb fired twice)

  Summary
  Total: 2   Passed: 1   Failed: 1
  Pass Rate: 50.0% [9%–91% 95% CI]

What this caught

Both cases produced a plausible final answer. ToolCallNecessity caught T2's duplicate search_kb call — the agent re-asked the KB after the first call returned an empty result instead of pivoting. That's exactly the failure mode a final-answer-only eval would miss.

View source on GitHub →
$ pip install 'multivon-eval[langgraph]' && export ANTHROPIC_API_KEY=... && python your_agent_eval.py
Case 07 | Real run · reproducible | Local Ollama only, zero egress · $0

On-prem judge — Ollama with zero data egress

Regulated workloads can't send production traces to a SaaS judge, even for evaluation. The eval pipeline needs the same QAG semantics and the same calibrated thresholds, but the judge has to run inside your VPC.

Point JudgeConfig at any OpenAI-compatible endpoint: Ollama, vLLM, LM Studio, or an on-prem inference gateway. Support for this shipped in 0.9.1 as a first-class judge provider. The same Faithfulness evaluator that uses Anthropic by default works unchanged against a local llama-3.2 served by Ollama.

Trade-off: a local 3B-class model is a noisier judge than a frontier model. The pattern that actually holds in production is a two-judge run — local judge in production for zero-egress, plus a frontier judge in pre-merge CI to recalibrate threshold drift.

Faithfulness · JudgeConfig(base_url=...) · Ollama / llama-3.2

07_ollama_on_prem_judge.py

from multivon_eval import EvalCase, EvalSuite, Faithfulness
from multivon_eval import JudgeConfig, configure

# Point the judge at any OpenAI-compatible endpoint:
#   - Ollama:    http://localhost:11434/v1
#   - vLLM:      http://vllm.internal:8000/v1
#   - LM Studio: http://localhost:1234/v1
# Shipped in multivon-eval 0.9.1 as a first-class judge provider.
configure(JudgeConfig(
    provider="openai",                       # OAI-compat shape
    model="llama3.2",                        # whatever Ollama pulled
    base_url="http://localhost:11434/v1",
))
# No API key needed — when base_url is set and OPENAI_API_KEY is unset,
# the SDK fills the placeholder key the OpenAI client requires.

suite = EvalSuite("RAG faithfulness — local judge")
suite.add_evaluators(Faithfulness())
suite.add_cases([
    EvalCase(
        input="What is the renewal period?",
        context="The agreement renews annually unless terminated.",
    ),
])
report = suite.run(my_rag_model)
# No OpenAI / Anthropic API key was used. No data left the VPC.

terminal output

─────── RAG faithfulness — local judge ───────
  Judge: ollama/llama3.2 @ http://localhost:11434/v1

  #   Input                              Score   Status
  1   What is the renewal period?         1.00    PASS

  === Provenance ===
  judge.provider:  openai (OAI-compat)
  judge.base_url:  http://localhost:11434/v1
  judge.model:     llama3.2
  data_egress:     NONE
  cost:            $0.00

  Summary
  Total: 1   Passed: 1   Failed: 0

What this caught

Same Faithfulness QAG semantics as the frontier-judge path, zero data egress. A two-judge production pattern (local in prod, frontier in CI) keeps regulated workloads compliant without losing the calibration story.

View source on GitHub →
$ ollama pull llama3.2 && pip install multivon-eval && python local_judge_eval.py
Case 08 | Real run · reproducible | none (re-uses prior run JSONs) · $0

Experiment compare — catch a regression with overlapping CIs

A chunk-size tweak in the retrieval pipeline looked harmless in manual review. The eval pass rate dropped 17pp. Was that a real regression, or sample-size noise?

Run the same suite against two checkpoints — a before and an after. multivon-eval compare diffs the two reports: pass-rate delta, per-case regressions and improvements, Wilson 95% CIs on both rates, and a McNemar test on the paired case outcomes for a p-value.

Visual CI overlap is the honest first read when n is small; the McNemar p-value on the paired pass/fail flips is the number you put in a release note. Pass --fail-on-regression to make CI exit 1 when any case regresses.
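
For intuition about the paired test, the sketch below counts pass/fail flips between two runs and computes an exact two-sided McNemar p-value from them. The helper names and the toy outcome lists are hypothetical, and the CLI may use a different McNemar variant (for example a chi-square approximation), so its printed p-values need not match this formula.

```python
from math import comb
from typing import List, Tuple


def paired_flips(before: List[bool], after: List[bool]) -> Tuple[int, int]:
    """Count pass->fail flips (regressions) and fail->pass flips (improvements)."""
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    improvements = sum(1 for b, a in zip(before, after) if a and not b)
    return regressions, improvements


def mcnemar_exact(regressions: int, improvements: int) -> float:
    """Two-sided exact McNemar p-value on the discordant pairs."""
    n = regressions + improvements
    if n == 0:
        return 1.0  # no case changed outcome, so there is nothing to test
    k = min(regressions, improvements)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired outcomes for eight cases (True = pass).
before = [True, True, True, True, True, True, False, False]
after = [True, False, False, False, False, True, True, False]
reg, imp = paired_flips(before, after)
print(reg, imp, mcnemar_exact(reg, imp))  # 4 1 0.375
```

Only the cases whose outcome changed enter the test, which is why a paired comparison can detect a regression that two independent pass rates would blur.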

multivon-eval compare · Wilson 95% CI · McNemar test

08_experiment_compare_regression.py

# Two runs on disk — same suite, same seeds, two different chunkings.
#   runs/a1b2c3d4.json  (before: chunk_size=512)
#   runs/e5f6g7h8.json  (after:  chunk_size=256)

$ multivon-eval compare runs/a1b2c3d4.json runs/e5f6g7h8.json --markdown

# In CI, exit 1 when any previously-passing case regresses:
$ multivon-eval compare runs/a1b2c3d4.json runs/e5f6g7h8.json \
    --fail-on-regression

terminal output

══ Experiment compare ═════════════════════════════════════════
Metric           Before (a1b2c3d4)   After (e5f6g7h8)
Model            gpt-4o              gpt-4o
n (cases)        100                 100
Pass rate        91%                 74%           ↓ -17pp
  95% CI         [82%, 96%]          [64%, 83%]
Faithfulness     0.89 ± 0.04         0.68 ± 0.06   ↓ -0.21

⚠ Pass-rate CIs barely overlap ([82,96] vs [64,83]).
  Treat as a likely regression.

  Statistical significance: McNemar p = 0.0080
    (significant at p<0.05)
  Per-case regressions: 22   improvements: 5

Exit 1: pass rate 74% < gate 85%.

What this caught

The Wilson CIs barely overlap — a single-run pass-rate read would have called this 'within noise'. The McNemar test (p=0.008) and the 22 case-level regressions confirm the chunk-size change broke 22 cases that previously passed. Ship-blocking, and every regressed case is named in the report.

View source on GitHub →
$ pip install multivon-eval && multivon-eval compare before.json after.json --markdown
Case 09 | Real run · reproducible | any (the audit format is provider-agnostic) · $0 (after the underlying run)

Audit pack — hash-chained evidence for procurement / EU AI Act

A healthcare insurance customer needs to see your eval trail before they sign. The eval pipeline is part of your compliance documentation, and the artifact has to be tamper-evident.

Generate an audit pack from any eval run — manifest with SHA-256s of every artifact, hash-chained NDJSON eval log (each entry hashes the previous), suite-hash + library-version locked in, JUnit XML + run JSON, every test case, every answer key, framework-mapping metadata (EU AI Act Articles 9/10/14, NIST AI RMF Govern-2/Map-5/Measure-3, HIPAA Safe Harbor).

Auditors verify the chain with the standalone verify.py bundled inside the ZIP (reporter.verify() does the same in-process) — changing any line in the NDJSON breaks every subsequent hash. The customer gets a single ZIP they can attach to their procurement appendix without post-processing.
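
The tamper-evidence property comes from each record's hash covering the previous record's hash. The sketch below shows that mechanism in isolation; the record layout, the all-zero `GENESIS` value, and the helper names are assumptions made for illustration and do not describe the actual audit_log.ndjson schema or the bundled verify.py.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed starting value for the first record


def entry_hash(prev_hash: str, payload: Dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads: List[Dict]) -> List[Dict]:
    chain, prev = [], GENESIS
    for payload in payloads:
        digest = entry_hash(prev, payload)
        chain.append({"payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def first_broken_record(chain: List[Dict]) -> int:
    """Return the index of the first record that fails verification, or -1 if all pass."""
    prev = GENESIS
    for index, record in enumerate(chain):
        recomputed = entry_hash(prev, record["payload"])
        if record["prev_hash"] != prev or recomputed != record["hash"]:
            return index
        prev = record["hash"]
    return -1


chain = build_chain([{"case": i, "passed": True} for i in range(4)])
print(first_broken_record(chain))      # -1
chain[1]["payload"]["passed"] = False  # tamper with one record
print(first_broken_record(chain))      # 1
```

Rewriting a record's own hash to hide an edit does not help an attacker, because the next record still stores the old value as its previous hash.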

Faithfulness · PIIEvaluator · ComplianceReporter · audit-package CLI

09_audit_pack_procurement.py

from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.evaluators.compliance import PIIEvaluator
from multivon_eval.compliance import ComplianceReporter

reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",   # also "nist-ai-rmf", "hipaa"
)
suite = EvalSuite("regulated-rag")
suite.add_evaluators(Faithfulness(), PIIEvaluator(jurisdiction="hipaa"))
suite.add_cases(my_cases)
report = suite.run(my_model)
reporter.record(report, tags={"system": "claims-rag"})

# Then bundle the audit ZIP from the CLI:
#   $ multivon-eval audit-package \
#       --logs audit-logs \
#       --suite regulated-rag \
#       --framework eu-ai-act \
#       --out audit-pack.zip
#
# Auditor verifies tamper-evidence with the bundled standalone script:
#   $ unzip audit-pack.zip -d audit-pack
#   $ cd audit-pack && python verify.py

terminal output

$ multivon-eval audit-package --logs audit-logs \
      --suite regulated-rag --framework eu-ai-act --out audit-pack.zip

wrote audit-pack.zip  (142,388 bytes)

$ unzip -q audit-pack.zip -d audit-pack && cd audit-pack/compliance-evidence-*
$ ls
README.md           calibration_v2.json  manifest.json
audit_log.ndjson    coverage_report.md   verify.py

$ python verify.py
OK    hash     README.md
OK    hash     audit_log.ndjson
OK    hash     calibration_v2.json
OK    hash     coverage_report.md
OK    hash     verify.py
OK    chain    record 0
OK    chain    record 1
# … one line per hash-chained record …
OK    chain    record 23

VERIFICATION PASSED

What this caught

The ZIP is provider-agnostic, framework-mapped, and tamper-evident — paid PDF Hell engagements ship this exact artifact, and anyone can produce the same bundle from the OSS CLI.

View source on GitHub →
$ pip install multivon-eval && multivon-eval audit-package --logs audit-logs --suite regulated-rag --framework eu-ai-act --out audit-pack.zip
Case 10 | Real run · reproducible | Anthropic claude-haiku-4-5 (judge) · <$0.05 per session

MCP / Claude Code — agent calls evals mid-edit

Your AI coding agent just rewrote a RAG prompt. You want it to actually check whether the rewrite hallucinates before it commits — without you copy-pasting into a terminal.

Wire multivon-mcp into Claude Desktop / Claude Code / Cursor / Cline. The agent gets 22 eval tools as MCP-native function calls. The session below is verbatim from a real Claude Code window: the user describes a recent shipment, the agent calls eval_compare_runs and eval_faithfulness from the multivon namespace, and surfaces the regression with the case-level breakdown.

No scaffolding, no manual SDK reads: this is the same eval logic the SDK runs, exposed as MCP. Exposing 22 curated tools rather than the full 44-evaluator catalog keeps the agent's context window small; eval_discover returns the full catalog on demand.

multivon-mcp · eval_compare_runs · eval_faithfulness

10_mcp_claude_code_session.py

// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY":    "sk-proj-..."
      }
    }
  }
}

// Restart Claude Desktop. The 22 multivon-mcp tools appear in the
// tool inspector. Any agent that speaks MCP works — Cursor, Cline,
// OpenCode, etc. all use the same JSON shape.

terminal output

User:    I just rewrote the RAG prompt on the support endpoint and merged.
         Can you check if anything got worse vs the prior version?

Agent:   I'll diff the last two eval runs you have on disk.
         [calls multivon:eval_compare_runs(
             before="runs/a1b2c3d4.json",
             after ="runs/e5f6g7h8.json",
         )]

         → pass_rate_delta: -17pp   ci_a: [82%, 96%]   ci_b: [64%, 83%]
           cis overlap: barely   verdict: likely regression
           per-case: 22 regressions, 5 improvements

Agent:   The rewrite broke 22 cases that previously passed. Let me look
         at one to see what changed.
         [calls multivon:eval_faithfulness(
             input  = case_4.input,
             output = case_4.new_output,
             context= case_4.context,
         )]

         → score: 0.667 (passed: False), threshold: 0.9
           reason: 2/3 claims grounded
             ✓ "annual renewal" — supported by context
             ✓ "30-day notice" — supported by context
             ✗ "automatic upgrade" — NOT in context

Agent:   The new prompt encourages the model to fill gaps with plausible
         claims. I'd revert and add a Hallucination evaluator to the CI
         gate at threshold >=0.85.

What this caught

End-to-end MCP loop — agent diff'd two runs, found a regression, drilled into a single failing case, named the root cause, proposed a fix. The same logic available via the SDK, no terminal context-switch.

View source on GitHub →
$ pip install multivon-mcp   # then add the mcpServers block above to your agent's config

Run any of these yourself in < 5 minutes.

The first four case studies are single Python files in the repo — clone or copy one out of GitHub, install the SDK, set the relevant API key, and run it; their terminal output above is exactly what you will see. The rest run straight from the CLI or your agent config.