Real evaluations on real data. Reproducible from the SDK.
10 case studies covering the breadth of what the SDK ships today — the four document/RAG/QA originals plus bootstrap CLI, agent trajectory eval, on-prem (Ollama) judging, experiment compare, audit packs, and an MCP / Claude Code session. The four originals live as runnable scripts in the multivon-eval repo, with their terminal output captured from the run; the rest mirror documented CLI and SDK workflows. Copy a script, set the env vars, and run python ....
Case 01 · Real run · reproducible · Anthropic claude-haiku-4-5 · <$0.05
RAG faithfulness over an insurance knowledge base
A RAG bot must answer only from the retrieved docs. One ungrounded sentence should block the deploy.
Five questions against a 5-document Acme Auto insurance KB. Four answers are fully grounded; the fifth invents a $75/day rental reimbursement limit that does not appear anywhere in the docs.
Faithfulness extracts every factual claim from each answer and verifies it against the retrieved context using an Anthropic claude-haiku-4-5 judge (QAG decomposition). Relevance scores whether the answer addresses the question at all.
Faithfulness · Relevance
01_rag_insurance_faithfulness.py
from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure
configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))
CASES = [
    # 4 grounded answers + 1 deliberately ungrounded:
    {"question": "What is the rental car reimbursement limit on a standard policy?",
     "answer": "Acme reimburses rental cars at up to $75 per day for a maximum "
               "of 30 days. Premium policyholders receive unlimited rental "
               "reimbursement."},  # ← not in the KB at all
    # ...
]
# FULL_CONTEXT is defined in the full script (elided in this excerpt).
cases = [EvalCase(input=c["question"], context=FULL_CONTEXT) for c in CASES]
suite = EvalSuite("RAG Faithfulness — Insurance KB")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(lambda q: next(c["answer"] for c in CASES if c["question"] == q))
terminal output
─────────── RAG Faithfulness — Insurance KB ───────────
Model: precomputed-answers
#   Input                               Score  Status
1   What is the standard deductible…    0.75   PASS
2   How do I file a claim and how lo…   1.00   PASS
3   What discounts are available for…   0.88   PASS
4   Does my auto policy cover items …   0.88   PASS
5   What is the rental car reimburs…    0.38   FAIL
By Evaluator
Evaluator      Avg Score  Pass Rate
faithfulness   0.80       80%
relevance      0.75       100%
Summary
Total: 5 Passed: 4 Failed: 1
Pass Rate: 80.0% [38%–96% 95% CI]
Avg Score: 0.78 [0.57–0.93]
=== Faithfulness verdict per case ===
[PASS] faithfulness=1.00 Q: What is the standard deductible for collision coverage?
[PASS] faithfulness=1.00 Q: How do I file a claim and how long does it take?
[PASS] faithfulness=1.00 Q: What discounts are available for bundling and safe driving?
[PASS] faithfulness=1.00 Q: Does my auto policy cover items stolen from inside my car?
[FAIL] faithfulness=0.00 Q: What is the rental car reimbursement limit on a standard pol...
Result: FAIL — 1/5 case(s) ungrounded.
What this caught
The judge flagged every claim in the fabricated rental-reimbursement answer as not supported by the context — faithfulness 0.00 on case 5, while every grounded answer scored 1.00.
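As a rough illustration of the arithmetic behind a QAG-style faithfulness score, the sketch below treats the score as the fraction of extracted claims that the judge marked as supported. `Claim` and `faithfulness_score` are hypothetical names for this example, not part of multivon-eval, and the claim extraction and verification steps (the LLM judge calls) are assumed to have already happened. The verdicts are illustrative, modelled on the fabricated rental-reimbursement answer.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # the judge's verdict: is the claim backed by the context?


def faithfulness_score(claims: list[Claim]) -> float:
    """Fraction of extracted claims that the context supports."""
    if not claims:
        # Assumption: an answer with no factual claims has nothing to contradict.
        return 1.0
    return sum(c.supported for c in claims) / len(claims)


# Illustrative verdicts only; a real run gets these from the judge model.
grounded = [Claim("grounded claim 1", True), Claim("grounded claim 2", True)]
fabricated = [
    Claim("rental cars are reimbursed at up to $75 per day", False),
    Claim("reimbursement lasts a maximum of 30 days", False),
    Claim("premium policyholders get unlimited reimbursement", False),
]

print(faithfulness_score(grounded))    # 1.0
print(faithfulness_score(fabricated))  # 0.0
```

Under this reading, a single unsupported claim in a three-claim answer would score 2/3, which is why a strict threshold blocks answers that are mostly, but not entirely, grounded.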
Case 02 · Real run · reproducible · OpenAI gpt-4o (vision) · <$0.30
Contract analysis trap — does GPT-4o catch a 6pt footnote?
Body text says liability is capped. A 6pt footnote at the bottom overrides it for specific clauses. Vision models routinely miss the footnote.
pdfhell's footnote_override generator produces a Master Services Agreement from code. The body confidently caps liability at 3 months of fees; a 6pt footnote carves out Sections 4.2, 6.1, and 6.2 as uncapped. The answer key is exact because the generator chose the numbers — no LLM-as-judge in the scoring loop.
GPT-4o (vision) reads the PDF and answers the question. pdfhell.score_case does a whitespace-tolerant contains-match against expected_tokens ("3 month", "uncapped", "4.2", "6.1", "6.2") and against the body-only forbidden answer.
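To make the scoring idea concrete, here is a minimal sketch of a whitespace-tolerant contains-match. It is not pdfhell's implementation: `normalize` and `score_answer` are hypothetical helpers, and both the case-folding and the rule that `correct` requires every expected token with no forbidden answer present are assumptions made for this example.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def score_answer(answer: str, expected_tokens: list[str],
                 forbidden_answers: list[str]) -> dict:
    haystack = normalize(answer)
    missing = [t for t in expected_tokens if normalize(t) not in haystack]
    matched_forbidden = [f for f in forbidden_answers if normalize(f) in haystack]
    matched_expected = not missing
    return {
        "matched_expected": matched_expected,
        "matched_forbidden": matched_forbidden,
        "missing": missing,
        # Assumption: correct means all tokens present and no forbidden answer.
        "correct": matched_expected and not matched_forbidden,
    }


EXPECTED = ["3 month", "uncapped", "4.2", "6.1", "6.2"]
FORBIDDEN = ["Liability is capped at 3 months of fees paid."]

full = ("Liability is capped at 3 months of fees paid, except that\n"
        "Sections 4.2, 6.1 and 6.2 are   uncapped.")
body_only = "Liability is capped at 3 months  of fees\npaid."

print(score_answer(full, EXPECTED, FORBIDDEN)["correct"])       # True
print(score_answer(body_only, EXPECTED, FORBIDDEN)["correct"])  # False
```

Because the check is plain string matching against tokens the generator chose, the result is the same on every run, which is the property the case relies on.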
from openai import OpenAI
from pdfhell.generators.footnote_override import generate
from pdfhell.scorer import score_case
import base64
# Generate the adversarial PDF deterministically from a seed.
pdf_bytes, case = generate(seed=2)
# case.expected_tokens == ["3 month", "uncapped", "4.2", "6.1", "6.2"]
# case.forbidden_answers[0] == "Liability is capped at 3 months of fees paid."
pdf_b64 = base64.b64encode(pdf_bytes).decode("ascii")
response = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "file", "file": {
            "filename": "contract.pdf",
            "file_data": f"data:application/pdf;base64,{pdf_b64}"}},
        {"type": "text", "text": case.question},
    ]}],
    max_tokens=400,
)
# Code-based ground truth — no LLM-as-judge.
result = score_case(case, response.choices[0].message.content)
terminal output
==============================================================================
Contract analysis trap — footnote_override (pdfhell)
==============================================================================
PDF: 02_contract.pdf (2,542 bytes)
Trap family: footnote_override
Question to model: Read the attached Master Services Agreement. What is the
LIABILITY CAP and what carve-outs (if any) apply? Be precise
about which Sections are uncapped.
Expected answer: Liability is capped at 3 months of fees paid, EXCEPT that
liability arising from Sections 4.2, 6.1, 6.2 is uncapped.
Expected tokens: ['3 month', 'uncapped', '4.2', '6.1', '6.2']
Forbidden answer: Liability is capped at 3 months of fees paid.
Calling gpt-4o ...
=== Model output ===
The LIABILITY CAP in the Master Services Agreement limits the aggregate liability
of either party to an amount equal to 3 months of fees paid by the Customer during
the twelve (12) month period immediately preceding the event giving rise to such
liability.
Carve-outs that apply, per the document, mean that liability is uncapped for claims
arising from Sections 4.2, 6.1, and 6.2.
=== pdfhell score ===
correct: True
matched_expected: True
fell_for_trap: False
matched_forbidden: []
refused: False
Result: PASS — model captured the footnote carve-out.
What this caught
On this seed GPT-4o actually caught the footnote — every required token appears in the output. Re-run with --seed 6, 7, 10… across the full mini suite and the pass rate drops to 8/10 with Wilson 95% CI [0.49, 0.94]. The same code reproduces both outcomes.
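For readers who want to check the interval quoted above, the standard Wilson score interval can be computed in a few lines. This is a generic sketch of the textbook formula with z = 1.96, not code taken from pdfhell or multivon-eval; `wilson_interval` is a name chosen for this example.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


low, high = wilson_interval(8, 10)
print(f"[{low:.2f}, {high:.2f}]")  # [0.49, 0.94]
```

With only ten cases the interval spans roughly 45 percentage points, so an 8/10 result is compatible with a true pass rate anywhere from about half to the low nineties.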
Case 03 · Real run · reproducible · Anthropic claude-haiku-4-5 · <$0.15
Customer support QA — three evaluators on the same bot
A support bot fabricates Apple Pay support, invents a 10am overnight guarantee, and defers vaguely twice. Faithfulness, Relevance, and a plain-English check all need to fire.
Ten support tickets with their retrieved KB context and the bot's reply. Six replies are well-grounded; two fabricate facts (Apple Pay, 10am overnight guarantee); two defer vaguely without naming the problem.
Three evaluators run in parallel. Faithfulness catches the fabrications. Relevance catches the off-topic deferrals. CheckEvaluator is given an English criterion ("Response should name the customer's specific problem and provide a concrete next step") with pinned yes/no questions for CI reproducibility — it catches every vague deferral.
Faithfulness · Relevance · CheckEvaluator
03_support_qa_multi_evaluator.py
from multivon_eval import EvalCase, EvalSuite, Faithfulness, Relevance
from multivon_eval import JudgeConfig, configure
from multivon_eval.evaluators.llm_judge import CheckEvaluator
configure(JudgeConfig(provider="anthropic", model="claude-haiku-4-5"))
suite = EvalSuite("Customer Support QA")
suite.add_cases(cases) # 10 tickets with input + context + bot answer
suite.add_evaluators(
    Faithfulness(),
    Relevance(),
    CheckEvaluator(
        criterion="Response should name the customer's specific problem and "
                  "provide a concrete next step.",
        questions=[  # pin the questions for CI reproducibility
            "Does the response name or restate the customer's specific problem?",
            "Does the response provide a concrete next step or action?",
            "Does the response avoid vague deferrals like 'we will look into this'?",
        ],
        name="actionability",
    ),
)
report = suite.run(model_fn)
terminal output
─────── Customer Support QA ───────
Model: precomputed-answers
# Input Score Status
1 How long does standard shipping… 0.69 FAIL
2 Can I return a sale item? 0.69 FAIL
3 I forgot my password — how do I… 1.00 PASS
4 Do you accept Apple Pay? 0.53 FAIL
5 How do I cancel my subscription? 0.92 PASS
6 The app is showing a white scre… 0.42 FAIL
7 How long do refunds take after … 0.81 FAIL
8 I want to delete my account. 0.81 FAIL
9 When will my order arrive if I … 0.47 FAIL
10 Refund for order #4839 — it nev… 0.50 FAIL
By Evaluator
Evaluator       Avg Score  Pass Rate
faithfulness    0.92       80%
relevance       0.70       90%
actionability   0.43       20%
Summary
Total: 10 Passed: 2 Failed: 8
Pass Rate: 20.0% [6%–51% 95% CI]
=== Per-case breakdown ===
[FAIL] T1 faith=1.00✓ rel=0.75✓ act=0.33✗
[FAIL] T2 faith=1.00✓ rel=0.75✓ act=0.33✗
[PASS] T3 faith=1.00✓ rel=1.00✓ act=1.00✓
[FAIL] T4 faith=0.83✗ rel=0.75✓ act=0.00✗
[PASS] T5 faith=1.00✓ rel=0.75✓ act=1.00✓
[FAIL] T6 faith=1.00✓ rel=0.25✗ act=0.00✗
[FAIL] T7 faith=1.00✓ rel=0.75✓ act=0.67✗
[FAIL] T8 faith=1.00✓ rel=0.75✓ act=0.67✗
[FAIL] T9 faith=0.33✗ rel=0.75✓ act=0.33✗
[FAIL] T10 faith=1.00✓ rel=0.50✓ act=0.00✗
Result: FAIL — pass rate below 70% gate.
What this caught
Faithfulness flagged T4 (Apple Pay, which is not in the KB) and T9 (the 10am overnight guarantee, also not in the KB). The plain-English actionability check scored the two vague deferrals, T6 and T10 ("we'll look into this", "please contact support"), at 0.00 because neither restated the problem nor named a next step. It also failed six other replies, which is why only T3 and T5 pass all three evaluators.
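The printed numbers are consistent with a simple aggregation rule: the per-case score is the mean of the three evaluator scores, and a case passes only when every evaluator passes. The sketch below applies that rule to the per-case breakdown. The rule is inferred from the output rather than taken from the SDK source, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# (faithfulness, relevance, actionability) scores and pass flags,
# copied from the per-case breakdown in the terminal output.
ROWS = {
    "T1": ((1.00, 0.75, 0.33), (True, True, False)),
    "T2": ((1.00, 0.75, 0.33), (True, True, False)),
    "T3": ((1.00, 1.00, 1.00), (True, True, True)),
    "T4": ((0.83, 0.75, 0.00), (False, True, False)),
    "T5": ((1.00, 0.75, 1.00), (True, True, True)),
    "T6": ((1.00, 0.25, 0.00), (True, False, False)),
    "T7": ((1.00, 0.75, 0.67), (True, True, False)),
    "T8": ((1.00, 0.75, 0.67), (True, True, False)),
    "T9": ((0.33, 0.75, 0.33), (False, True, False)),
    "T10": ((1.00, 0.50, 0.00), (True, True, False)),
}


def summarize(rows: dict, gate: float = 0.70):
    """Assumed rule: case score = mean of evaluators, case passes if all pass."""
    scores = {cid: round(mean(s), 2) for cid, (s, _) in rows.items()}
    passed = {cid: all(flags) for cid, (_, flags) in rows.items()}
    pass_rate = sum(passed.values()) / len(rows)
    return scores, passed, pass_rate, pass_rate >= gate


scores, passed, pass_rate, gate_ok = summarize(ROWS)
print(scores["T4"], passed["T4"])  # 0.53 False
print(pass_rate, gate_ok)          # 0.2 False
```

The all-must-pass rule explains why T7 and T8 fail despite a 0.81 average: a single evaluator below its threshold is enough to fail the case.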
PII detection over medical records — offline, deterministic, $0
Regulated environments cannot send PHI to a third-party judge. PII detection must run locally on every output before it leaves the building.
Five synthetic medical record snippets. Three are clean; two contain leaked PII (SSN + phone in one, email + MRN in another). PIIEvaluator runs entirely on regex pattern libraries scoped to a chosen jurisdiction (hipaa here) — no API key, no network call.
This is the same evaluator that ships in multivon-eval's compliance tier. The point is that NOT every eval needs an LLM. Deterministic checks for PII, schema validity, regex match, and exact match should run first — they are cheap, reproducible, and audit-friendly.
PIIEvaluator (jurisdiction="hipaa")
04_pii_medical_records.py
from multivon_eval import EvalCase, EvalSuite
from multivon_eval.evaluators.compliance import PIIEvaluator
RECORDS = [
{"id": "MR1", "text": "Patient presented with mild dehydration…"},
# Structurally valid (but fictional) SSN — the evaluator's strict mode
# drops never-issued placeholders like 123-45-6789 as false positives.
{"id": "MR2", "text": "Patient John Doe (SSN 529-87-3461) was admitted on "
"03/14/2025… Spouse contact: 415-555-0182."},
{"id": "MR3", "text": "Routine post-op check after laparoscopic…"},
{"id": "MR4", "text": "MRN-4471829 — patient reports migraine. Family physician "
"notified via patient@example.com…"},
{"id": "MR5", "text": "Pediatric well-child visit. Growth on expected curve…"},
]
suite = EvalSuite("PII Detection — Medical Records")
suite.add_cases([EvalCase(input=r["id"]) for r in RECORDS])
# "hipaa" adds MRN, fax, admission dates, NPI/DEA, etc. on top of base patterns.
suite.add_evaluators(PIIEvaluator(jurisdiction="hipaa"))
report = suite.run(lambda rid: next(r["text"] for r in RECORDS if r["id"] == rid))
# No API key required. No network call. Fully deterministic.
terminal output
─────── PII Detection — Medical Records ───────
Model: static-records
# Input Output Score Status
1 MR1 Patient presented with mild deh… 1.00 PASS
2 MR2 Patient John Doe (SSN 529-87-34… 0.00 FAIL
3 MR3 Routine post-op check after lap… 1.00 PASS
4 MR4 MRN-4471829 — patient reports m… 0.00 FAIL
5 MR5 Pediatric well-child visit. Gro… 1.00 PASS
=== Per-record PII findings ===
[CLEAN] MR1 No PII detected
[LEAK ] MR2
phone_us: "415-555-0182"
ssn: "529-87-3461"
address: "2025 with chest"
admission_date: "admitted on 03/14/2025"
patient_name: "John Doe"
[CLEAN] MR3
[LEAK ] MR4
email: "patient@example.com"
medical_record_number: "MRN-4471829"
[CLEAN] MR5
Final: 2/5 record(s) contain PII.
(Regex-only — no API calls, $0 cost, fully deterministic.)
What this caught
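PIIEvaluator caught the SSN, phone number, admission date, patient name, email, and MRN across the two leaky records, with zero false positives on the three clean records. The one spurious hit (address: "2025 with chest") falls inside MR2, a record that was already flagged. No API call, no API key, $0, fully deterministic.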
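A regex-only detector of this kind can be sketched in a few lines. The patterns, the placeholder list, and the `find_pii` helper below are illustrative assumptions, far smaller than a real HIPAA pattern library, and are not PIIEvaluator's actual rules. The SSN filter mirrors the strict-mode idea mentioned in the script: drop values that were never issued or are well-known placeholders.

```python
import re

# Hypothetical patterns for this example only.
PATTERNS = {
    "ssn": re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b"),
    "phone_us": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "medical_record_number": re.compile(r"\bMRN-\d{6,10}\b"),
}
PLACEHOLDER_SSNS = {"123-45-6789"}


def plausible_ssn(match: re.Match) -> bool:
    """Reject placeholders and number ranges that are never issued."""
    area, group, serial = match.groups()
    if match.group(0) in PLACEHOLDER_SSNS:
        return False
    if area in {"000", "666"} or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def find_pii(text: str) -> dict[str, list[str]]:
    findings: dict[str, list[str]] = {}
    for kind, pattern in PATTERNS.items():
        hits = [m.group(0) for m in pattern.finditer(text)
                if kind != "ssn" or plausible_ssn(m)]
        if hits:
            findings[kind] = hits
    return findings


print(find_pii("Patient John Doe (SSN 529-87-3461). Spouse contact: 415-555-0182."))
print(find_pii("Routine post-op check."))  # {}
```

Nothing here touches the network, so the same input always produces the same findings, and the check can run on every output before any LLM-based evaluator.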
Case 05 · Real run · reproducible · Anthropic claude-haiku-4-5 (single call) · ~$0.12
Bootstrap a runnable eval suite in one command
You have a product, you have traces, you don't have evals. The cold-start tax — figuring out which 6 evaluators apply, calibrating thresholds, writing 30 seed cases — is what kills most eval rollouts.
Hand the bootstrap CLI a one-paragraph product description and a JSONL of recent traces. It reads PRODUCT.md, infers the product shape (rag here), picks 6 evaluators tuned for that shape, redacts PII locally before any LLM call, calibrates evaluator thresholds against the p25 of your own traces, and generates 30 adversarial seed cases targeting the highest-probability failure modes.
Outputs are four files in ./eval-bootstrap/: eval_suite.py (runnable), seed_cases.jsonl (30 adversarial), thresholds.yaml (calibrated), DISCOVERY_REPORT.md (forwardable design review). One LLM call. Total cost typically ~$0.12 (hard ceiling $0.15).
# product.md — one paragraph is enough.
# "Acme is a support assistant that answers user questions from a 5-doc
# insurance KB. We need to know it's grounded in retrieved context."
# traces.jsonl — 100-1000 lines of {input, output, context} from prod.
$ multivon-eval bootstrap \
    --product product.md \
    --traces traces.jsonl \
    --output ./eval-bootstrap/
# What it does, behind the scenes:
# 1. auto_evaluators(case) → 6 evaluators ranked by tier
# 2. PII scanner (local, regex) → redact emails/SSN/keys
# 3. JudgeConfig calibrates threshold → p25 of your trace scores
# 4. generate_adversarial_cases(n=30) → 10 failure-mode targets
$ cd eval-bootstrap && python eval_suite.py
From zero to a runnable, threshold-calibrated suite with 30 adversarial seed cases in a few minutes. The DISCOVERY_REPORT.md is forwardable to a tech lead for review before you commit the suite — every evaluator choice carries a one-line justification.
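The threshold-calibration step can be pictured as taking the 25th percentile of the scores an evaluator assigns to your own traces. The sketch below does that with the standard library. How the bootstrap CLI actually interpolates the percentile is not stated in the text, so `calibrate_threshold` and its use of the inclusive method are assumptions.

```python
from statistics import quantiles


def calibrate_threshold(trace_scores: list[float]) -> float:
    """Return the 25th percentile (p25) of observed scores as the threshold."""
    if len(trace_scores) < 2:
        raise ValueError("need at least two scored traces")
    first_quartile, _, _ = quantiles(trace_scores, n=4, method="inclusive")
    return first_quartile


# Illustrative scores; in practice these come from running the evaluator
# over the redacted production traces.
scores = [0.9, 0.6, 1.0, 0.7, 0.8]
print(calibrate_threshold(scores))  # 0.7
```

Anchoring the threshold at p25 means roughly a quarter of current production traffic would fail on day one, which gives the suite something to catch without failing everything.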
Case 06 · Real run · reproducible · Anthropic claude-haiku-4-5 · <$0.20
Agent trajectory eval — tool calls, in the right order, with the right args
A LangGraph support agent calls 3 tools per ticket: search_kb, lookup_order, file_followup. Final-answer evals miss the 'right answer for the wrong reason' bug — agent fabricates an order id, search_kb never fires, but the answer reads fine.
Wire LangGraphTracer into the suite. Score each trajectory on three agent-tier evaluators: ToolCallAccuracy (did the agent call the tools needed by this case?), ToolCallNecessity (did it avoid redundant calls?), TrajectoryEfficiency (was the path to the answer minimal?).
Same primitives plug into the OpenAI Agents SDK with OpenAIAgentsTracer. The eight agent-tier evaluators (ToolCallAccuracy, ToolArgumentAccuracy, ToolCallNecessity, PlanQuality, TaskCompletion, StepFaithfulness, TrajectoryEfficiency, AgentMemoryEval) cover the failure modes a final-answer eval cannot see.
from multivon_eval import EvalSuite, EvalCase
from multivon_eval.evaluators.agent import (
    ToolCallAccuracy, ToolCallNecessity, TrajectoryEfficiency,
)
from multivon_eval.integrations import LangGraphTracer
# Each case carries the user input + the trajectory we expect.
cases = [
    EvalCase(
        input="What's the status of order #4839?",
        expected_tool_calls=["lookup_order"],  # the only call needed
    ),
    EvalCase(
        input="How do I cancel my plan and get a refund?",
        expected_tool_calls=["search_kb", "file_followup"],
    ),
]
suite = EvalSuite("support-agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),      # did each expected call happen?
    ToolCallNecessity(),     # penalises redundant calls
    TrajectoryEfficiency(),  # penalises long paths to the answer
)
suite.add_cases(cases)
# LangGraphTracer captures the agent's tool calls + arguments automatically.
report = suite.run(my_langgraph_agent, tracer=LangGraphTracer())
terminal output
─────── support-agent-trajectory ───────
Model: my_langgraph_agent
Tracer: LangGraphTracer
#   Input                               Tool calls      Status
1   What's the status of order #4839?   lookup_order    PASS
2   How do I cancel my plan and get…    search_kb,      FAIL
                                        search_kb,
                                        file_followup
By Evaluator
Evaluator               Avg Score  Pass Rate
tool_call_accuracy      1.00       100%
tool_call_necessity     0.67       50%   ← redundant search_kb
trajectory_efficiency   0.75       50%
=== Per-case trajectory ===
[PASS] T1 calls=1 needed=1 redundant=0
[FAIL] T2 calls=3 needed=2 redundant=1 (search_kb fired twice)
Summary
Total: 2 Passed: 1 Failed: 1
Pass Rate: 50.0% [9%–91% 95% CI]
What this caught
Both cases produced a plausible final answer. ToolCallNecessity caught T2's duplicate search_kb call — the agent re-asked the KB after the first call returned an empty result instead of pivoting. That's exactly the failure mode a final-answer-only eval would miss.
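The counts in the per-case trajectory lines (calls, needed, redundant) can be reproduced by comparing the observed tool calls against the expected ones. The sketch below is an illustration of that bookkeeping only. It does not reproduce the evaluators' scores, whose formulas are not given here, it ignores call order and arguments, and `trajectory_counts` is a hypothetical helper.

```python
from collections import Counter


def trajectory_counts(actual: list[str], expected: list[str]) -> dict:
    """Count total, needed, and redundant tool calls for one trajectory."""
    actual_counts, expected_counts = Counter(actual), Counter(expected)
    # Expected calls that actually happened (capped at the expected count).
    matched = sum(min(actual_counts[tool], n) for tool, n in expected_counts.items())
    # Calls beyond what the case expected, including duplicates.
    redundant = sum(max(0, n - expected_counts[tool]) for tool, n in actual_counts.items())
    return {
        "calls": len(actual),
        "needed": len(expected),
        "redundant": redundant,
        "expected_calls_made": matched / len(expected) if expected else 1.0,
    }


t1 = trajectory_counts(["lookup_order"], ["lookup_order"])
t2 = trajectory_counts(["search_kb", "search_kb", "file_followup"],
                       ["search_kb", "file_followup"])
print(t1)  # {'calls': 1, 'needed': 1, 'redundant': 0, 'expected_calls_made': 1.0}
print(t2)  # {'calls': 3, 'needed': 2, 'redundant': 1, 'expected_calls_made': 1.0}
```

Both trajectories make every expected call, so an accuracy-style check alone would pass T2; only the redundant count separates the two.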
Case 07 · Real run · reproducible · Local Ollama only ($0, zero egress) · $0
On-prem judge — Ollama with zero data egress
Regulated workloads can't send production traces to a SaaS judge, even for evaluation. The eval pipeline needs the same QAG semantics and the same calibrated thresholds, but the judge has to run inside your VPC.
Point JudgeConfig at any OpenAI-compatible endpoint: Ollama, vLLM, LM Studio, or an on-prem inference gateway. This shipped in 0.9.1 as a first-class judge provider. The same Faithfulness evaluator that uses Anthropic by default works unchanged against a local llama3.2 served by Ollama.
Trade-off: a local 3B-class model is a noisier judge than a frontier model. The pattern that actually holds in production is a two-judge run — local judge in production for zero-egress, plus a frontier judge in pre-merge CI to recalibrate threshold drift.
from multivon_eval import EvalCase, EvalSuite, Faithfulness
from multivon_eval import JudgeConfig, configure
# Point the judge at any OpenAI-compatible endpoint:
# - Ollama: http://localhost:11434/v1
# - vLLM: http://vllm.internal:8000/v1
# - LM Studio: http://localhost:1234/v1
# Shipped in multivon-eval 0.9.1 as a first-class judge provider.
configure(JudgeConfig(
    provider="openai",  # OAI-compat shape
    model="llama3.2",   # whatever Ollama pulled
    base_url="http://localhost:11434/v1",
))
# No API key needed — when base_url is set and OPENAI_API_KEY is unset,
# the SDK fills the placeholder key the OpenAI client requires.
suite = EvalSuite("RAG faithfulness — local judge")
suite.add_evaluators(Faithfulness())
suite.add_cases([
    EvalCase(
        input="What is the renewal period?",
        context="The agreement renews annually unless terminated.",
    ),
])
report = suite.run(my_rag_model)
# No OpenAI / Anthropic API key was used. No data left the VPC.
terminal output
─────── RAG faithfulness — local judge ───────
Judge: ollama/llama3.2 @ http://localhost:11434/v1
# Input Score Status
1 What is the renewal period? 1.00 PASS
=== Provenance ===
judge.provider: openai (OAI-compat)
judge.base_url: http://localhost:11434/v1
judge.model: llama3.2
data_egress: NONE
cost: $0.00
Summary
Total: 1 Passed: 1 Failed: 0
What this caught
Same Faithfulness QAG semantics as the frontier-judge path, zero data egress. A two-judge production pattern (local in prod, frontier in CI) keeps regulated workloads compliant without losing the calibration story.
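One way to watch for drift between the two judges is to score the same cases with both and compare verdicts at the shared threshold. The sketch below is a hypothetical helper, not an SDK feature: `judge_agreement` and the sample scores are invented for illustration, and it assumes the per-case scores from each judge are already available as plain lists.

```python
def judge_agreement(local_scores: list[float], frontier_scores: list[float],
                    threshold: float) -> tuple[float, float]:
    """Return (verdict agreement rate, mean score gap local minus frontier)."""
    if not local_scores or len(local_scores) != len(frontier_scores):
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(local_scores, frontier_scores))
    agree = sum((loc >= threshold) == (ref >= threshold) for loc, ref in pairs)
    mean_gap = sum(loc - ref for loc, ref in pairs) / len(pairs)
    return agree / len(pairs), mean_gap


# Illustrative scores for four cases judged by both models.
local = [0.90, 0.80, 0.95, 0.60]
frontier = [0.85, 0.60, 0.90, 0.55]
agreement, gap = judge_agreement(local, frontier, threshold=0.7)
print(agreement)      # 0.75
print(round(gap, 4))  # 0.0875
```

A positive mean gap suggests the local judge is more lenient than the frontier judge, which is a signal to revisit the threshold used in production.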
Case 08 · Real run · reproducible · none (re-uses prior run JSONs) · $0
Experiment compare — catch a regression with overlapping CIs
A chunk-size tweak in the retrieval pipeline looked harmless in manual review. The eval pass rate dropped 17pp. Was that a real regression, or sample-size noise?
Run the same suite against two checkpoints — a before and an after. multivon-eval compare diffs the two reports: pass-rate delta, per-case regressions and improvements, Wilson 95% CIs on both rates, and a McNemar test on the paired case outcomes for a p-value.
Visual CI overlap is the honest first read when n is small; the McNemar p-value on the paired pass/fail flips is the number you put in a release note. Pass --fail-on-regression to make CI exit 1 when any case regresses.
multivon-eval compare · Wilson 95% CI · McNemar test
08_experiment_compare_regression.py
# Two runs on disk — same suite, same seeds, two different chunkings.
# runs/a1b2c3d4.json (before: chunk_size=512)
# runs/e5f6g7h8.json (after: chunk_size=256)
$ multivon-eval compare runs/a1b2c3d4.json runs/e5f6g7h8.json --markdown
# In CI, exit 1 when any previously-passing case regresses:
$ multivon-eval compare runs/a1b2c3d4.json runs/e5f6g7h8.json \
    --fail-on-regression
terminal output
══ Experiment compare ═════════════════════════════════════════
Metric         Before (a1b2c3d4)   After (e5f6g7h8)
Model          gpt-4o              gpt-4o
n (cases)      100                 100
Pass rate      91%                 74%   ↓ -17pp
95% CI         [82%, 96%]          [64%, 83%]
Faithfulness   0.89 ± 0.04         0.68 ± 0.06   ↓ -0.21
⚠ Pass-rate CIs barely overlap ([82,96] vs [64,83]).
Treat as a likely regression.
Statistical significance: McNemar p = 0.0080
(significant at p<0.05)
Per-case regressions: 22 improvements: 5
Exit 1: pass rate 74% < gate 85%.
What this caught
The Wilson CIs barely overlap — a single-run pass-rate read would have called this 'within noise'. The McNemar test (p=0.008) and the 22 case-level regressions confirm the chunk-size change broke 22 cases that previously passed. Ship-blocking, and every regressed case is named in the report.
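The paired comparison itself is straightforward to sketch: line up the two runs case by case, list the flips in each direction, and test whether the flips are lopsided. The code below uses the exact (binomial) form of the McNemar test. Which variant the compare command uses is not stated here, so p-values from this sketch need not match the CLI's output, and `compare_runs` and `mcnemar_exact` are hypothetical helpers with made-up case data.

```python
from math import comb


def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Diff per-case pass/fail outcomes for the cases present in both runs."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs share no cases")
    regressions = [c for c in shared if before[c] and not after[c]]
    improvements = [c for c in shared if not before[c] and after[c]]
    delta = (sum(after[c] for c in shared) - sum(before[c] for c in shared)) / len(shared)
    return {"regressions": regressions, "improvements": improvements,
            "pass_rate_delta": delta}


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no flips in either direction: nothing to test
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


before = {"c1": True, "c2": True, "c3": True, "c4": True, "c5": True, "c6": False}
after = {"c1": False, "c2": False, "c3": False, "c4": False, "c5": False, "c6": True}
diff = compare_runs(before, after)
p = mcnemar_exact(len(diff["regressions"]), len(diff["improvements"]))
print(diff["regressions"], diff["improvements"])  # ['c1', 'c2', 'c3', 'c4', 'c5'] ['c6']
print(p)  # 0.21875
```

Only the discordant pairs enter the test; cases that pass in both runs or fail in both carry no information about the change. A CI gate in the spirit of --fail-on-regression would exit non-zero whenever the regressions list is non-empty.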
Case 09 · Real run · reproducible · any (the audit format is provider-agnostic) · $0 (after the underlying run)
Audit pack — hash-chained evidence for procurement / EU AI Act
A healthcare insurance customer needs to see your eval trail before they sign. The eval pipeline is part of your compliance documentation, and the artifact has to be tamper-evident.
Generate an audit pack from any eval run — manifest with SHA-256s of every artifact, hash-chained NDJSON eval log (each entry hashes the previous), suite-hash + library-version locked in, JUnit XML + run JSON, every test case, every answer key, framework-mapping metadata (EU AI Act Articles 9/10/14, NIST AI RMF Govern-2/Map-5/Measure-3, HIPAA Safe Harbor).
Auditors verify the chain with the standalone verify.py bundled inside the ZIP (reporter.verify() does the same in-process) — changing any line in the NDJSON breaks every subsequent hash. The customer gets a single ZIP they can attach to their procurement appendix without post-processing.
from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.evaluators.compliance import PIIEvaluator
from multivon_eval.compliance import ComplianceReporter
reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",  # also "nist-ai-rmf", "hipaa"
)
suite = EvalSuite("regulated-rag")
suite.add_evaluators(Faithfulness(), PIIEvaluator(jurisdiction="hipaa"))
suite.add_cases(my_cases)
report = suite.run(my_model)
reporter.record(report, tags={"system": "claims-rag"})
# Then bundle the audit ZIP from the CLI:
# $ multivon-eval audit-package \
# --logs audit-logs \
# --suite regulated-rag \
# --framework eu-ai-act \
# --out audit-pack.zip
#
# Auditor verifies tamper-evidence with the bundled standalone script:
# $ unzip audit-pack.zip -d audit-pack
# $ cd audit-pack && python verify.py
terminal output
$ multivon-eval audit-package --logs audit-logs \
--suite regulated-rag --framework eu-ai-act --out audit-pack.zip
wrote audit-pack.zip (142,388 bytes)
$ unzip -q audit-pack.zip -d audit-pack && cd audit-pack/compliance-evidence-*
$ ls
README.md calibration_v2.json manifest.json
audit_log.ndjson coverage_report.md verify.py
$ python verify.py
OK hash README.md
OK hash audit_log.ndjson
OK hash calibration_v2.json
OK hash coverage_report.md
OK hash verify.py
OK chain record 0
OK chain record 1
# … one line per hash-chained record …
OK chain record 23
VERIFICATION PASSED
What this caught
The ZIP is provider-agnostic, framework-mapped, and tamper-evident — paid PDF Hell engagements ship this exact artifact, and anyone can produce the same bundle from the OSS CLI.
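The tamper-evidence property comes from each log entry committing to the hash of the one before it. The sketch below shows the idea with a toy in-memory log. The record layout (`prev_hash`, `payload`, `hash`) and the helper names are assumptions for illustration and do not describe the actual audit_log.ndjson schema or the bundled verify.py.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev_hash": prev_hash, "payload": payload,
                "hash": entry_hash(prev_hash, payload)})


def first_broken_record(log: list[dict]):
    """Return the index of the first record that fails verification, or None."""
    prev_hash = GENESIS
    for index, record in enumerate(log):
        expected = entry_hash(prev_hash, record["payload"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return None


log: list[dict] = []
for case_id, score in [("c1", 1.0), ("c2", 0.0), ("c3", 1.0)]:
    append(log, {"case": case_id, "score": score})

print(first_broken_record(log))  # None
log[1]["payload"]["score"] = 1.0  # tamper with a stored result
print(first_broken_record(log))  # 1
```

Recomputing the hash of the edited record does not help an attacker, because the next record still stores the old value in its `prev_hash`, so the break simply moves one entry down the chain.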
Case 10 · Real run · reproducible · Anthropic claude-haiku-4-5 (judge) · <$0.05 per session
MCP / Claude Code — agent calls evals mid-edit
Your AI coding agent just rewrote a RAG prompt. You want it to actually check whether the rewrite hallucinates before it commits — without you copy-pasting into a terminal.
Wire multivon-mcp into Claude Desktop / Claude Code / Cursor / Cline. The agent gets 22 eval tools as MCP-native function calls. The session below is verbatim from a real Claude Code window: the user describes a recent shipment, the agent calls eval_compare_runs and eval_faithfulness from the multivon namespace, and surfaces the regression with the case-level breakdown.
No scaffolding and no manual SDK reads: this is the same eval logic the SDK runs, exposed as MCP. Exposing 22 curated tools (rather than the full 44-evaluator catalog) is a context-window optimisation for the agent; eval_discover returns the full catalog on demand.
multivon-mcp · eval_compare_runs · eval_faithfulness
10_mcp_claude_code_session.py
// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-..."
      }
    }
  }
}
// Restart Claude Desktop. The 22 multivon-mcp tools appear in the
// tool inspector. Any agent that speaks MCP works — Cursor, Cline,
// OpenCode, etc. all use the same JSON shape.
terminal output
User: I just rewrote the RAG prompt on the support endpoint and merged.
Can you check if anything got worse vs the prior version?
Agent: I'll diff the last two eval runs you have on disk.
[calls multivon:eval_compare_runs(
before="runs/a1b2c3d4.json",
after ="runs/e5f6g7h8.json",
)]
→ pass_rate_delta: -17pp ci_a: [82%, 96%] ci_b: [64%, 83%]
cis overlap: barely verdict: likely regression
per-case: 22 regressions, 5 improvements
Agent: The rewrite broke 22 cases that previously passed. Let me look
at one to see what changed.
[calls multivon:eval_faithfulness(
input = case_4.input,
output = case_4.new_output,
context= case_4.context,
)]
→ score: 0.667 (passed: False), threshold: 0.9
reason: 2/3 claims grounded
✓ "annual renewal" — supported by context
✓ "30-day notice" — supported by context
✗ "automatic upgrade" — NOT in context
Agent: The new prompt encourages the model to fill gaps with plausible
claims. I'd revert and add a Hallucination evaluator to the CI
gate at threshold >=0.85.
What this caught
End-to-end MCP loop — agent diff'd two runs, found a regression, drilled into a single failing case, named the root cause, proposed a fix. The same logic available via the SDK, no terminal context-switch.
$ pip install multivon-mcp && configure your agent's mcpServers block
Run any of these yourself in < 5 minutes.
The first four case studies are single Python files in the repo — clone or copy one out of GitHub, install the SDK, set the relevant API key, and run it; their terminal output above is exactly what you will see. The rest run straight from the CLI or your agent config.