Reproducible benchmarks for AI evaluation
Every number below has a reproducer command and a raw-JSON source. PDF Hell measures vision models against an adversarial mini suite; multivon-eval benchmarks measure the underlying engine's hallucination detection, multi-judge agreement, cost-per-case, and cache reproducibility.
PDF Hell — mini-v4 sample (17 traps × 10 cases)
11 models · 10 cases per trap family · PDF runs 2026-05-24 · pixels-only runs 2026-06-12

Per-trap pass rates across 11 frontier vision models. Ground truth is code-based: no LLM judges another LLM. Each cell is passes out of 10. 11 of the 17 trap families were proposed and validated by an autoresearch loop. Raw run JSON is on GitHub; reproduce any row with uvx pdfhell run --model <p>:<m> --suite mini-v4-sample.
Pixels-only column: the same cases rasterised locally at 150 dpi and sent as images (--pixels), so the model never sees the PDF’s embedded text layer. The deltas show how much of each score is the model’s vision vs its provider’s PDF-ingestion pipeline — e.g. gpt-4o’s Hidden-OCR 0% becomes 100% on pixels (text-layer trust, not blindness). Per-trap cells below remain PDF-modality. Method + per-trap pixels data: pdfhell#1. Note: the ZWS-split and U+-confuse families were redesigned in pdfhell 0.6.1 after the pixels run exposed a rendering bug (pdfhell#8) — those two columns reflect the pre-redesign traps.
| Model | Overall (PDF) | Pixels-only @150 dpi · Δ vs PDF | Hidden OCR | Footnote | Split table | Composite | 3.5pt foot | Xpage coref | U+ confuse | ZWS split | FX clause | Mirror foot | Dash sign | 180° amt | Checksum | Mirror glyph | Bold rule | Shade rule | Color rule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| openai:gpt-5 | 161/170 (95%) | 68% (-27.1) | 7/10 | 9/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 5/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| anthropic:claude-haiku-4-5 | 155/170 (91%) | 58% (-32.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 8/10 | 10/10 | 8/10 | 10/10 | 4/10 trap: 60% | 10/10 | 8/10 | 9/10 | 10/10 | 10/10 | 10/10 |
| google:gemini-flash-lite-latest | 151/170 (89%) | 61% (-28.2) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 0/10 trap: 50% | 10/10 | 10/10 | 7/10 | 10/10 | 6/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| openai:gpt-4o | 138/170 (81%) | 60% (-21.2) | 0/10 trap: 90% | 8/10 | 9/10 | 10/10 | 10/10 | 10/10 | 7/10 | 5/10 | 3/10 | 10/10 | 10/10 | 10/10 | 6/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| anthropic:claude-opus-4-7 | 135/170 (79%) | 86% (+6.5) | 6/10 | 9/10 | 10/10 | 10/10 | 0/10 | 10/10 | 9/10 | 0/10 trap: 70% | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 6/10 | 10/10 | 10/10 | 5/10 trap: 50% |
| google:gemini-2.5-pro | 114/170 (67%) | 73% (+5.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 4/10 | 4/10 | 10/10 | 0/10 | 10/10 | 7/10 | 10/10 | 0/10 | 6/10 | 0/10 | 5/10 |
| anthropic:claude-sonnet-4-6 | 103/170 (61%) | 63% (+2.4) | 10/10 | 8/10 | 10/10 | 10/10 | 0/10 trap: 100% | 10/10 | 7/10 | 0/10 trap: 100% | 10/10 | 10/10 | 1/10 trap: 90% | 0/10 trap: 90% | 7/10 | 0/10 trap: 100% | 10/10 | 10/10 | 0/10 |
| google:gemini-2.5-flash | 101/170 (59%) | 65% (+5.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 6/10 | 6/10 | 8/10 | 0/10 | 0/10 trap: 90% | 2/10 | 10/10 | 0/10 | 6/10 | 0/10 | 5/10 trap: 50% |
| ollama:qwen2.5vl:3b | 60/170 (35%) | — | 10/10 | 6/10 | 0/10 | 0/10 | 0/10 trap: 100% | 0/10 trap: 100% | 3/10 trap: 70% | 1/10 | 0/10 trap: 50% | 3/10 | 9/10 | 10/10 | 3/10 trap: 70% | 4/10 | 6/10 | 0/10 | 5/10 trap: 50% |
| ollama:gemma3:4b | 25/170 (15%) | — | 2/10 | 5/10 | 0/10 | 0/10 trap: 100% | 0/10 trap: 100% | 0/10 trap: 100% | 5/10 | 1/10 | 0/10 | 0/10 | 2/10 | 0/10 | 5/10 | 0/10 trap: 50% | 1/10 | 0/10 | 4/10 trap: 50% |
| ollama:moondream | 0/170 (0%) | — | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 |
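The Overall (PDF) column is the sum of the 17 per-trap cells. A minimal sketch of that arithmetic, using the openai:gpt-5 row; the names `GPT5_PASSES` and `overall` are illustrative and not part of pdfhell:

```python
# Per-trap passes (out of 10) copied from the openai:gpt-5 row,
# in column order from "Hidden OCR" to "Color rule".
GPT5_PASSES = [7, 9, 10, 10, 10, 10, 10, 5, 10, 10, 10, 10, 10, 10, 10, 10, 10]
CASES_PER_TRAP = 10


def overall(passes, cases_per_trap=CASES_PER_TRAP):
    """Return (total passes, total cases, pass rate in percent)."""
    total = sum(passes)
    cases = len(passes) * cases_per_trap
    return total, cases, 100.0 * total / cases


total, cases, pct = overall(GPT5_PASSES)
print(f"{total}/{cases} ({pct:.0f}%)")  # prints "161/170 (95%)"
```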
Three replicable per-trap blind spots: (1) GPT-4o fails hidden_ocr_mismatch 10/10, giving the trap answer in 90% of cases (holds from mini-v1); (2) Opus 4-7 and Sonnet 4-6 both score 0/10 on scale_dependent_rendering on PDF input, the 3.5pt-footnote trap that Haiku 4-5 and GPT-5 both pass 10/10; on locally-rasterised pixels Opus passes 10/10, so the failure is PDF ingestion, not vision; (3) Opus 4-7, Sonnet 4-6, and Gemini-Flash-Lite all score 0/10 on zero_width_space_split (pre-0.6.1 trap design; on pixels Opus only improves to 30%, so there is no inversion).

Aggregate gap, explained: Sonnet 4-6 (60.6%) sits 31 points under Haiku 4-5 (91.2%) on PDF input, but the 2026-06-12 pixels runs show it was never a capability gap: Sonnet’s 62.9% on pixels roughly matches its 60.6% on PDF, while Haiku drops to 58.2%. Haiku’s lead came from the text layer: two ingestion behaviors, same provider.
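The modality comparison is a subtraction per model, followed by a comparison of the two gaps. A small sketch using the rounded percentages quoted in this section; the dictionary and function names are illustrative. Because the inputs are already rounded, the deltas can differ by 0.1 from the table, which is computed from unrounded scores.

```python
# (PDF %, pixels-only %) as quoted in the text; rounded values.
SCORES = {
    "claude-sonnet-4-6": (60.6, 62.9),
    "claude-haiku-4-5": (91.2, 58.2),
}


def modality_delta(pdf_pct, pixels_pct):
    """Percentage-point change when the text layer is removed (pixels minus PDF)."""
    return round(pixels_pct - pdf_pct, 1)


def gap(model_a, model_b, column):
    """Score of model_a minus model_b in one column (0 = PDF, 1 = pixels)."""
    return round(SCORES[model_a][column] - SCORES[model_b][column], 1)


for name, (pdf, pixels) in SCORES.items():
    print(f"{name}: {modality_delta(pdf, pixels):+.1f} points on pixels")

# Haiku leads by about 31 points on PDF input but trails on pixels.
print("Haiku - Sonnet, PDF:   ", gap("claude-haiku-4-5", "claude-sonnet-4-6", 0))
print("Haiku - Sonnet, pixels:", gap("claude-haiku-4-5", "claude-sonnet-4-6", 1))
```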
Reproduce any row
    pip install pdfhell
    export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY / GOOGLE_API_KEY
    pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4-sample
    # or the full 510-case suite (~3x cost):
    pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4
multivon-eval benchmarks
How the underlying SDK’s judges perform on hallucination detection, cross-judge agreement, cost-per-case, and reproducibility. JSON results live in benchmarks/results/ of the multivon-eval repo; this section reads them directly at build time.
Hallucination detection · head-to-head
halueval_qa · N=100 cases · claude-haiku-4-5 judge

| Evaluator | Precision | Recall | False positives | Latency | F1 [95% CI] |
|---|---|---|---|---|---|
| multivon_eval (QAG) | 0.788 | 0.820 | 11 | 2955ms | 0.804 [0.71–0.88] |
| simple_judge (1-10) | 0.617 | 1.000 | 31 | 708ms | 0.763 [0.68–0.83] |
| deepeval (GPT-4o-mini) | 0.456 | 0.820 | 49 | 1421ms | 0.586 [0.48–0.68] |
| keyword_overlap | 0.605 | 0.460 | 15 | 0ms | 0.523 [0.39–0.64] |
In-distribution result: the multivon-eval QAG row uses the same HaluEval-QA-100 split that calibrated its threshold (dataset_hash: halueval-qa-2024-100c), so the F1 is best read as a sanity check that the calibrated default works — not as a held-out generalization claim. For the held-out figure on HaluEval-Summarization (frozen threshold, different task), see the Faithfulness section in the benchmarks README. F1 brackets show the 95% bootstrap CI (1000 resamples, stable seed).
All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon-eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.
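For readers who want to see how an F1 with a bootstrap interval can be computed, here is a minimal percentile-bootstrap sketch. It is not the benchmark script's code. The confusion counts (41 TP, 9 FN, 11 FP, 39 TN) are reconstructed from the QAG row's precision, recall, and false-positive count on a balanced 100-case set, and the seed and percentile method are assumptions, so the interval will not necessarily match the published bracket digit for digit.

```python
import random


def f1_score(labels, preds):
    """F1 for binary labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(labels, preds, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI: resample cases with replacement, recompute F1."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(labels)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Reconstructed outcome vectors: 41 TP, 9 FN, 11 FP, 39 TN.
labels = [1] * 41 + [1] * 9 + [0] * 11 + [0] * 39
preds = [1] * 41 + [0] * 9 + [1] * 11 + [0] * 39

point = f1_score(labels, preds)
lo, hi = bootstrap_f1_ci(labels, preds)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

With only 100 cases the interval is wide, which is why the top two rows overlap even though their point estimates differ.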
Multi-judge agreement · per-judge accuracy
halueval_qa · N=50 pairs · temperature=0

| # | Judge model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | 0.860 | 0.950 | 0.760 | 0.844 |
| 2 | gpt-4o-mini | 0.820 | 0.900 | 0.720 | 0.800 |
| 3 | gpt-4o | 0.780 | 0.792 | 0.760 | 0.776 |
| 4 | claude-haiku-4-5 | 0.800 | 0.895 | 0.680 | 0.773 |
| 5 | claude-sonnet-4-6 | 0.720 | 0.720 | 0.720 | 0.720 |
Pairwise judge agreement (Cohen's κ)
| Judge pair | κ | Strength |
|---|---|---|
| claude-sonnet-4-6 ↔ gpt-4o | 0.800 | substantial |
| claude-haiku-4-5 ↔ gemini-2.5-flash | 0.790 | substantial |
| gpt-4o-mini ↔ gpt-4o | 0.758 | substantial |
| gpt-4o ↔ gemini-2.5-flash | 0.758 | substantial |
| gpt-4o-mini ↔ gemini-2.5-flash | 0.750 | substantial |
| claude-sonnet-4-6 ↔ gpt-4o-mini | 0.720 | substantial |
| claude-sonnet-4-6 ↔ gemini-2.5-flash | 0.720 | substantial |
| claude-haiku-4-5 ↔ gpt-4o | 0.717 | substantial |
| claude-haiku-4-5 ↔ gpt-4o-mini | 0.706 | substantial |
| claude-haiku-4-5 ↔ claude-sonnet-4-6 | 0.600 | moderate |
κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge-disagreement on hard cases — the examples that benefit most from cross-judge calibration.
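Cohen's κ corrects raw agreement for the agreement two judges would reach by chance. A self-contained sketch of the statistic and the Landis & Koch bands used in the Strength column; the function names are illustrative and this is not the benchmark script's code:

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical verdicts."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each judge's marginal rate, summed over labels.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:  # both judges always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map kappa to the Landis & Koch (1977) strength label."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy verdicts (1 = hallucinated, 0 = faithful); not benchmark data.
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]
k = cohens_kappa(judge_a, judge_b)
print(k, landis_koch(k))  # prints "0.5 moderate"
```

In the toy example the judges agree on 3 of 4 cases (75%), but half of that is expected by chance, so κ is 0.5 and not 0.75.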
Cost & latency
50 cases · 4 LLM-judge evaluators · workers=1

| Metric | Value | Detail |
|---|---|---|
| Cost per case | $0.00127 | 17.1 judge calls per case |
| Total cost for the run | $0.0635 | 46,294 input + 6,605 output tokens |
| Wall clock | 15.0 min | 18.0s per case avg |
| Extrapolation to 5,000 cases | $6.35 | Linear extrapolation; workers=1 (single LLM call at a time) |
| Deterministic tier (no LLM) | $0.00 | 3 evaluators, instant |
Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
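The extrapolation is plain proportional scaling. A short sketch of the arithmetic; the `cache_hit_rate` parameter is a hypothetical addition to show how cached re-runs would lower the figure, under the assumption that a cache hit costs nothing:

```python
def extrapolate_cost(total_cost_usd, measured_cases, target_cases, cache_hit_rate=0.0):
    """Linear cost extrapolation; cache_hit_rate is a hypothetical knob (0.0 = no cache)."""
    per_case = total_cost_usd / measured_cases
    return per_case * target_cases * (1.0 - cache_hit_rate)


def wall_clock_hours(seconds_per_case, cases):
    """Sequential (workers=1) wall clock in hours."""
    return seconds_per_case * cases / 3600


print(f"${extrapolate_cost(0.0635, 50, 5000):.2f}")  # prints "$6.35"
print(f"{wall_clock_hours(18.0, 5000):.1f} h")       # prints "25.0 h"
```

At workers=1 the same linear scaling puts 5,000 cases at about 25 hours of wall clock, so in practice the run would be parallelised even though the dollar cost stays the same.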
Reproducibility & cache
halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5

| Mode | Wall clock | pass_rate σ | avg_score σ | N reps |
|---|---|---|---|---|
| Cache ON (rep 1 cold, rest warm) | 8.5s | 0.0000 | 0.0000 | 10 |
| Cache OFF (every call hits the API) | 74.3s | 0.0422 | 0.0084 | 10 |
Cache hit speedup: 8.7× (cold → warm rerun). Cache ON reproduces identical scores rep-to-rep. Cache OFF exposes the irreducible judge variance at temperature=0. Raw JSON: reproducibility.json.
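To illustrate why a warm cache gives σ = 0, here is a minimal sketch of a judge cache keyed on the request contents. The `JudgeCache` class and `fake_judge` function are hypothetical stand-ins, not multivon-eval's implementation. The last lines show how a pass_rate σ of the reported size can arise: the per-rep values are made up for illustration, and the use of the sample standard deviation is an assumption.

```python
import hashlib
import json
import statistics


class JudgeCache:
    """Memoise judge verdicts by a hash of (model, prompt, temperature)."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn
        self._store = {}
        self.api_calls = 0  # counts real (uncached) judge invocations

    def score(self, model, prompt, temperature=0.0):
        payload = json.dumps([model, prompt, temperature]).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._store:
            self.api_calls += 1
            self._store[key] = self._judge_fn(model, prompt, temperature)
        return self._store[key]


def fake_judge(model, prompt, temperature):
    """Stand-in for an API call; a real judge may vary slightly between calls."""
    return 1.0 if "Paris" in prompt else 0.0


cache = JudgeCache(fake_judge)
scores = [cache.score("claude-haiku-4-5", "Is Paris the capital of France?")
          for _ in range(10)]
print(cache.api_calls, set(scores))  # one real call, one distinct score

# Illustrative only: if 2 of 10 uncached reps flip a single case out of 10,
# the sample standard deviation of the pass rate is about 0.0422.
illustrative_pass_rates = [1.0] * 8 + [0.9] * 2
sigma = statistics.stdev(illustrative_pass_rates)
print(round(sigma, 4))
```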
Reproduce locally
    git clone https://github.com/multivon-ai/multivon-eval
    cd multivon-eval/benchmarks
    pip install -e .. anthropic openai google-genai
    export ANTHROPIC_API_KEY=...
    export OPENAI_API_KEY=...
    export GOOGLE_API_KEY=...
    # pick one (each writes to benchmarks/results/*.json):
    python run_hallucination_benchmark.py
    python run_multi_judge_agreement_benchmark.py
    python run_cost_latency_benchmark.py
    python run_reproducibility_benchmark.py
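All four scripts run their judges at temperature=0, so results are deterministic up to residual API non-determinism. Wall-clock estimates: ~50–120s each for the in-house judges, longer when external judges are added. The cost benchmark is the exception: it is pinned to workers=1 and took 15.0 min for 50 cases in the run reported under Cost & latency.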
Submit a new judge
Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.