Leaderboard · two boards · one page

Reproducible benchmarks for AI evaluation

Every number below has a reproducer command and a raw-JSON source. PDF Hell measures vision models against an adversarial mini suite; multivon-eval benchmarks measure the underlying engine's hallucination detection, multi-judge agreement, cost-per-case, and cache reproducibility.

PDF Hell — mini-v4 sample (17 traps × 10 cases)

11 models · 10 cases per trap family · PDF runs 2026-05-24 · pixels-only runs 2026-06-12

Per-trap pass rates across 11 frontier vision models. Code-based ground truth — no LLM judging another LLM. Each cell is passes out of 10. 11 of the 17 trap families were proposed and validated by an autoresearch loop. Raw run JSON on GitHub — reproduce any row with uvx pdfhell run --model <p>:<m> --suite mini-v4-sample.
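As a quick sanity check on how the cells roll up, the sketch below sums one row's per-trap passes into its Overall figure. The list is the openai:gpt-5 row read left to right from the table; the helper name `overall` is illustrative and is not part of pdfhell.

```python
# Per-trap passes for openai:gpt-5, read left to right from the table
# (Hidden OCR ... Color rule). Each cell is passes out of 10.
gpt5_cells = [7, 9, 10, 10, 10, 10, 10, 5, 10, 10, 10, 10, 10, 10, 10, 10, 10]

CASES_PER_TRAP = 10


def overall(cells, cases_per_trap=CASES_PER_TRAP):
    """Return (passes, total, percent) for one model row."""
    passes = sum(cells)
    total = len(cells) * cases_per_trap
    return passes, total, 100.0 * passes / total


passes, total, pct = overall(gpt5_cells)
print(f"{passes}/{total} ({pct:.0f}%)")  # 161/170 (95%)
```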

Pixels-only column: the same cases rasterised locally at 150 dpi and sent as images (--pixels), so the model never sees the PDF’s embedded text layer. The deltas show how much of each score is the model’s vision vs its provider’s PDF-ingestion pipeline — e.g. gpt-4o’s Hidden-OCR 0% becomes 100% on pixels (text-layer trust, not blindness). Per-trap cells below remain PDF-modality. Method + per-trap pixels data: pdfhell#1. Note: the ZWS-split and U+-confuse families were redesigned in pdfhell 0.6.1 after the pixels run exposed a rendering bug (pdfhell#8) — those two columns reflect the pre-redesign traps.
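The Δ figures in the pixels-only column appear to be plain percentage-point differences (pixels minus PDF). A minimal sketch, assuming that reading: the PDF pass counts are taken from the table, while the pixel pass counts (99 and 107) are assumptions back-derived from the published pixels-only percentages, not values read from the raw JSON.

```python
def pass_rate(passes: int, total: int = 170) -> float:
    """Pass rate in percent."""
    return 100.0 * passes / total


def pixels_delta(pdf_passes: int, pixel_passes: int, total: int = 170) -> float:
    """Percentage-point change when moving from PDF input to pixels-only input."""
    return pass_rate(pixel_passes, total) - pass_rate(pdf_passes, total)


# (pdf_passes, pixel_passes); pixel counts are ASSUMED, back-derived from
# the published 58.2% and 62.9% pixels-only figures.
runs = {
    "anthropic:claude-haiku-4-5": (155, 99),
    "anthropic:claude-sonnet-4-6": (103, 107),
}

for model, (pdf, px) in runs.items():
    print(f"{model}: {pixels_delta(pdf, px):+.1f}")
```

A negative delta means the PDF score leaned on the embedded text layer; a delta near zero means the score was mostly vision to begin with.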

| Model | Overall (PDF) | Pixels-only @150 dpi (Δ vs PDF) | Hidden OCR | Footnote | Split table | Composite | 3.5pt foot | Xpage coref | U+ confuse | ZWS split | FX clause | Mirror foot | Dash sign | 180° amt | Checksum | Mirror glyph | Bold rule | Shade rule | Color rule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| openai:gpt-5 | 161/170 (95%) | 68% (-27.1) | 7/10 | 9/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 5/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| anthropic:claude-haiku-4-5 | 155/170 (91%) | 58% (-32.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 8/10 | 10/10 | 8/10 | 10/10 | 4/10 (trap: 60%) | 10/10 | 8/10 | 9/10 | 10/10 | 10/10 | 10/10 |
| google:gemini-flash-lite-latest | 151/170 (89%) | 61% (-28.2) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 0/10 (trap: 50%) | 10/10 | 10/10 | 7/10 | 10/10 | 6/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| openai:gpt-4o | 138/170 (81%) | 60% (-21.2) | 0/10 (trap: 90%) | 8/10 | 9/10 | 10/10 | 10/10 | 10/10 | 7/10 | 5/10 | 3/10 | 10/10 | 10/10 | 10/10 | 6/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| anthropic:claude-opus-4-7 | 135/170 (79%) | 86% (+6.5) | 6/10 | 9/10 | 10/10 | 10/10 | 0/10 | 10/10 | 9/10 | 0/10 (trap: 70%) | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 6/10 | 10/10 | 10/10 | 5/10 (trap: 50%) |
| google:gemini-2.5-pro | 114/170 (67%) | 73% (+5.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 4/10 | 4/10 | 10/10 | 0/10 | 10/10 | 7/10 | 10/10 | 0/10 | 6/10 | 0/10 | 5/10 |
| anthropic:claude-sonnet-4-6 | 103/170 (61%) | 63% (+2.4) | 10/10 | 8/10 | 10/10 | 10/10 | 0/10 (trap: 100%) | 10/10 | 7/10 | 0/10 (trap: 100%) | 10/10 | 10/10 | 1/10 (trap: 90%) | 0/10 (trap: 90%) | 7/10 | 0/10 (trap: 100%) | 10/10 | 10/10 | 0/10 |
| google:gemini-2.5-flash | 101/170 (59%) | 65% (+5.9) | 10/10 | 8/10 | 10/10 | 10/10 | 10/10 | 10/10 | 6/10 | 6/10 | 8/10 | 0/10 | 0/10 (trap: 90%) | 2/10 | 10/10 | 0/10 | 6/10 | 0/10 | 5/10 (trap: 50%) |
| ollama:qwen2.5vl:3b | 60/170 (35%) | | 10/10 | 6/10 | 0/10 | 0/10 | 0/10 (trap: 100%) | 0/10 (trap: 100%) | 3/10 (trap: 70%) | 1/10 | 0/10 (trap: 50%) | 3/10 | 9/10 | 10/10 | 3/10 (trap: 70%) | 4/10 | 6/10 | 0/10 | 5/10 (trap: 50%) |
| ollama:gemma3:4b | 25/170 (15%) | | 2/10 | 5/10 | 0/10 | 0/10 (trap: 100%) | 0/10 (trap: 100%) | 0/10 (trap: 100%) | 5/10 | 1/10 | 0/10 | 0/10 | 2/10 | 0/10 | 5/10 | 0/10 (trap: 50%) | 1/10 | 0/10 | 4/10 (trap: 50%) |
| ollama:moondream | 0/170 (0%) | | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 |

Three replicable per-trap blind spots:

(1) GPT-4o falls for hidden_ocr_mismatch 10/10 (holds from mini-v1).

(2) Opus 4-7 and Sonnet 4-6 both fail scale_dependent_rendering 0/10 on PDF input. This is the 3.5pt-footnote trap that Haiku 4-5 passes 9/10 and GPT-5 passes 10/10. On locally-rasterised pixels Opus passes 10/10, so the failure is PDF ingestion, not vision.

(3) Opus 4-7, Sonnet 4-6, and Gemini-Flash-Lite all fail zero_width_space_split 0/10 (pre-0.6.1 trap design; on pixels Opus only improves to 30%, so there is no inversion).

Aggregate gap, explained: Sonnet 4-6 (60.6%) sits 31 points under Haiku 4-5 (91.2%) on PDF input, but the 2026-06-12 pixels runs show it was never a capability gap. Sonnet’s 60.6% on PDF ≈ its 62.9% on pixels, while Haiku drops to 58.2%. Haiku’s lead came from the text layer: two ingestion behaviors, same provider.

Reproduce any row

pip install pdfhell
export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY / GOOGLE_API_KEY
pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4-sample
# or the full 510-case suite (~3x cost):
pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4

multivon-eval benchmarks

How the underlying SDK’s judges perform on hallucination detection, cross-judge agreement, cost-per-case, and reproducibility. JSON results live in benchmarks/results/ of the multivon-eval repo; this section reads them directly at build time.

Hallucination detection · head-to-head

halueval_qa · N=100 cases · claude-haiku-4-5 judge
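| Evaluator | Precision | Recall | False positives | Latency | F1 [95% CI] |
|---|---|---|---|---|---|
| multivon_eval (QAG) | 0.788 | 0.820 | 11 | 2955ms | 0.804 [0.71, 0.88] |
| simple_judge (1-10) | 0.617 | 1.000 | 31 | 708ms | 0.763 [0.68, 0.83] |
| deepeval (GPT-4o-mini) | 0.456 | 0.820 | 49 | 1421ms | 0.586 [0.48, 0.68] |
| keyword_overlap | 0.605 | 0.460 | 15 | 0ms | 0.523 [0.39, 0.64] |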

In-distribution result: the multivon-eval QAG row uses the same HaluEval-QA-100 split that calibrated its threshold (dataset_hash: halueval-qa-2024-100c), so the F1 is best read as a sanity check that the calibrated default works — not as a held-out generalization claim. For the held-out figure on HaluEval-Summarization (frozen threshold, different task), see the Faithfulness section in the benchmarks README. F1 brackets show the 95% bootstrap CI (1000 resamples, stable seed).
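For readers unfamiliar with the bracket notation, here is a minimal sketch of a percentile bootstrap CI for F1 with 1000 resamples and a fixed seed. It is not the benchmark's own code. The function names are illustrative, and the toy labels are an assumption: a confusion matrix of 41 TP, 9 FN, 11 FP and 39 TN, back-derived from the QAG row under a 50/50 class balance.

```python
import random


def f1_score(labels, preds):
    """F1 from parallel lists of booleans (True = positive class)."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(labels, preds, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI: resample cases with replacement, recompute F1."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return scores[lo_idx], scores[hi_idx]


# ASSUMED toy data: 41 TP, 9 FN, 11 FP, 39 TN (100 cases).
labels = [True] * 50 + [False] * 50
preds = [True] * 41 + [False] * 9 + [True] * 11 + [False] * 39

point = f1_score(labels, preds)
low, high = bootstrap_f1_ci(labels, preds)
print(f"F1={point:.3f} CI=[{low:.2f}, {high:.2f}]")
```

The exact bounds depend on the seed and the resampling scheme, so they will not necessarily match the published brackets digit for digit.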

All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon-eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.
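The table's columns can be cross-checked against each other. The sketch below assumes the balanced split means 50 positive cases, derives true positives from recall, and recomputes precision and F1 from the published false-positive counts. The helper `derive` is illustrative and not part of multivon-eval.

```python
POSITIVES = 50  # assumption: balanced 100-case set, so 50 positive cases

# evaluator -> (recall, false_positives), copied from the table
rows = {
    "multivon_eval (QAG)": (0.820, 11),
    "simple_judge (1-10)": (1.000, 31),
    "deepeval (GPT-4o-mini)": (0.820, 49),
    "keyword_overlap": (0.460, 15),
}


def derive(recall, false_positives, positives=POSITIVES):
    """Recover precision and F1 from recall and the false-positive count."""
    tp = round(recall * positives)
    fn = positives - tp
    precision = tp / (tp + false_positives)
    f1 = 2 * tp / (2 * tp + false_positives + fn)
    return precision, f1


derived = {name: derive(*vals) for name, vals in rows.items()}
for name, (precision, f1) in derived.items():
    print(f"{name}: precision={precision:.3f} f1={f1:.3f}")
```

Under that assumption every recomputed precision and F1 agrees with the published row to three decimals.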

Multi-judge agreement · per-judge accuracy

halueval_qa · N=50 pairs · temperature=0
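| # | Judge model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | 0.860 | 0.950 | 0.760 | 0.844 |
| 2 | gpt-4o-mini | 0.820 | 0.900 | 0.720 | 0.800 |
| 3 | gpt-4o | 0.780 | 0.792 | 0.760 | 0.776 |
| 4 | claude-haiku-4-5 | 0.800 | 0.895 | 0.680 | 0.773 |
| 5 | claude-sonnet-4-6 | 0.720 | 0.720 | 0.720 | 0.720 |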

Pairwise judge agreement (Cohen's κ)

| Judge pair | κ | Strength |
|---|---|---|
| claude-sonnet-4-6 ↔ gpt-4o | 0.800 | substantial |
| claude-haiku-4-5 ↔ gemini-2.5-flash | 0.790 | substantial |
| gpt-4o-mini ↔ gpt-4o | 0.758 | substantial |
| gpt-4o ↔ gemini-2.5-flash | 0.758 | substantial |
| gpt-4o-mini ↔ gemini-2.5-flash | 0.750 | substantial |
| claude-sonnet-4-6 ↔ gpt-4o-mini | 0.720 | substantial |
| claude-sonnet-4-6 ↔ gemini-2.5-flash | 0.720 | substantial |
| claude-haiku-4-5 ↔ gpt-4o | 0.717 | substantial |
| claude-haiku-4-5 ↔ gpt-4o-mini | 0.706 | substantial |
| claude-haiku-4-5 ↔ claude-sonnet-4-6 | 0.600 | moderate |

κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge-disagreement on hard cases — the examples that benefit most from cross-judge calibration.
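For reference, a minimal sketch of Cohen's κ between two judges, with the Landis & Koch bands used in the Strength column. The verdict lists are made-up toy data (50 pairs, 7 disagreements), not the benchmark's judge outputs, and the function names are illustrative.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical verdicts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    # Chance agreement: product of each judge's marginal rate per category.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both judges always give the same single label
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Landis & Koch (1977) strength-of-agreement label."""
    k = round(kappa, 3)  # match the 3-decimal precision shown in the table
    if k < 0:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy verdicts: judge_b flips 4 of judge_a's positives and 3 of its negatives.
judge_a = [1] * 25 + [0] * 25
judge_b = [1] * 21 + [0] * 4 + [0] * 22 + [1] * 3

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa={kappa:.3f} ({landis_koch(kappa)})")
```

With 86% raw agreement and 50% chance agreement, the toy pair lands at κ = 0.72, which falls in the "substantial" band.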

Cost & latency

50 cases · 4 LLM-judge evaluators · workers=1
| Metric | Value | Detail |
|---|---|---|
| Cost per case | $0.00127 | 17.1 judge calls per case |
| Total cost for the run | $0.0635 | 46,294 input + 6,605 output tokens |
| Wall clock | 15.0 min | 18.0s per case avg |
| Extrapolation to 5,000 cases | $6.35 | Linear extrapolation; workers=1 (single LLM call at a time) |
| Deterministic tier (no LLM) | $0.00 | 3 evaluators, instant |

Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
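The extrapolation is simple enough to redo by hand. A minimal sketch using the run totals above, assuming strictly linear scaling and no cache hits; the wall-clock projection is the same linear assumption applied to the 18.0s per-case average at workers=1, and the function name is illustrative.

```python
def extrapolate(total_cost_usd, n_cases, seconds_per_case, target_cases):
    """Linear projection of cost and serial wall clock to a larger run."""
    per_case = total_cost_usd / n_cases
    projected_cost = per_case * target_cases
    projected_hours = seconds_per_case * target_cases / 3600.0
    return per_case, projected_cost, projected_hours


per_case, cost_5k, hours_5k = extrapolate(
    total_cost_usd=0.0635,   # total cost for the 50-case run
    n_cases=50,
    seconds_per_case=18.0,   # average wall clock per case at workers=1
    target_cases=5000,
)
print(f"per case ${per_case:.5f}, 5,000 cases ${cost_5k:.2f}, ~{hours_5k:.0f} h serial")
```

Parallel workers would shorten the wall clock but, under this assumption, leave the cost projection unchanged.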

Reproducibility & cache

halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5
| Mode | Wall clock | pass_rate σ | avg_score σ | N reps |
|---|---|---|---|---|
| Cache ON (rep 1 cold, rest warm) | 8.5s | 0.0000 | 0.0000 | 10 |
| Cache OFF (every call hits the API) | 74.3s | 0.0422 | 0.0084 | 10 |

Cache hit speedup: 8.7× (cold → warm rerun). Cache ON reproduces identical scores rep-to-rep. Cache OFF exposes the irreducible judge variance at temperature=0. Raw JSON: reproducibility.json.
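To illustrate why a warm cache drives σ to zero, here is a toy sketch of a content-addressed judge cache. Everything in it is hypothetical: `JudgeCache`, `flaky_judge` and the key scheme are stand-ins for illustration and do not describe multivon-eval's actual cache.

```python
import hashlib
import json
import random
import statistics

_rng = random.Random(7)  # fixed seed so the sketch itself is repeatable


def flaky_judge(model, prompt):
    """Stand-in for a real API call: pass/fail with some run-to-run noise."""
    return _rng.random() < 0.9


class JudgeCache:
    """Hypothetical cache keyed by a hash of (model, prompt)."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, model, prompt, call):
        key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = call(model, prompt)  # cold: hit the "API" once
        return self._store[key]                     # warm: replay the verdict


def pass_rates(prompts, reps, cache=None):
    """Pass rate per repetition, with or without the cache."""
    rates = []
    for _ in range(reps):
        if cache is not None:
            verdicts = [cache.get_or_call("judge", p, flaky_judge) for p in prompts]
        else:
            verdicts = [flaky_judge("judge", p) for p in prompts]
        rates.append(sum(verdicts) / len(verdicts))
    return rates


prompts = [f"case-{i}" for i in range(10)]
sigma_on = statistics.pstdev(pass_rates(prompts, reps=10, cache=JudgeCache()))
sigma_off = statistics.pstdev(pass_rates(prompts, reps=10))
print(f"cache ON sigma={sigma_on:.4f}, cache OFF sigma={sigma_off:.4f}")
```

With the cache on, every repetition after the first replays stored verdicts, so the pass rate cannot move. With it off, each repetition samples the judge again.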

Reproduce locally

git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval/benchmarks
pip install -e .. anthropic openai google-genai
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

# pick one (each writes to benchmarks/results/*.json):
python run_hallucination_benchmark.py
python run_multi_judge_agreement_benchmark.py
python run_cost_latency_benchmark.py
python run_reproducibility_benchmark.py

All four scripts are deterministic at temperature=0 modulo API non-determinism. Wall-clock estimates: ~50–120s each for the in-house judges, longer when external judges are added.

Submit a new judge

Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.

Open the external-judge harness →