Leaderboard · two boards · one page

Reproducible benchmarks for AI evaluation

Every number below has a reproducer command and a raw-JSON source. PDF Hell measures vision models against an adversarial mini suite; multivon-eval benchmarks measure the underlying engine's hallucination detection, multi-judge agreement, cost-per-case, and cache reproducibility.


PDF Hell — mini-v4 sample (17 traps × 10 cases)

11 models · 10 cases per trap family · PDF runs 2026-05-24 · pixels-only runs 2026-06-12

Per-trap pass rates across 11 frontier vision models. Ground truth is code-based: no LLM judges another LLM. Each cell is passes out of 10. 11 of the 17 trap families were proposed and validated by an autoresearch loop. Raw run JSON is on GitHub; reproduce any row with uvx pdfhell run --model <p>:<m> --suite mini-v4-sample, where <p>:<m> is provider:model (for example anthropic:claude-opus-4-7).

Pixels-only column: the same cases rasterised locally at 150 dpi and sent as images (--pixels), so the model never sees the PDF’s embedded text layer. The deltas show how much of each score is the model’s vision vs its provider’s PDF-ingestion pipeline — e.g. gpt-4o’s Hidden-OCR 0% becomes 100% on pixels (text-layer trust, not blindness). Per-trap cells below remain PDF-modality. Method + per-trap pixels data: pdfhell#1. Note: the ZWS-split and U+-confuse families were redesigned in pdfhell 0.6.1 after the pixels run exposed a rendering bug (pdfhell#8) — those two columns reflect the pre-redesign traps.

Model	Overall (PDF)	Pixels-only @150 dpi (Δ vs PDF, pts)	Hidden OCR	Footnote	Split table	Composite	3.5pt foot	Xpage coref	U+ confuse	ZWS split	FX clause	Mirror foot	Dash sign	180° amt	Checksum	Mirror glyph	Bold rule	Shade rule	Color rule
openai:gpt-5	161/170 (95%)	68% (-27.1)	7/10	9/10	10/10	10/10	10/10	10/10	10/10	5/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10
anthropic:claude-haiku-4-5	155/170 (91%)	58% (-32.9)	10/10	8/10	10/10	10/10	10/10	10/10	8/10	10/10	8/10	10/10	4/10 (trap: 60%)	10/10	8/10	9/10	10/10	10/10	10/10
google:gemini-flash-lite-latest	151/170 (89%)	61% (-28.2)	10/10	8/10	10/10	10/10	10/10	10/10	10/10	0/10 (trap: 50%)	10/10	10/10	7/10	10/10	6/10	10/10	10/10	10/10	10/10
openai:gpt-4o	138/170 (81%)	60% (-21.2)	0/10 (trap: 90%)	8/10	9/10	10/10	10/10	10/10	7/10	5/10	3/10	10/10	10/10	10/10	6/10	10/10	10/10	10/10	10/10
anthropic:claude-opus-4-7	135/170 (79%)	86% (+6.5)	6/10	9/10	10/10	10/10	0/10	10/10	9/10	0/10 (trap: 70%)	10/10	10/10	10/10	10/10	10/10	6/10	10/10	10/10	5/10 (trap: 50%)
google:gemini-2.5-pro	114/170 (67%)	73% (+5.9)	10/10	8/10	10/10	10/10	10/10	10/10	4/10	4/10	10/10	0/10	10/10	7/10	10/10	0/10	6/10	0/10	5/10
anthropic:claude-sonnet-4-6	103/170 (61%)	63% (+2.4)	10/10	8/10	10/10	10/10	0/10 (trap: 100%)	10/10	7/10	0/10 (trap: 100%)	10/10	10/10	1/10 (trap: 90%)	0/10 (trap: 90%)	7/10	0/10 (trap: 100%)	10/10	10/10	0/10
google:gemini-2.5-flash	101/170 (59%)	65% (+5.9)	10/10	8/10	10/10	10/10	10/10	10/10	6/10	6/10	8/10	0/10	0/10 (trap: 90%)	2/10	10/10	0/10	6/10	0/10	5/10 (trap: 50%)
ollama:qwen2.5vl:3b	60/170 (35%)	n/a	10/10	6/10	0/10	0/10	0/10 (trap: 100%)	0/10 (trap: 100%)	3/10 (trap: 70%)	1/10	0/10 (trap: 50%)	3/10	9/10	10/10	3/10 (trap: 70%)	4/10	6/10	0/10	5/10 (trap: 50%)
ollama:gemma3:4b	25/170 (15%)	n/a	2/10	5/10	0/10	0/10 (trap: 100%)	0/10 (trap: 100%)	0/10 (trap: 100%)	5/10	1/10	0/10	0/10	2/10	0/10	5/10	0/10 (trap: 50%)	1/10	0/10	4/10 (trap: 50%)
ollama:moondream	0/170 (0%)	n/a	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10
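
As a quick consistency check on the table above, the Overall (PDF) column is the sum of the 17 per-trap cells, and the pixels-only score can be recovered by adding Δ to the PDF percentage. A minimal Python sketch using the openai:gpt-5 row; the `overall` helper is an illustrative name, not part of pdfhell:

```python
# Per-trap passes for openai:gpt-5, copied from the table above (17 trap families).
GPT5_PASSES = [7, 9, 10, 10, 10, 10, 10, 5, 10, 10, 10, 10, 10, 10, 10, 10, 10]
CASES_PER_TRAP = 10


def overall(passes, cases_per_trap=CASES_PER_TRAP):
    """Return (total passes, total cases, pass rate in percent)."""
    total = sum(passes)
    cases = len(passes) * cases_per_trap
    return total, cases, 100.0 * total / cases


total, cases, pdf_pct = overall(GPT5_PASSES)

# Delta in percentage points, taken from the pixels-only column of the same row.
delta_vs_pdf = -27.1
pixels_pct = pdf_pct + delta_vs_pdf

print(f"{total}/{cases} ({pdf_pct:.0f}%)")   # 161/170 (95%)
print(f"pixels-only: {pixels_pct:.0f}%")     # pixels-only: 68%
```

The same check can be run on any other row by swapping in its 17 cells and its Δ value.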

Three replicable per-trap blind spots:

(1) GPT-4o falls for hidden_ocr_mismatch on 10/10 cases (holds from mini-v1).
(2) Opus 4-7 and Sonnet 4-6 both fail scale_dependent_rendering 0/10 on PDF input. This is the 3.5pt-footnote trap that Haiku 4-5 and GPT-5 both pass 10/10. On locally-rasterised pixels Opus passes 10/10, so the failure is PDF ingestion, not vision.
(3) Opus 4-7, Sonnet 4-6, and Gemini-Flash-Lite all fail zero_width_space_split 0/10 (pre-0.6.1 trap design; on pixels Opus only improves to 30%, with no inversion).

Aggregate gap, explained: Sonnet 4-6 (60.6%) sits 31 points under Haiku 4-5 (91.2%) on PDF input, but the 2026-06-12 pixels runs show it was never a capability gap. Sonnet’s 60.6% on PDF is close to its 62.9% on pixels, while Haiku drops to 58.2%. Haiku’s lead came from the text layer: two ingestion behaviors, same provider.
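
The aggregate-gap argument reduces to two subtractions. A small Python sketch using only the overall percentages quoted above (the dictionary layout and `gap` helper are illustrative):

```python
# Overall pass rates in percent, as quoted in the text above.
SCORES = {
    "claude-haiku-4-5": {"pdf": 91.2, "pixels": 58.2},
    "claude-sonnet-4-6": {"pdf": 60.6, "pixels": 62.9},
}


def gap(modality):
    """Haiku minus Sonnet, in percentage points, for one input modality."""
    haiku = SCORES["claude-haiku-4-5"][modality]
    sonnet = SCORES["claude-sonnet-4-6"][modality]
    return round(haiku - sonnet, 1)


pdf_gap = gap("pdf")        # positive: Haiku ahead on PDF input
pixels_gap = gap("pixels")  # negative: Haiku behind on rasterised pixels

print(f"PDF gap: {pdf_gap:+.1f} pts, pixels gap: {pixels_gap:+.1f} pts")
```

The sign flips between modalities: Haiku leads by about 30.6 points on PDF input and trails by about 4.7 points on pixels.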

Reproduce any row

pip install pdfhell
export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY / GOOGLE_API_KEY
pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4-sample
# or the full 510-case suite (~3x cost):
pdfhell run --model anthropic:claude-opus-4-7 --suite mini-v4
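
To reproduce several rows in one go, the command above can be wrapped in a small driver. This is a sketch, not part of pdfhell: it uses only the `pdfhell run`, `--model`, `--suite`, and `--pixels` options that appear on this page, it assumes `--pixels` is accepted by `pdfhell run`, and `build_command`, `run_all`, and `dry_run` are illustrative names.

```python
import shlex
import subprocess

# Model IDs taken from the leaderboard rows above.
MODELS = ["anthropic:claude-opus-4-7", "openai:gpt-5"]


def build_command(model, suite="mini-v4-sample", pixels=False):
    """Build the argv list for one benchmark run."""
    cmd = ["pdfhell", "run", "--model", model, "--suite", suite]
    if pixels:
        # Assumption: --pixels is a flag of `pdfhell run`, as the pixels-only note suggests.
        cmd.append("--pixels")
    return cmd


def run_all(models, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for model in models:
        cmd = build_command(model)
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops the loop on the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(MODELS)
```

The dry-run default prints the commands without spending API credits; pass `dry_run=False` once the provider keys are exported.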


multivon-eval benchmarks

How the underlying SDK’s judges perform on hallucination detection, cross-judge agreement, cost-per-case, and reproducibility. JSON results live in benchmarks/results/ of the multivon-eval repo; this section reads them directly at build time.

in-distribution, HaluEval-QA n=100, CI bootstrap-1000

Cost / case: $0.00127 (4 LLM-judge evaluators)

Cache hit speedup: 8.7× (cold → warm rerun)

Hallucination detection · head-to-head

halueval_qa · N=100 cases · claude-haiku-4-5 judge

Evaluator	Precision	Recall	False positives	Latency	F1 [95% CI]
multivon_eval (QAG)	0.788	0.820	11	2955 ms	0.804 [0.71–0.88]
simple_judge (1-10)	0.617	1.000	31	708 ms	0.763 [0.68–0.83]
deepeval (GPT-4o-mini)	0.456	0.820	49	1421 ms	0.586 [0.48–0.68]
keyword_overlap	0.605	0.460	15	0 ms	0.523 [0.39–0.64]

In-distribution result: the multivon-eval QAG row uses the same HaluEval-QA-100 split that calibrated its threshold (dataset_hash: halueval-qa-2024-100c), so the F1 is best read as a sanity check that the calibrated default works — not as a held-out generalization claim. For the held-out figure on HaluEval-Summarization (frozen threshold, different task), see the Faithfulness section in the benchmarks README. F1 brackets show the 95% bootstrap CI (1000 resamples, stable seed).
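
For readers who want to see how a bracket like [0.71–0.88] arises, here is a minimal percentile-bootstrap sketch in Python. It is not the benchmark's own code. The confusion counts (TP=41, FP=11, FN=9, TN=39) are reconstructed from the precision, recall, and false-positive figures in the QAG row, assuming 50 positive and 50 negative cases, and the seed is arbitrary, so the interval should land near the published one without necessarily matching it.

```python
import random


def f1_score(pairs):
    """F1 over (label, prediction) pairs, where True means 'hallucinated'."""
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute F1 each time."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(pairs)
    stats = sorted(
        f1_score([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Reconstructed counts (assumption, see lead-in): 41 TP, 11 FP, 9 FN, 39 TN.
pairs = (
    [(True, True)] * 41
    + [(False, True)] * 11
    + [(True, False)] * 9
    + [(False, False)] * 39
)

point = f1_score(pairs)
lo, hi = bootstrap_ci(pairs)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

With only 100 cases the interval spans well over a tenth of an F1 point, which is why the table reports it alongside every score.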

All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon-eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.

Multi-judge agreement · per-judge accuracy

halueval_qa · N=50 pairs · temperature=0

#	Judge model	Accuracy	Precision	Recall	F1
1	gemini-2.5-flash	0.860	0.950	0.760	0.844
2	gpt-4o-mini	0.820	0.900	0.720	0.800
3	gpt-4o	0.780	0.792	0.760	0.776
4	claude-haiku-4-5	0.800	0.895	0.680	0.773
5	claude-sonnet-4-6	0.720	0.720	0.720	0.720

Pairwise judge agreement (Cohen's κ)

Judge pair	κ	Strength
claude-sonnet-4-6 ↔ gpt-4o	0.800	substantial
claude-haiku-4-5 ↔ gemini-2.5-flash	0.790	substantial
gpt-4o-mini ↔ gpt-4o	0.758	substantial
gpt-4o ↔ gemini-2.5-flash	0.758	substantial
gpt-4o-mini ↔ gemini-2.5-flash	0.750	substantial
claude-sonnet-4-6 ↔ gpt-4o-mini	0.720	substantial
claude-sonnet-4-6 ↔ gemini-2.5-flash	0.720	substantial
claude-haiku-4-5 ↔ gpt-4o	0.717	substantial
claude-haiku-4-5 ↔ gpt-4o-mini	0.706	substantial
claude-haiku-4-5 ↔ claude-sonnet-4-6	0.600	moderate

κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge-disagreement on hard cases — the examples that benefit most from cross-judge calibration.
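
Cohen's κ corrects raw agreement for the agreement two judges would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is expected chance agreement. A minimal Python sketch with the Landis & Koch bands used in the Strength column; the two verdict lists are made-up toy data, not benchmark output.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical verdicts."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two judges' marginal rates, summed over labels.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges always emit the same single label
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map kappa to the Landis & Koch (1977) strength label."""
    k = round(kappa, 3)  # band on the 3-decimal value, as reported in the table
    if k < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if k <= upper:
            return label
    return "almost perfect"


# Toy verdicts (1 = hallucinated, 0 = faithful); hypothetical, for illustration only.
judge_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.3f} ({landis_koch(kappa)})")
```

In the toy data the judges agree on 8 of 10 cases, yet κ is only 0.6 because half of that agreement is expected by chance with balanced labels.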

Cost & latency

50 cases · 4 LLM-judge evaluators · workers=1

Cost per case	$0.00127	17.1 judge calls per case
Total cost for the run	$0.0635	46,294 input + 6,605 output tokens
Wall clock	15.0 min	18.0s per case avg
Extrapolation to 5,000 cases	$6.35	Linear extrapolation; workers=1 (single LLM call at a time)
Deterministic tier (no LLM)	$0.00	3 evaluators, instant

Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
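
The extrapolation is plain multiplication. A Python sketch of the arithmetic using the figures in the table above; the wall-clock projection applies the same linear assumption at workers=1 and is an illustration, not a measured value.

```python
# Figures from the cost table above.
TOTAL_COST_USD = 0.0635
CASES = 50
WALL_CLOCK_S = 15.0 * 60  # 15.0 min

cost_per_case = TOTAL_COST_USD / CASES    # about $0.00127
seconds_per_case = WALL_CLOCK_S / CASES   # 18.0 s


def extrapolate_cost(n_cases):
    """Linear cost estimate; ignores judge-cache hits, as the page notes."""
    return n_cases * cost_per_case


def extrapolate_hours(n_cases):
    """Linear wall-clock estimate at workers=1 (one LLM call at a time)."""
    return n_cases * seconds_per_case / 3600


print(f"5,000 cases: ${extrapolate_cost(5000):.2f}, "
      f"about {extrapolate_hours(5000):.0f} h at workers=1")
```

Under these assumptions 5,000 cases cost $6.35 but take about 25 hours serially, so parallel workers or caching matter more for time than for budget.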

Reproducibility & cache

halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5

Mode	Wall clock	pass_rate σ	avg_score σ	N reps
Cache ON (rep 1 cold, rest warm)	8.5s	0.0000	0.0000	10
Cache OFF (every call hits the API)	74.3s	0.0422	0.0084	10

Cache hit speedup: 8.7× (cold → warm rerun). Cache ON reproduces identical scores rep-to-rep. Cache OFF exposes the irreducible judge variance at temperature=0. Raw JSON: reproducibility.json.
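
The speedup is the ratio of the two wall clocks, and the σ columns are standard deviations across the 10 reps. A short Python sketch; the per-rep pass rates are a hypothetical pattern (two of ten reps differing by a single case out of 10), used only to show the scale of spread involved, and are not taken from reproducibility.json.

```python
import statistics

# Wall clocks from the table above.
cache_on_s = 8.5
cache_off_s = 74.3
speedup = cache_off_s / cache_on_s
print(f"speedup: {speedup:.1f}x")  # speedup: 8.7x

# Hypothetical per-rep pass rates on 10 cases (illustration only).
cached_reps = [0.8] * 10               # identical scores every rep
uncached_reps = [0.8] * 8 + [0.9] * 2  # one case flips in two reps

sigma_cached = statistics.stdev(cached_reps)
sigma_uncached = statistics.stdev(uncached_reps)
print(f"sigma cached: {sigma_cached:.4f}, uncached: {sigma_uncached:.4f}")
```

With 10 cases per rep, a single flipped verdict moves the pass rate by 0.1, so even a small σ corresponds to whole cases changing outcome between reps.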

Reproduce locally

git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval/benchmarks
pip install -e .. anthropic openai google-genai
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

# pick one (each writes to benchmarks/results/*.json):
python run_hallucination_benchmark.py
python run_multi_judge_agreement_benchmark.py
python run_cost_latency_benchmark.py
python run_reproducibility_benchmark.py

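All four scripts run the judges at temperature=0, so reruns differ only through API non-determinism. Wall-clock estimates: ~50–120s each for the in-house judges, longer when external judges are added. Note that the cost benchmark, which enforces workers=1, reported a 15.0 min wall clock for its 50 cases.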

Submit a new judge

Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.

Open the external-judge harness →