Adversarial PDFs that break AI document readers.

Every case is a PDF generated from code. The answer is exactly known. No LLM-as-judge, no circular assurance — just procedural ground truth and a model that either passes or fails. The current suite is 510 cases across 17 trap families — 6 hand-authored, 11 discovered by the autoresearch loop.

Findings · mini-v4-sample · PDF runs 2026-05-24 · pixels runs 2026-06-12

Three model blind spots — and a leaderboard that nearly inverts.

The leaderboard surfaces three per-trap blind spots that reproduce across frontier models. The 2026-06-12 cross-modality twin run (the same cases sent once as PDFs and once as locally rasterised 150 dpi images) shows where those failures live: two sit in the provider's PDF-ingestion pipeline rather than the vision stack, and the third bites in both modalities. Every finding below reproduces on fresh random seeds.

Hidden-OCR trap · ingestion failure, not blindness

GPT-4o falls for the hidden-OCR trap, 10/10 — on PDF input only

GPT-4o returns the invisible PDF text-layer amount instead of the visible page amount, every time. Send the same cases as locally-rasterised pixels — no text layer to trust — and it passes 100%. The failure is text-layer trust rather than blindness. GPT-5 fixed most of the PDF-modality failure (80% pass).

Scale-dependent rendering · ingestion failure, not vision

Anthropic premium and reasoning tiers both fail the 3.5pt-footnote trap on PDF input, 0%

Opus 4-7 and Sonnet 4-6 both score 0/10 on the 3.5pt-footnote trap when the PDF goes through the provider's ingestion pipeline. Rasterise the same cases locally at 150 dpi and Opus passes 100% — the small print survives honest rasterisation, so the failure lives in PDF ingestion, not at a vision legibility threshold. Haiku 4-5 passes 90% and GPT-5 100% on PDF input.

Text-run fragmentation · bites in both modalities

Three models all fail the text-run-fragmentation trap

The binding amount is split across two adjacent text runs at the exact pen position: it reads as one continuous string on the page but is fragmented in the text stream. Sonnet 4-6, Opus 4-7, and Gemini Flash Lite all score 0/10 on PDF input; Haiku 4-5 passes 80%. Unlike the two findings above, this one only improves on pixels (Opus 0% → 30%) — it does not invert. Note: the published 0/10s were measured on the pre-0.6.1 trap design, which injected a literal U+200B that rendered a visible ■; the family was redesigned in pdfhell 0.6.1.
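To make "one continuous string on the page, fragmented in the text stream" concrete, here is a minimal Python sketch. It assumes the split is expressed as two consecutive `Tj` show-text operators inside one text object, which is one way to make the second run start at the exact pen position; the real generator may do it differently. The content stream, the amount, and the helper names `naive_runs` and `joined_text` are hypothetical.

```python
import re

# Hypothetical content stream: the amount "1,234.56" is shown as two runs.
# After the first Tj the text position sits at the end of "1,2", so the
# second Tj continues with no visible gap on the rendered page.
CONTENT_STREAM = b"BT /F1 12 Tf 72 700 Td (TOTAL DUE: $1,2) Tj (34.56) Tj ET"


def naive_runs(stream: bytes) -> list[str]:
    """Return the string shown by each Tj operator as its own run."""
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


def joined_text(stream: bytes, sep: str = "") -> str:
    """Join the runs; the separator is the extractor's guess, not the PDF's."""
    return sep.join(naive_runs(stream))


runs = naive_runs(CONTENT_STREAM)
```

An extractor that joins runs with an empty separator recovers the amount; one that inserts a space between runs yields "$1,2 34.56", which no longer contains the binding value.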

Within-provider gap · explained

Sonnet 4-6 scores 31 points below Haiku 4-5 — explained: two ingestion behaviors, same provider

Sonnet 60.6% vs Haiku 91.2% on PDF input (n=170) looked like an anomaly. The cross-modality run resolved it, and it was not a capability gap. Sonnet's PDF score (60.6%) ≈ its pixels score (62.9%), while Haiku drops 91.2% → 58.2% on pixels — Haiku's lead came from the text layer. Same provider, two different ingestion behaviors.

Full per-model × per-trap numbers are on the leaderboard. 11 of the 17 trap families were proposed and validated by an autoresearch loop; one of those (mirror_image_glyphs) catches five of the eight panel models at 0%.

mini-v1 → v4: successive revisions of the procedurally generated trap suite. v1 + v2 are the six hand-authored trap families; v3 + v4 are autoresearch-discovered additions. The current leaderboard suite (mini-v4-sample) is 17 trap families × 10 cases = 170 cases.

Cross-modality · 2026-06-12

Same 170 cases, two modalities — the ranking nearly inverts.

pdfhell 0.6.0 added --pixels: every case rasterised locally at 150dpi and sent as images, so the model never sees the PDF’s embedded text layer. The PDF-vs-pixels delta isolates the provider’s PDF-ingestion pipeline from the model’s actual vision. The PDF leaders collapse on pixels, and Opus 4-7 — a stronger pixel reader than PDF reader — takes the lead at 85.9%.

Model	PDF	Pixels @ 150 dpi	Δ (pts)
GPT-5	94.7%	67.6%	-27.1
Haiku 4-5	91.2%	58.2%	-33.0
Gemini Flash Lite	88.8%	60.6%	-28.2
GPT-4o	81.2%	60.0%	-21.2
Opus 4-7	79.4%	85.9%	+6.5
Gemini 2.5 Pro	67.1%	72.9%	+5.8
Sonnet 4-6	60.6%	62.9%	+2.3
Gemini 2.5 Flash	59.4%	65.3%	+5.9

Per-trap, the inversions are surgical: GPT-4o’s hidden_ocr_mismatch 0% → 100% (text-layer trust, not blindness) and Opus’s scale_dependent_rendering 0% → 100%. zero_width_space_split only improves (Opus 0% → 30%) — it does not invert; that trap bites in both modalities.
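The Δ column and the rank shift can be recomputed from the table. The following is a small Python sketch, not part of pdfhell: the scores are copied from the table above, and the function names and the choice of Spearman's rank correlation as a summary of "how much the ranking moved" are assumptions made for illustration.

```python
# Scores copied from the cross-modality table: (PDF %, pixels %).
SCORES = {
    "GPT-5": (94.7, 67.6),
    "Haiku 4-5": (91.2, 58.2),
    "Gemini Flash Lite": (88.8, 60.6),
    "GPT-4o": (81.2, 60.0),
    "Opus 4-7": (79.4, 85.9),
    "Gemini 2.5 Pro": (67.1, 72.9),
    "Sonnet 4-6": (60.6, 62.9),
    "Gemini 2.5 Flash": (59.4, 65.3),
}


def deltas(scores):
    """Pixels minus PDF, in percentage points, rounded to one decimal."""
    return {m: round(px - pdf, 1) for m, (pdf, px) in scores.items()}


def ranks(scores, column):
    """Rank models by one column (0 = PDF, 1 = pixels); 1 is best."""
    ordered = sorted(scores, key=lambda m: scores[m][column], reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}


def spearman(scores):
    """Spearman rank correlation between the two columns (no ties here)."""
    pdf_rank, px_rank = ranks(scores, 0), ranks(scores, 1)
    n = len(scores)
    d2 = sum((pdf_rank[m] - px_rank[m]) ** 2 for m in scores)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

On these eight rows the sketch gives a rank correlation of about -0.31: Opus 4-7 moves from 5th on PDF to 1st on pixels, and Haiku 4-5 from 2nd to 8th.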

Run hygiene (the 2026-05-24 retraction made this a publishing gate): the maximum api_error_rate across the panel was 0.6%. gemini-2.5-pro refused 11.2% of pixel inputs; refusals are scored as failures and disclosed, not dropped.

Pixels column on the leaderboard →
Method + per-trap pixels data · pdfhell#1 →
README cross-modality section →

Retraction · 2026-05-24

The original Opus 4-7 headline was wrong, and we retracted it.

Earlier versions of the repo README, the 0.4.0 / 0.5.0 release notes, and the original CONFIRMATION_REPORT claimed that Claude Opus 4-7 scores 0% on all seven mini-v4 trap families. That claim was an eval artifact, not a model failure: every Opus call had failed with a "temperature deprecated" provider API error, and the runner silently scored each error as a wrong answer.

The finding was retracted on 2026-05-24 and the leaderboard re-run. The numbers on this page are post-correction: Opus 4-7 scores 79.4% overall, and two narrower per-trap failures (scale_dependent_rendering and zero_width_space_split, both 0/10) survive correction. The cross-modality run above later traced the first to PDF ingestion rather than vision; the second only improves on pixels (0% → 30%) and bites in both modalities. The retraction notice is permanent in the repo README and the full post-mortem stays in the research directory.

README retraction notice →
CORRECTION_NOTICE.md →

Quickstart

Install (zero setup with uv)

Generate one agent-discovered trap PDF for inspection

Reproduce a real ingestion blind spot (Opus fails scale_dependent_rendering 0/10 on PDF input; add --pixels and it inverts)

Or run your own autoresearch loop to discover new traps

Provider shorthand: anthropic:, openai:, google:. API key from env (ANTHROPIC_API_KEY, etc.).

The 17 traps

Six hand-authored — the originals from mini-v1 and mini-v2. Eleven proposed by an autoresearch loop — a rotation of Opus 4-7, GPT-5, and Gemini 2.5 Pro proposing candidate trap generators against pdfhell’s 8-model eval panel; each one here survived six validation gates (glyph_clean added in 0.6.1) and human curation. Three deep-dives below.

hand-authored

mini-v1 / v2

Hidden OCR mismatch

hidden_ocr_mismatch

Invisible text layer disagrees with the rendered page.

hand-authored

mini-v1 / v2

Footnote override

footnote_override

A 6pt footnote overrides the body clause.

hand-authored

mini-v1 / v2

Split table across pages

split_table_across_pages

Header on page 1, body rows on page 2 — no header repeat.

autoresearch · mini-v3

Opus 4-7

Unicode confusable total

unicode_confusable_total

A digit-zero confusable: 'T0TAL:' (U+0030) vs 'TOTAL:'; a printed clause names which label binds. Redesigned in 0.6.1: through 0.6.0 the family used a Cyrillic О that actually rendered as a visible tofu box. Re-measured 2026-06-12 on the redesign: Gemini 2.5 Pro scores only 20%; four models score 100%.

autoresearch · mini-v3

GPT-5

Mirrored footer notice

mirrored_footer_notice

Binding amount only in a horizontally-mirrored footer notice; vision-only readers fall back to the visible decoy.

autoresearch · mini-v3

Gemini 2.5 Pro

Zero width space split

zero_width_space_split

The binding amount is split across two adjacent text runs at the exact pen position: one continuous string on the page, fragmented in the text stream. Redesigned in 0.6.1 — through 0.6.0 the family injected a literal U+200B that rendered a visible ■. Re-measured 2026-06-12 on the redesign: Opus 4-7 still fails 0/10 while all seven other panel models score 100% — the blind spot survived clean methodology; Flash-Lite's old 0% (the artifact) is withdrawn.

autoresearch · mini-v3

GPT-5

Currency mismatch conversion

currency_mismatch_conversion

Invoice headlines a EUR total; settlement clause requires USD payment at a stated FX rate. Salience-only readers grab EUR.

autoresearch · mini-v4

Opus 4-7

Em dash minus sign

em_dash_minus_sign

Em/en-dash codepoint where a minus sign is expected; a printed clause names the binding interpretation.

autoresearch · mini-v4

Opus 4-7

Upside down amount

upside_down_amount

180-degree rotated binding amount in a labelled box. Vision pipelines that skip rotated regions fall to the upright decoy.

autoresearch · mini-v4

Opus 4-7

Checksum validation rule

checksum_validation_rule

Printed rule: pick the candidate whose digit-sum mod K equals N. Pure procedural rule-following test.

autoresearch · mini-v4

Opus 4-7

Mirror image glyphs

mirror_image_glyphs

Horizontally-mirrored glyphs in the binding amount. 5 of 8 panel models at 0%; only OpenAI passes cleanly.

autoresearch · mini-v4

GPT-5

Boldface binding rule

boldface_binding_rule

Printed rule: 'use the boldface amount'. Visual-property × printed-rule combination.

autoresearch · mini-v4

GPT-5

Shaded box binding rule

shaded_box_binding_rule

Printed rule: 'use the amount in the shaded box'. Layout-property × printed-rule combination.

autoresearch · mini-v4

Gemini 2.5 Pro

Color grounding trap

color_grounding_trap

Printed rule: 'use the red amount'. Semantic colour grounding × printed-rule combination.
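The rule behind the checksum_validation_rule card above ("pick the candidate whose digit-sum mod K equals N") is simple enough to write out. This is a minimal Python sketch of the rule as stated; the candidate amounts, K = 7, and N = 4 are made-up illustration values, and `digit_sum` and `binding_candidates` are hypothetical names rather than pdfhell functions.

```python
def digit_sum(amount: str) -> int:
    """Sum the ASCII digits in an amount, ignoring '$', ',' and '.'."""
    return sum(int(ch) for ch in amount if ch in "0123456789")


def binding_candidates(candidates, k, n):
    """Return every candidate whose digit sum is congruent to n modulo k."""
    return [c for c in candidates if digit_sum(c) % k == n]


# Hypothetical page: three candidate totals and the printed rule "mod 7 = 4".
CANDIDATES = ["$4,310.00", "$2,750.00", "$1,982.50"]
binding = binding_candidates(CANDIDATES, k=7, n=4)
```

The digit sums here are 8, 14, and 25, so only the last candidate satisfies the rule. A well-formed case would presumably guarantee exactly one match, which is why the helper returns a list that a generator could check for length 1.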

Three deep-dives

Three representative traps — two of the original hand-authored families (the live GPT-4o and Anthropic-premium blind spots) and one from the autoresearch loop (the strongest cross-provider replication in mini-v4). Full per-trap generator source for all 17 lives in pdfhell/generators.

hand-authored · hidden_ocr_mismatch

Hidden OCR mismatch

Invisible text layer disagrees with the rendered page.

How it's generated

An invoice with realistic line items and totals. The visible TOTAL DUE is set by the seed. A second copy of the total — with a different amount — is written into the PDF's text content stream using PDF render mode 3 (placed in the text stream but never rasterised). A human sees one amount; a text-extraction pipeline sees another.
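The mechanism can be shown without any PDF library. The sketch below hand-assembles a one-page PDF in pure Python with two copies of the total at the same position: one in text render mode 0 (filled, visible) and one in render mode 3 (invisible). The suite itself builds its PDFs with reportlab; this hand-written writer, the function name `build_pdf`, and the amounts are assumptions chosen so the example runs with only the standard library.

```python
def build_pdf(visible: str, hidden: str) -> bytes:
    """Assemble a minimal one-page PDF with a visible and an invisible total.

    Amounts must not contain parentheses or backslashes (no escaping is done).
    """
    content = (
        # Render mode 0: normal filled text, the amount a human sees.
        f"BT /F1 14 Tf 72 720 Td 0 Tr (TOTAL DUE: {visible}) Tj ET\n"
        # Render mode 3: present in the text stream, never painted.
        f"BT /F1 14 Tf 72 720 Td 3 Tr (TOTAL DUE: {hidden}) Tj ET\n"
    ).encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)
```

Writing `build_pdf("$1,480.00", "$9,999.00")` to a file and opening it should show only the first amount on the page, while a text extractor that reads the content stream would encounter both.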

What it detects

Vision models that trust the page (correct) vs. agents that fuse a text-extraction layer with vision output and silently prefer the layer (incorrect). The most common silent failure in scanned-then-OCR'd document pipelines in production.

Expected failure mode

Model answers the hidden amount instead of the visible amount — diagnosable from the recorded forbidden-answer match.

hand-authored · scale_dependent_rendering

Scale-dependent rendering

The headline number is a decoy; the binding value sits in a 3.5pt footnote — and whether the model ever sees that footnote depends on how its provider ingests the PDF.

How it's generated

An invoice with a large, prominent decoy total and a 3.5pt footnote naming the real binding amount (1.5–3× the decoy). The question explicitly instructs the model to read all small print, so it can't claim the prompt was ambiguous: either the footnote survived the provider's PDF-ingestion pipeline or it didn't.

What it detects

Provider PDF-ingestion pipelines that lose the sub-threshold footnote before the model ever sees it. Opus 4-7 and Sonnet 4-6 both score 0/10 on PDF-modality input — but rasterise the same cases locally at 150 dpi (--pixels) and Opus passes 100%. The 2026-06-12 cross-modality run pins the failure on PDF ingestion, not on a vision legibility threshold. Haiku 4-5 passes 90% and GPT-5 100% on PDF input.
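A quick conversion shows how small the footnote is once rasterised. This is a worked-arithmetic sketch in Python using the standard 72 points per inch; `pt_to_px` is a hypothetical helper, and the result is the nominal font size in pixels, not the height of any particular glyph.

```python
POINTS_PER_INCH = 72  # PDF default user space: 1 pt = 1/72 inch


def pt_to_px(size_pt: float, dpi: float) -> float:
    """Convert a nominal font size in points to pixels at a given DPI."""
    return size_pt * dpi / POINTS_PER_INCH


footnote_px = pt_to_px(3.5, 150)  # the 3.5pt footnote at 150 dpi
override_px = pt_to_px(6, 150)    # the 6pt footnote_override size, for scale
```

At 150 dpi the 3.5pt footnote has a nominal size of roughly 7.3 px, against 12.5 px for a 6pt footnote. Any ingestion path that renders at a lower resolution than 150 dpi would shrink it further, although the page does not state what resolution providers use.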

Expected failure mode

Model returns the large decoy total instead of the footnote's binding amount — diagnosable from the recorded forbidden-answer match.

autoresearch · Opus 4-7 · mini-v4 · mirror_image_glyphs

Mirror-image glyphs

Horizontally-mirrored binding amount. 5 of 8 panel models at 0%; only OpenAI passes cleanly.

How it's generated

A single-page invoice with two candidate totals: one in upright glyphs (the visible decoy) and one with horizontally-mirrored glyphs (each character flipped left-to-right). A printed rule on the same page names the mirrored amount as binding. Each mirrored character is drawn from a flipped glyph path, so the mirror is intrinsic to the PDF — not a layer transform a smart OCR pipeline could detect and unmirror.
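The difference between a flipped glyph path and a layer transform can be sketched in a few lines of Python. The example below mirrors the coordinates of an outline directly, so the stored path itself is reversed and there is no transform matrix to detect. The crude "L" outline and the name `mirror_horizontally` are hypothetical; real glyph outlines also contain curves, which this sketch ignores.

```python
Point = tuple[float, float]


def mirror_horizontally(path: list[Point]) -> list[Point]:
    """Flip a path left-to-right about the vertical centre of its bounding box."""
    xs = [x for x, _ in path]
    axis_sum = min(xs) + max(xs)  # x' = x_min + x_max - x keeps the same box
    return [(axis_sum - x, y) for x, y in path]


# Hypothetical outline of a blocky "L": vertical stem on the left, foot on the right.
L_OUTLINE = [(0, 0), (0, 10), (2, 10), (2, 2), (6, 2), (6, 0)]
MIRRORED_L = mirror_horizontally(L_OUTLINE)
```

The mirrored outline has its stem on the right and occupies the same bounding box, so it sits where the upright glyph would have been. Applying the function twice returns the original path.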

What it detects

Vision pipelines that don't normalise glyph orientation, or that treat mirrored glyphs as noise and fall back to the upright decoy. Five of eight panel models score 0/10; only OpenAI's family reads the mirrored amount cleanly. Strongest single replication signal in mini-v4: the trap catches models from more than one frontier provider.

Expected failure mode

Model returns the upright decoy amount, or returns the visually-similar but semantically-wrong character from a literal OCR of the mirrored glyph (e.g. b vs d).

Procedural ground truth, not vibes.

Generated from code

Every trap is a Python generator with a deterministic seed. The PDF is constructed cell-by-cell with reportlab, and the answer key is whatever literal value the generator chose. Re-running with the same seed produces byte-identical PDFs and identical answer keys.
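The seed-to-answer-key contract can be illustrated with a toy generator. This Python sketch is not a pdfhell generator: `make_case` and `fingerprint` are hypothetical names, and a short byte payload stands in for the PDF. The point it shows is that a private, seeded random source makes both the output bytes and the answer key a pure function of the seed.

```python
import hashlib
import random


def make_case(seed: int) -> tuple[bytes, str]:
    """Toy stand-in for a trap generator: returns (payload_bytes, answer_key)."""
    rng = random.Random(seed)  # private RNG, so no global state leaks in
    cents = rng.randrange(10_000, 1_000_000)
    answer = f"${cents // 100:,}.{cents % 100:02d}"
    payload = f"TOTAL DUE: {answer}\n".encode("ascii")
    return payload, answer


def fingerprint(seed: int) -> str:
    """SHA-256 of the payload, a cheap way to check byte-identical output."""
    payload, _ = make_case(seed)
    return hashlib.sha256(payload).hexdigest()
```

For real PDFs, byte-identical output usually also requires pinning metadata such as the creation date and document ID. reportlab exposes an `invariant` canvas option for that purpose; whether pdfhell relies on it is an assumption here.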

Scored by string match

The headline correctness signal is contains-match (whitespace-tolerant, case-insensitive, currency-prefix-tolerant) between the model's free-text answer and the expected value. No LLM judges. No circular assurance.
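A contains-match with those three tolerances might look like the following Python sketch. The exact normalisation pdfhell applies is not given on this page, so the rules here (drop all whitespace, casefold, strip a handful of currency symbols) and the names `normalise` and `contains_match` are assumptions.

```python
import re

# Assumed set of currency prefixes to ignore; the real list may differ.
_CURRENCY = "$€£¥"


def normalise(text: str) -> str:
    """Casefold, remove all whitespace, and drop currency symbols."""
    text = text.casefold()
    text = re.sub(r"\s+", "", text)
    return "".join(ch for ch in text if ch not in _CURRENCY)


def contains_match(answer: str, expected: str) -> bool:
    """True if the normalised expected value appears in the normalised answer."""
    return normalise(expected) in normalise(answer)
```

A substring test is deliberately lenient about surrounding prose, but it has a known edge: an expected value of "234.56" would also match an answer of "1,234.56". A stricter variant could require a non-digit boundary on each side of the match.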

Designed failure modes

Each trap names the specific failure it elicits (e.g. 'model trusted hidden OCR over visible page'). Forbidden-answer detection records when a model fell into the designed trap vs. hallucinated some third value. The diagnostic is the product, not the score.
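The three-way outcome described here (passed, fell into the designed trap, or produced some third value) can be sketched as a small classifier. This is an illustration in Python, not pdfhell's scorer: the `Outcome` enum, `classify`, the simplified `_squash` normaliser, and the choice to let a correct answer win when both values appear are all assumptions.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    TRAPPED = "trapped"      # answer contains a designed forbidden value
    OTHER = "other_wrong"    # wrong, but not the designed decoy


def _squash(text: str) -> str:
    """Casefold and remove all whitespace (a simplified normaliser)."""
    return "".join(text.casefold().split())


def classify(answer: str, expected: str, forbidden: list[str]) -> Outcome:
    """Classify a free-text answer against the answer key and decoys."""
    got = _squash(answer)
    if _squash(expected) in got:
        return Outcome.PASS  # assumption: correct value wins if both appear
    if any(_squash(f) in got for f in forbidden):
        return Outcome.TRAPPED
    return Outcome.OTHER
```

For a hidden_ocr_mismatch case, the forbidden list would hold the invisible text-layer amount, so a `TRAPPED` result says the model trusted the layer rather than the page.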

QAG is the explanation, not the score

multivon-eval's DocumentGrounding (QAG-based) is available as a separate layer for users who want a human-readable 'why did this fail' breakdown. It runs after primary scoring, so a judge-model failure cannot change the pass/fail signal.

Trap discovery is itself a research artifact

The autoresearch loop (pdfhell.research) doesn't just propose traps — it logs every candidate, every researcher, every dollar to results.tsv + budget.jsonl. The 11 mini-v3/v4 traps are reproducible from the same artifacts. METHODOLOGY.md and CONFIRMATION_REPORT.md formalise the claim and the validation.

Validation independent of discovery

Every trap a model fails during discovery is re-tested on a fresh, non-overlapping set of random seeds before it counts. This discipline is also what bounded the 2026-05-24 retraction: after the temperature-bug fix, the full panel was re-run, the broad 'Opus fails everything' claim did not survive, and the two narrower Opus failures that did (scale_dependent_rendering, zero_width_space_split) reproduce on fresh seeds. See the retraction note above.

Reproducible by design

Every case is produced from a deterministic seed; same seed → byte-identical PDF. Anyone can re-derive any case with pdfhell make --trap X --seed N. No private holdout; the methodology is its own holdout. mini-v4 reserves non-overlapping seed ranges per trap family so any subset can be regenerated independently.

The full mini-v4 suite is 17 trap families × 30 cases = 510 cases, with the seed ranges below. The public leaderboard runs mini-v4-sample: the first 10 seeds of each range (e.g. 1001–1010), so 170 cases total. That is a strict subset at one third the size, and cheaper to run (~$17 vs ~$100 across the full model panel). Published benchmark comparisons that need statistical power use the full 510.

Trap family	Seeds	Cases
hidden_ocr_mismatch	1001 – 1030	30
footnote_override	2001 – 2030	30
split_table_across_pages	3001 – 3030	30
composite_trap	4001 – 4030	30
scale_dependent_rendering	5001 – 5030	30
cross_page_coreference	6001 – 6030	30
unicode_confusable_total	7001 – 7030	30
zero_width_space_split	7101 – 7130	30
currency_mismatch_conversion	7201 – 7230	30
mirrored_footer_notice	7301 – 7330	30
em_dash_minus_sign	8001 – 8030	30
upside_down_amount	8101 – 8130	30
checksum_validation_rule	8201 – 8230	30
mirror_image_glyphs	8301 – 8330	30
boldface_binding_rule	8401 – 8430	30
shaded_box_binding_rule	8501 – 8530	30
color_grounding_trap	8601 – 8630	30
mini-v4 (full) · 17 families	first 10/family → mini-v4-sample (170)	510
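The case counts follow directly from the seed ranges. The Python sketch below copies the first seed of each family from the table and derives both the full suite and the sample; the dictionary and function names are hypothetical, not pdfhell's own.

```python
# First seed of each family's range, copied from the table above.
FIRST_SEED = {
    "hidden_ocr_mismatch": 1001,
    "footnote_override": 2001,
    "split_table_across_pages": 3001,
    "composite_trap": 4001,
    "scale_dependent_rendering": 5001,
    "cross_page_coreference": 6001,
    "unicode_confusable_total": 7001,
    "zero_width_space_split": 7101,
    "currency_mismatch_conversion": 7201,
    "mirrored_footer_notice": 7301,
    "em_dash_minus_sign": 8001,
    "upside_down_amount": 8101,
    "checksum_validation_rule": 8201,
    "mirror_image_glyphs": 8301,
    "boldface_binding_rule": 8401,
    "shaded_box_binding_rule": 8501,
    "color_grounding_trap": 8601,
}
FULL_PER_FAMILY = 30    # mini-v4
SAMPLE_PER_FAMILY = 10  # mini-v4-sample


def family_seeds(family, per_family=FULL_PER_FAMILY):
    """Seeds for one family: the first `per_family` values of its range."""
    first = FIRST_SEED[family]
    return list(range(first, first + per_family))


def suite(per_family):
    """All (family, seed) pairs for a suite of the given per-family size."""
    return [(f, s) for f in FIRST_SEED for s in family_seeds(f, per_family)]
```

Each (family, seed) pair corresponds to one `pdfhell make --trap X --seed N` invocation as described above, so any subset of the suite can be regenerated on its own.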

Want PDF Hell on your document templates?

The OSS suite covers 17 trap families against generic invoices, contracts, and financial tables. If you need adversarial variants of your own MSAs / claim forms / EOBs / medical records, or a custom trap family for your vertical, drop us a line — inbound only, no fake pricing.


FAQ

How is this different from DocVQA / MMBench / SWE-Bench / etc.?

Existing document benchmarks measure correctness on clean, naturally-occurring PDFs. PDF Hell measures correctness under adversarial PDF structures that are common in production but absent in academic benchmarks (hidden OCR layers, footnote overrides, page-broken tables). It's a stress test, not a sufficient eval — pair it with a domain benchmark for full coverage.

Why no LLM-as-judge?

Because the same complexity that fools a document AI also fools a judge model. We've watched customers buy 'Evidence Pack' reports and discover the judge passed an answer that contradicted the source. PDF Hell removes the judge from the scoring path. The answer is set by code; the model's output is matched against a known string.

Can I add my own traps?

Yes. Each generator is a single Python file under pdfhell/generators/. Pattern: take a seed, draw a PDF with reportlab, return (pdf_bytes, HellCase) with the answer key. Submit a PR; if the trap surfaces a new failure mode the existing families don't catch, we'll land it.

How were the 11 newer trap families discovered?

By an autoresearch loop inspired by Karpathy's autoresearch repo. Three rotating researcher LLMs (Opus 4-7, GPT-5, Gemini 2.5 Pro) propose candidate trap generators. Each candidate must pass six validation gates (parseable, glyph-clean, deterministic, answerable, forbidden-clean, lint-clean) before any vision-eval spend. The glyph_clean gate was added in 0.6.1 after the tofu-box bug (pdfhell#8), when two families turned out to render substitution glyphs where they claimed visual normality. A candidate's score is spread × novelty across the 8-model panel, gated by solvability. The agent doesn't merge its own work: every kept candidate sits in pdfhell/research/keep/ until a human curator promotes it. Two overnight runs ($88 total) yielded 11 promotable trap families. See pdfhell/research/METHODOLOGY.md and CONFIRMATION_REPORT.md.

Is the Opus 4-7 finding real?

The original version of it (0% on all seven trap families) was not — it was retracted on 2026-05-24 after we found a provider temperature bug had been scored as wrong answers; see the retraction note above. What survives correction: Opus 4-7 scores 79.4% overall on mini-v4-sample (PDF modality) and fails scale_dependent_rendering and zero_width_space_split at 0/10 each — narrower, real, and replicated on fresh seeds. The 2026-06-12 cross-modality run then explained where those failures live: scale_dependent_rendering inverts to 100% on locally-rasterised pixels (a PDF-ingestion failure, not vision), while zero_width_space_split only improves to 30% — and that family was itself redesigned in 0.6.1 after the pixels run exposed a rendering bug in the trap (pdfhell#8). The full post-mortem is in pdfhell/research/CORRECTION_NOTICE.md. Anyone with API keys can re-run the suite and verify.

What about voice / images / code?

PDF Hell is the first wedge: documents are the most concrete entry point. The pattern (procedural adversarial input + code-based ground truth + designed failure modes) generalises. code-hell, voice-hell, and image-hell are on the roadmap. Subscribe to the GitHub repo for releases.

Is this Multivon's product or just a research project?

Both, but the answer surprises some people. PDF Hell is open-source under Apache 2.0 (the public benchmark) and the full SDK + benchmark + CI integration are free. Multivon is pre-product and pre-revenue — we're not selling SaaS, not running a hosted service, not packaging an enterprise tier. If you need on-prem, custom trap families, or paid support for regulated deployments, email hello@multivon.ai. We respond to inbound; we don't have a price sheet.