Three eval frameworks disagree on 56% of cases.

Cohen's κ = 0.03. What we measured, what we found, and what to do about it.

Featured finding·Benchmarks·12 min read·May 13, 2026

Three open-source eval frameworks disagree on 56% of cases. Cohen's κ = 0.03.

Same judge, same dataset, same seed. multivon-eval, DeepEval, and RAGAS produce different verdicts on more than half the cases. The detection-prompt gap is small; the shipped-calibration gap is huge. Full repo, methodology, and circularity disclosure included.
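For readers unfamiliar with the statistic, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The sketch below is a minimal pairwise version in plain Python. The function name `cohens_kappa` and the toy verdict lists are illustrative assumptions, not the benchmark's code or data, and this teaser does not say how the post aggregates κ across the three frameworks.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Illustrative helper (hypothetical name), not taken from any framework.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(rater_a)

    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' marginal
    # frequencies, then sum over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy verdicts (made up): 1 = "hallucination", 0 = "faithful".
framework_a = [1, 1, 0, 0]
framework_b = [1, 0, 1, 0]
kappa = cohens_kappa(framework_a, framework_b)
```

In the toy data the two raters agree on half the cases, which is exactly what chance predicts from their label frequencies, so κ comes out at zero. A κ near zero, as in the headline, means the raters agree about as often as chance alone would produce.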

Benchmarks·9 min read·May 14, 2026

Cross-dataset calibration: F1=0.787 on data it never trained on

v2 benchmark: we re-ran the head-to-head on RAGTruth, a dataset multivon-eval's calibration has never seen. At our HaluEval-derived threshold, F1=0.787 on RAGTruth, higher than the in-distribution 0.63. This replaces the v1 circularity caveat with real numbers.

Evaluation·7 min read·April 26, 2026

Structured extraction: 10–15% of LLM outputs fail to parse, and most evals miss it

Models fail to return parseable output on 10–15% of prompts. Here's why standard evals miss it, and how to catch format failures before they hit production.
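One simple way to surface format failures is to measure them directly, before any quality scoring. The sketch below assumes the expected output format is a JSON object, which this teaser does not state. The function name `parse_failure_rate` and the `required_keys` parameter are hypothetical, and the sample outputs are made up.

```python
import json


def parse_failure_rate(outputs, required_keys=()):
    """Fraction of raw model outputs that are not a JSON object with the required keys.

    Assumes JSON is the target format (an assumption for illustration).
    """
    if not outputs:
        raise ValueError("outputs must be non-empty")

    failures = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # Not valid JSON at all, e.g. the model wrapped the object in prose.
            failures += 1
            continue
        # Valid JSON can still be the wrong shape: not an object, or missing keys.
        if not isinstance(parsed, dict) or any(key not in parsed for key in required_keys):
            failures += 1

    return failures / len(outputs)


# Made-up raw outputs covering the common failure shapes.
sample_outputs = [
    '{"name": "a"}',                          # parses and has the key
    'Sure! Here is the JSON: {"name": "b"}',  # prose wrapper, fails to parse
    '[1, 2]',                                 # valid JSON, but not an object
    '{"other": 1}',                           # object, but missing "name"
]
rate = parse_failure_rate(sample_outputs, required_keys=("name",))
```

Reporting this rate as its own metric keeps a format failure from being silently scored as a wrong answer, or skipped altogether.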

Evaluation·8 min read·April 26, 2026

Single-run LLM evals routinely misrank models (NAACL 2025)

Single-run evaluation scores are so noisy they routinely misrank models and miss regressions. Here's the research behind it, and what to do instead.

Technical·5 min read·April 24, 2026

QAG vs LLM-as-Judge: Why We Score With Questions, Not Numbers

Asking a model to rate output on a 1–10 scale introduces its own hallucination risk. There's a more reliable way: generate yes/no questions and score by the fraction answered correctly.
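The scoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not multivon-eval's implementation: `qag_score` is a hypothetical name, `answer_fn` stands in for whatever model call answers each yes/no question about the output, and the questions here are written by hand rather than generated.

```python
from typing import Callable, Sequence, Tuple


def qag_score(
    questions: Sequence[Tuple[str, bool]],
    answer_fn: Callable[[str], bool],
) -> float:
    """Score as the fraction of yes/no questions answered as expected.

    `questions` pairs each question with its expected answer (True = yes).
    `answer_fn` is a placeholder for the judge that answers each question.
    """
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(answer_fn(question) == expected for question, expected in questions)
    return correct / len(questions)


# Hand-written questions with expected answers (illustrative only).
checks = [
    ("Does the output say the capital of France is Paris?", True),
    ("Does the output mention a population figure?", False),
    ("Does the output cite a source?", True),
]


def stub_judge(question: str) -> bool:
    """Stand-in for a model call: answers yes only when the question mentions Paris."""
    return "Paris" in question


score = qag_score(checks, stub_judge)
```

Because each question has a binary answer, the final score is a plain ratio that can be traced back to the individual checks that failed, unlike a single 1–10 rating.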
