Three open-source eval frameworks disagree on 56% of cases. Cohen's κ = 0.03.
What we measured, what we found, and what to do about it.
Same judge, same dataset, same seed. multivon-eval, DeepEval, and RAGAS still produce different verdicts on more than half the cases. The detection-prompt gap is small; the shipped-calibration gap is huge. Full repo, methodology, and circularity disclosure included.
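For readers unfamiliar with the statistic, here is a minimal sketch of Cohen's κ for two sets of pass/fail verdicts. It is not the benchmark's own code: the function name `cohens_kappa` and the sample verdicts are illustrative assumptions, and it covers only the pairwise (two-rater) case. It shows why raw agreement alone can mislead: two raters who agree half the time on balanced labels score κ = 0, no better than chance.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical verdicts."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: share of cases where both raters give the same verdict.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each rater's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always emit the same single label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = pass, 0 = fail) from two frameworks on eight cases.
framework_a = [1, 1, 0, 0, 1, 0, 1, 0]
framework_b = [1, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(framework_a, framework_b)
```

In this made-up example the two lists agree on 4 of 8 cases, and chance agreement is also 0.5, so κ comes out at 0.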
Cross-dataset calibration: F1=0.787 on data it never trained on
v2 benchmark: we re-ran the head-to-head on RAGTruth, a dataset multivon-eval's calibration has never seen. At our HaluEval-derived threshold, F1 = 0.787 on RAGTruth, higher than the in-distribution 0.63. That removes the v1 circularity caveat with real numbers.
Structured extraction: 10–15% of LLM outputs fail to parse, and most evals miss it
Models fail to return parseable output on 10–15% of prompts. Here's why standard evals miss it, and how to catch format failures before they hit production.
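One way to surface format failures is to measure them separately from answer quality. The sketch below assumes the expected output is a JSON object with a known set of keys; `parse_failure_rate`, the `required_keys` parameter, and the sample outputs are hypothetical names and data for illustration, not taken from the post.

```python
import json


def parse_failure_rate(outputs, required_keys=()):
    """Share of raw model outputs that are not a JSON object with the required keys."""
    if not outputs:
        return 0.0
    failures = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            failures += 1  # not valid JSON at all (e.g. chatty preamble)
            continue
        # Valid JSON can still be the wrong shape: a list, or a dict missing fields.
        if not isinstance(obj, dict) or any(k not in obj for k in required_keys):
            failures += 1
    return failures / len(outputs)


# Hypothetical raw outputs for a {"name", "age"} extraction task.
sample_outputs = [
    '{"name": "a", "age": 3}',
    'Sure! Here is the JSON: {"name": "b"}',
    '{"name": "c"}',
    '[1, 2]',
]
rate = parse_failure_rate(sample_outputs, required_keys=("name", "age"))
```

Tracking this rate as its own metric keeps a parse failure from being silently scored as a wrong answer, or skipped altogether.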
Single-run LLM evals routinely misrank models (NAACL 2025)
Single-run evaluation scores are so noisy they routinely misrank models and miss regressions. Here's the research behind it, and what to do instead.
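As a rough illustration of the alternative to a single run, the sketch below summarizes repeated runs by mean and standard error, and only calls two models different when the gap exceeds a normal-approximation margin. The function names, the 1.96 multiplier, and the scores are assumptions for illustration; this is not the method from the cited research.

```python
import math
import statistics


def summarize(scores):
    """Mean and standard error of the mean over repeated eval runs."""
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, std_err


def clearly_different(scores_a, scores_b, z=1.96):
    """True if the gap in means exceeds z combined standard errors (normal approximation)."""
    mean_a, se_a = summarize(scores_a)
    mean_b, se_b = summarize(scores_b)
    return abs(mean_a - mean_b) > z * math.sqrt(se_a**2 + se_b**2)


# Hypothetical accuracy scores from five runs of each model.
model_a = [0.70, 0.74, 0.66, 0.72, 0.68]
model_b = [0.71, 0.69, 0.73, 0.67, 0.75]
model_c = [0.90, 0.91, 0.89, 0.92, 0.88]
```

With these made-up numbers, any single run could rank `model_a` above or below `model_b`, while the repeated-run comparison treats them as indistinguishable and still separates `model_c`.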
QAG vs LLM-as-Judge: Why We Score With Questions, Not Numbers
Asking a model to rate output on a 1-10 scale introduces its own hallucination risk. There's a more reliable way: generate yes/no questions and score by the fraction answered correctly.
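The scoring step described above can be sketched in a few lines. This assumes the yes/no questions and their expected answers already exist and that some judge callable answers each question with a boolean; `qag_score`, `ask`, and the stub judge are hypothetical stand-ins, not the multivon-eval API.

```python
def qag_score(questions, expected, ask):
    """Fraction of yes/no questions whose judged answer matches the expected answer.

    `ask` is any callable mapping a question string to True (yes) or False (no);
    in practice it would wrap an LLM call, which is stubbed out here.
    """
    if len(questions) != len(expected) or not questions:
        raise ValueError("need one expected answer per question")
    correct = sum(ask(q) == want for q, want in zip(questions, expected))
    return correct / len(questions)


# Hypothetical questions about a generated summary, with expected answers.
questions = [
    "Does the summary state the invoice total?",
    "Does the summary name the customer?",
    "Does the summary mention a refund?",
    "Does the summary include a due date?",
]
expected = [True, True, False, True]

# Stub judge: a fixed lookup standing in for a model call.
stub_answers = dict(zip(questions, [True, False, False, True]))
score = qag_score(questions, expected, stub_answers.get)
```

Because each question has a checkable answer, a low score can be traced back to the specific questions that failed, which a single 1-10 rating does not allow.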