When you ask an LLM to rate a response on a scale of 1 to 10, you're asking it to do something humans are also bad at: produce a precise numeric judgment with no clear reference point.
What does a 7 mean? Is it 70% good? Better than average? The model doesn't know either. It produces a number that feels approximately right, influenced by factors that have nothing to do with the quality you're trying to measure.
There's a better approach, and it's been sitting in the evaluation literature for a while: QAG — Question-Answer Generation scoring.
How QAG works
Instead of asking "how good is this response?", QAG does two things:
1. Generates a set of specific yes/no questions about the ideal output. For a faithfulness eval, this might be: "Does the response claim the company was founded in 2024? Does the response mention all four product lines? Does the response cite a statistic not found in the context?"
2. Scores by the fraction answered correctly. If 8 of 10 questions pass, the score is 0.8. The score is a direct count, not an interpretation.
The judge's job shrinks from "form a holistic opinion" to "answer a binary question." LLMs are significantly more reliable at the latter.
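To make the mechanics concrete, here's a minimal sketch of the scoring step. The `QAGQuestion` type and the `ask_judge` helper are hypothetical names invented for illustration; `ask_judge` stands in for whatever judge-model call your stack uses.

```python
from dataclasses import dataclass

@dataclass
class QAGQuestion:
    text: str       # a binary, verifiable question about the output
    expected: bool  # the answer an ideal response would receive

def ask_judge(question: str, response: str) -> bool:
    """Hypothetical helper: prompt a judge LLM with a single yes/no
    question about `response` and parse the reply into a boolean."""
    raise NotImplementedError  # wire up your judge model here

def qag_score(response: str, questions: list[QAGQuestion]) -> float:
    """QAG score = fraction of questions answered as expected.
    8 of 10 passing gives 0.8; the score is a count, not an opinion."""
    passed = sum(ask_judge(q.text, response) == q.expected for q in questions)
    return passed / len(questions)
```

Note that the judge never sees a rating scale; every call it receives is a single binary question.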
Why numeric scoring breaks down
- Position bias. When comparing responses, LLMs favor whichever appears first, independent of quality.
- Verbosity preference. Longer responses get rated higher, even when they're padded or off-topic.
- Model family bias. GPT-4 judges score GPT-4 outputs higher. Claude judges score Claude outputs higher. This isn't intentional; it's a reflection of training distribution.
- Fine-grained scale collapse. The difference between a 6 and a 7 is essentially noise. Research shows LLM scoring is only reliably consistent at 3-4 point scales, not the 10-point scales most teams use.
None of these biases appear in the score. You just see a number that drifts in ways that are hard to diagnose.
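Position bias is at least cheap to surface. One common check is a swap test: run the same pairwise comparison in both orders and flag verdicts that flip. A minimal sketch, where `pairwise_judge` is a hypothetical helper for your judge model:

```python
def pairwise_judge(prompt: str, a: str, b: str) -> str:
    """Hypothetical helper: ask a judge LLM which of two responses
    to `prompt` is better; returns "A" or "B"."""
    raise NotImplementedError

def position_consistent(prompt: str, a: str, b: str) -> bool:
    """Swap test: run the comparison in both orders. If the winner
    tracks the slot rather than the response, the verdict is
    position noise, not a quality signal."""
    first = pairwise_judge(prompt, a, b)   # a shown first
    second = pairwise_judge(prompt, b, a)  # b shown first
    # Consistent only if the same underlying response wins both rounds:
    # a wins both ("A" then "B") or b wins both ("B" then "A").
    return (first == "A") == (second == "B")
```

A swap test detects the bias per comparison, but it doesn't remove it; QAG sidesteps it by never asking the judge to compare at all.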
What QAG doesn't solve
QAG requires generating good questions, which is itself an LLM task. A poorly specified question introduces its own ambiguity. "Is the response helpful?" is a bad QAG question — it's just a numeric judgment reframed as binary.
Good QAG questions are specific, verifiable, and grounded in the context or expected output:
- ✓ "Does the response recommend the user contact support before attempting a self-fix?"
- ✓ "Does the response avoid mentioning pricing not found in the context?"
- ✗ "Is the response high quality?"
- ✗ "Does the response seem accurate?"
The investment in writing good questions upfront pays off in stable, auditable scores that are easy to debug when something changes.
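As a concrete instance of those criteria, here's a hypothetical question set for a support-bot faithfulness eval, reusing the `QAGQuestion` type from the earlier sketch. The scenario and wording are illustrative, not taken from a real eval suite:

```python
# Hypothetical question set for a support-bot faithfulness eval.
# Every question is binary and checkable against the source context,
# never a holistic opinion in disguise.
questions = [
    QAGQuestion(
        text="Does the response recommend the user contact support "
             "before attempting a self-fix?",
        expected=True,
    ),
    QAGQuestion(
        text="Does the response mention pricing not found in the context?",
        expected=False,  # citing out-of-context pricing should fail
    ),
    QAGQuestion(
        text="Does the response cite a statistic not found in the context?",
        expected=False,
    ),
]
```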
Auditable by default
The biggest practical advantage of QAG: when a score drops, you know exactly which questions failed. That's a debugging surface that numeric scoring completely lacks.
A faithfulness score dropping from 0.87 to 0.71 tells you something changed. Seeing that the question "Does the response avoid citing sources not in the context?" went from 92% pass rate to 61% tells you what changed.
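That debugging surface falls out of the structure almost for free. A sketch, building on the hypothetical `ask_judge` helper above: aggregate pass rates per question across a batch of responses, so an aggregate drop can be traced to the specific questions that regressed.

```python
from collections import defaultdict

def per_question_pass_rates(
    responses: list[str], questions: list[QAGQuestion]
) -> dict[str, float]:
    """Pass rate per question across a batch of responses. When the
    aggregate QAG score drops, this table shows which questions
    regressed instead of leaving you with a single opaque number."""
    passes: dict[str, int] = defaultdict(int)
    for response in responses:
        for q in questions:
            if ask_judge(q.text, response) == q.expected:
                passes[q.text] += 1
    return {q.text: passes[q.text] / len(responses) for q in questions}
```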
OmniEval, Multivon's evaluation framework, uses QAG scoring throughout its LLM-as-judge evaluators. Join the early access list to be among the first to try it.