Engineering · 6 min read · April 20, 2026

Why LLM Evals Fail in Production (And What To Do About It)

Teams spend weeks tuning prompts, get green scores in the playground, then watch things fall apart in production. Here's why it keeps happening.

You ran the evals. They passed. You shipped. Then production broke.

This is the most common story in LLM engineering right now, and it happens for a predictable set of reasons. Understanding them is the first step to building evals that actually catch problems before your users do.

The playground-to-production gap

Playground evals pass because they're tested against inputs you wrote. Production fails because real users are creative, ambiguous, and adversarial in ways you didn't anticipate.

The fix isn't more evals at the end — it's continuous evaluation against a sample of real production traffic. Your test set should be a living document, updated every time you find a new failure mode.
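A minimal sketch of what that loop can look like, assuming production requests are available as a list of dicts with an ISO 8601 timestamp, an input, and a reviewer's label (every name here is illustrative, not a real API):

```python
import json
import random
from datetime import datetime, timedelta, timezone

def sample_production_cases(logs, n=50):
    """Pick a random sample of the past week's production requests to review and label."""
    week_ago = datetime.now(timezone.utc) - timedelta(days=7)
    recent = [r for r in logs if datetime.fromisoformat(r["timestamp"]) >= week_ago]
    return random.sample(recent, min(n, len(recent)))

def append_to_eval_set(labeled_cases, path="eval_set.jsonl"):
    """Append labeled cases so the test set grows with each newly found failure mode."""
    with open(path, "a", encoding="utf-8") as f:
        for case in labeled_cases:
            record = {"input": case["input"], "expected_behavior": case["label"]}
            f.write(json.dumps(record) + "\n")
```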

Optimizing the score, not the outcome

When teams first add LLM-as-judge scoring, they start optimizing for the judge's score. After a few iterations, the model has learned to please the judge — not to produce genuinely better outputs.

This is Goodhart's Law applied to AI: once a measure becomes a target, it ceases to be a good measure. Binary QAG (question-answer generation) questions, like "did the response correctly answer X? yes or no", are harder to game because they're grounded in specific, verifiable claims rather than a holistic impression.

Evaluator bias compounds silently

LLM judges have documented biases: they favor longer responses, prefer outputs from their own model family, and are influenced by the order responses are presented. None of this is flagged — you just get a score.

A judge that consistently rates your model's outputs 10-15% higher than warranted won't be caught by looking at absolute scores. You'll only notice when you compare against a ground truth or switch judges and see the gap.
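One way to make that comparison concrete is to keep a small human-labeled slice and measure how often the judge disagrees with it. A sketch, assuming each case carries a judge_pass and a human_pass boolean (both field names are assumptions about your log format):

```python
def judge_agreement(labeled_cases):
    """Fraction of cases where the LLM judge's verdict matches the human label."""
    matches = sum(1 for c in labeled_cases if c["judge_pass"] == c["human_pass"])
    return matches / len(labeled_cases)

def inflation_rate(labeled_cases):
    """How often the judge passes a case a human failed, a rough proxy for inflated scores."""
    human_failures = [c for c in labeled_cases if not c["human_pass"]]
    if not human_failures:
        return 0.0
    return sum(1 for c in human_failures if c["judge_pass"]) / len(human_failures)
```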

No regression tracking

Most teams check evals before shipping a change. Few check whether the score on unchanged cases has drifted. Model providers update their models silently. Your context gets longer. Subtle prompt changes interact in unexpected ways.

Without tracking eval scores over time, you have no early warning system. A regression that would have been a one-line fix if caught in week one becomes a multi-week investigation in week eight.
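A sketch of that early warning system, assuming each eval run appends one record with a score to a history file (the eval_history.jsonl name and record shape are made up for illustration):

```python
import json

def load_history(path="eval_history.jsonl"):
    """Each line is one eval run, e.g. {"release": "v42", "score": 0.87}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def detect_drift(history, window=3, tolerance=0.02):
    """Flag a regression when the latest score trails the recent average by more than tolerance."""
    if len(history) < window + 1:
        return False
    recent = [run["score"] for run in history[-(window + 1):-1]]
    baseline = sum(recent) / len(recent)
    return history[-1]["score"] < baseline - tolerance
```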

What actually works

Ground your eval set in production data. Export a sample of real inputs weekly. Label the ones that produced bad outputs. Add them to your test set.

Use binary questions, not scores. "Does the response contain a concrete next step?" is easier to evaluate reliably than "Rate this response's helpfulness from 1 to 10." The former is auditable; the latter is vibes with extra steps.
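As a sketch of what a binary check looks like in code, assuming a generic call_llm(prompt) helper that returns the judge model's text (the helper and the parsing are illustrative, not a specific provider's API):

```python
def contains_concrete_next_step(response, call_llm):
    """Ask the judge one verifiable yes/no question instead of a 1-10 rating."""
    prompt = (
        "Answer strictly YES or NO.\n\n"
        f"Response:\n{response}\n\n"
        "Question: Does the response contain a concrete next step the user can take?"
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("YES")
```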

Track deltas, not absolutes. A score of 0.87 is meaningless without context. A score that dropped from 0.91 to 0.87 across the last three releases is a signal worth investigating.

Run evals in CI. The moment evals become a manual step, they become an optional step. A pass threshold in your GitHub Actions workflow means regressions can't ship, full stop.
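One lightweight way to enforce that is a small gate script your workflow runs after the eval suite: it reads the results the harness wrote and exits non-zero below the threshold, which fails the build. The results.json filename and its shape here are assumptions, not a real format.

```python
import json
import sys

PASS_THRESHOLD = 0.90  # illustrative number; pin it to your own baseline

def main():
    with open("results.json", encoding="utf-8") as f:
        results = json.load(f)  # e.g. [true, false, true, ...], one entry per case
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.2%} (threshold {PASS_THRESHOLD:.0%})")
    return 0 if pass_rate >= PASS_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
```

Wire it in as a single run: step; a non-zero exit code is all GitHub Actions needs to block the merge.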


Multivon is building evaluation tooling for teams running LLMs in production. If eval reliability is a problem you're hitting, get in touch.
