Insights

Practical thinking on AI evaluation, model quality, and shipping with confidence.

Engineering · 6 min read · April 20, 2026

Why LLM Evals Fail in Production (And What To Do About It)

Teams spend weeks tuning prompts, get green scores in the playground, then watch things fall apart in production. Here's why it keeps happening.

Technical · 5 min read · April 15, 2026

QAG vs LLM-as-Judge: Why We Score With Questions, Not Numbers

Asking a model to rate an output from 1 to 10 introduces its own hallucination risk. There's a more reliable way: generate yes/no questions about the output and score by the fraction answered correctly.
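
As a rough illustration of the scoring idea, here is a minimal sketch in Python. It assumes the yes/no questions and their expected answers already exist, and that some function answers each question against the model output. The names `qag_score` and `answer_fn` are hypothetical and not taken from any particular library; in practice `answer_fn` would probably wrap a model call, but a plain lookup stands in for it here.

```python
from typing import Callable, Sequence


def qag_score(
    questions: Sequence[str],
    expected: Sequence[bool],
    answer_fn: Callable[[str], bool],
) -> float:
    """Return the fraction of yes/no questions answered as expected.

    answer_fn is a hypothetical stand-in for whatever answers each
    question about the output being evaluated.
    """
    if len(questions) != len(expected):
        raise ValueError("questions and expected must have the same length")
    if not questions:
        raise ValueError("at least one question is required")
    # Count the questions whose answer matches the expected yes/no value.
    correct = sum(answer_fn(q) == want for q, want in zip(questions, expected))
    return correct / len(questions)


# Illustrative data only: a stubbed answerer backed by a dictionary.
stub_answers = {
    "Does the summary mention the refund policy?": True,
    "Does the summary state the correct deadline?": True,
    "Does the summary name the right product?": True,
    "Does the summary avoid unsupported claims?": False,
}
questions = list(stub_answers)
expected = [True, True, True, True]

score = qag_score(questions, expected, lambda q: stub_answers[q])
print(score)
```

Because the score is a ratio of concrete yes/no checks, each point lost traces back to a specific question, which a single 1 to 10 rating does not provide.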

Research · 7 min read · April 10, 2026

Evaluating Multimodal AI: Text Is Just the Beginning

Most eval frameworks were built for text. But production AI systems generate images, process documents, and interpret visual input. The tooling hasn't caught up yet.