Research · 7 min read · April 10, 2026

Evaluating Multimodal AI: Text Is Just the Beginning

Most eval frameworks were built for text. But production AI systems generate images, process documents, and understand vision. The tooling hasn't caught up — yet.

LLM evaluation has made real progress on text. We have reasonable metrics for faithfulness, relevance, and coherence. We have benchmark datasets. We have judge models.

Multimodal evaluation is a different story. Teams building image generation pipelines, vision-language systems, or document understanding models are largely flying blind — stitching together metrics from research papers and hoping they correlate with what users actually care about.

The text eval assumption doesn't hold

Text evals can often fall back on string matching, embedding similarity, or asking a language model to compare two passages. These techniques are well-understood and scale cheaply.

For images, none of this applies. There's no meaningful way to "compare" a generated image to a reference by diffing characters. Embedding similarity helps but captures different things depending on the model — CLIP embeddings measure semantic alignment between image and text, not perceptual quality. LPIPS (Learned Perceptual Image Patch Similarity) measures how similar two images look to a neural network trained on human perceptual judgments — better for quality, not alignment.

You need both, and they tell you different things.
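To make the distinction concrete, here is a minimal sketch that computes both signals, assuming the `lpips` and `transformers` packages are installed; the checkpoint names are illustrative defaults, not recommendations from this post.

```python
# Two complementary signals: CLIP for prompt-image alignment,
# LPIPS for perceptual similarity to a reference image.
import torch
import lpips
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance: lower = more similar

def clip_alignment(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and text embeddings (semantic alignment)."""
    inputs = clip_processor(text=[prompt], images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def perceptual_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """LPIPS distance between two (1, 3, H, W) tensors scaled to [-1, 1]."""
    with torch.no_grad():
        return lpips_fn(img_a, img_b).item()
```

A high `clip_alignment` score with a large `perceptual_distance` means the image matches the prompt semantically but looks nothing like your reference; the reverse means a faithful-looking image of the wrong thing.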

What matters for image generation

Three dimensions, none of which fully capture the others:

Perceptual quality — Does the image look good? Is it sharp, coherent, artifact-free? LPIPS and similar perceptual metrics track this. Human MOS (Mean Opinion Score) studies confirm that deep feature similarity correlates better with human judgment than SSIM or PSNR.

Prompt adherence — Does the image actually show what the prompt asked for? CLIP-based metrics score this by measuring alignment between the image embedding and the text embedding. A technically high-quality image of a red car fails if the prompt asked for a blue truck.

Task-specific accuracy — For product image generation, are the correct logos present? For medical imaging, is the anatomical structure correct? These require custom evaluators because no general metric knows your domain.

The honest answer is that no single metric handles all three, and teams that optimize one often sacrifice another.
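One way to keep the dimensions honest is to report them side by side rather than blending them into one number. A hedged sketch, reusing `clip_alignment` from the snippet above; `detect_logos` is a hypothetical, domain-specific detector you would supply yourself:

```python
from dataclasses import dataclass

@dataclass
class ImageEvalResult:
    perceptual_quality: float  # e.g. 1 - LPIPS distance to a reference render
    prompt_adherence: float    # e.g. clip_alignment() from the sketch above
    task_accuracy: float       # domain-specific check in [0, 1]

def logo_accuracy(image, required_logos: set[str]) -> float:
    """Task-specific check: fraction of required logos actually present.
    detect_logos is a hypothetical detector built for your domain."""
    found = set(detect_logos(image))
    return len(found & required_logos) / max(len(required_logos), 1)
```

Keeping the three fields separate is the point: averaging them would hide exactly the trade-offs this section describes.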

Vision-language is harder to evaluate than it looks

VQA (Visual Question Answering) benchmarks like MMBench and MMMU measure broad capability — can the model reason about spatial relationships, identify objects, interpret charts? These are useful for model selection.

They're less useful for evaluating whether your specific system is working. A model that scores well on MMMU might still hallucinate objects in your invoice processing pipeline or misread the legend on your chart type.

The most useful eval is a closed-loop one: generate questions about what a correct response should contain, then check whether the model's output satisfies them. This is QAG (question-answer generation) applied to vision — and it generalizes better than benchmark-chasing.
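A sketch of the closed loop, under loudly labeled assumptions: `vlm_answer` is a hypothetical stand-in for whatever VLM judge you have available, and the `expected` schema is illustrative, not a standard.

```python
def qag_vision_eval(image, expected: dict) -> float:
    """Score an image against yes/no questions derived from what a
    correct output must contain. `vlm_answer(image, question) -> str`
    is a hypothetical call to your vision-language judge."""
    questions = [
        f"Does the image contain {obj}? Answer yes or no."
        for obj in expected["objects"]
    ] + [
        f"Is the following true of the image: {fact}? Answer yes or no."
        for fact in expected.get("facts", [])
    ]
    passed = sum(vlm_answer(image, q).strip().lower().startswith("yes")
                 for q in questions)
    return passed / len(questions)
```

Because the questions come from your task spec rather than a public benchmark, the score moves when your system breaks, not when the benchmark distribution happens to differ from your traffic.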

What "hallucination" means in vision

Text hallucination is well-studied: the model asserts something false. Vision hallucination has a specific flavor: the model describes objects that aren't in the image.

CHAIR (Caption Hallucination Assessment with Image Relevance) measures what fraction of objects mentioned in a caption don't appear in the image. It's a useful signal, but it only catches one failure mode — a model can produce a perfectly accurate object list while completely misreading spatial relationships or quantities.
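The core computation is simple. A per-caption adaptation of the CHAIR instance score, assuming an `extract_objects` step that maps a caption to the object names it mentions (the original paper uses MSCOCO synonym lists for this) and ground-truth object annotations for the image:

```python
def chair_i(caption: str, ground_truth_objects: set[str]) -> float:
    """Fraction of mentioned objects NOT actually in the image.
    extract_objects is a hypothetical caption -> object-set step."""
    mentioned = extract_objects(caption)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - ground_truth_objects
    return len(hallucinated) / len(mentioned)
```

Note what this cannot see: `chair_i("two dogs left of the cat", {"dog", "cat"})` returns 0.0 even if there is one dog and it's on the right.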

Evaluating multimodal systems well means deciding upfront which failure modes matter most for your use case, then building targeted checks for those.

The right frame for multimodal evals

Don't look for a single metric that covers everything. Pick the dimensions that matter for your use case, build specific evaluators for each, and track them independently. A drop in prompt adherence with stable perceptual quality tells you something specific about your pipeline. A drop in both is a different kind of problem.
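Operationally, that means comparing each dimension against its own baseline rather than watching a blended score. A hedged sketch; the tolerance and score values are placeholders:

```python
def flag_regressions(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Return the eval dimensions that dropped more than `tolerance`."""
    return [dim for dim, score in current.items()
            if score < baseline.get(dim, 0.0) - tolerance]

# Example: adherence regressed while quality held steady, which points
# at the prompt/conditioning side of the pipeline, not the image decoder.
regressions = flag_regressions(
    current={"perceptual_quality": 0.91, "prompt_adherence": 0.74,
             "task_accuracy": 0.88},
    baseline={"perceptual_quality": 0.92, "prompt_adherence": 0.85,
              "task_accuracy": 0.89},
)
# -> ["prompt_adherence"]
```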

The tooling for this is early. Most existing eval frameworks were built for text first, multimodal second — and it shows. The abstractions don't quite fit. That's exactly the gap we're building OmniEval to close.


OmniEval is Multivon's multimodal evaluation framework — text, images, vision-language, and beyond. Join the early access list.
