Open Source · AI Evaluation

Know if your AI actually works.

Catch regressions before your users do. Evaluate models, agents, and pipelines with deterministic checks, LLM judges, and agent trace analysis — in one SDK.

$ pip install multivon-eval

Start free on GitHub · Talk to us →


Works with: Claude · OpenAI · Gemini · LangChain · LlamaIndex · Any model

OmniEval

One SDK. Every evaluation type your team needs.

Open source · Apache 2.0 · Python 3.10+ · 32 evaluators

Deterministic evaluators

String matching, regex, JSON schema validation, BLEU, ROUGE, latency — instant, free, no LLM calls.
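As a rough illustration of what "deterministic" means here, the sketch below implements three such checks with the Python standard library only. The function names are hypothetical and are not the multivon-eval API, and the JSON check is a simplified stand-in for full schema validation.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass when the output equals the expected string, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass when the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None


def has_json_keys(output: str, required_keys: set[str]) -> bool:
    """Pass when the output parses as a JSON object containing every required key.

    Hypothetical helper: a simplified stand-in for JSON schema validation.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

Each check is a pure function of the output, so the same input always produces the same verdict and no model call is involved.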

LLM-as-judge with QAG scoring

Faithfulness, hallucination, relevance, coherence, and custom rubrics. Binary yes/no questions instead of unreliable 1-10 ratings.

Agent trace evaluation

Tool call accuracy, plan quality, step faithfulness, and task completion. Works with any agent framework.

Conversation evaluation

Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.

View on GitHub → · Enterprise support

eval.py

from multivon_eval import (
    EvalSuite, EvalCase,
    Faithfulness, Relevance,
    ToolCallAccuracy
)

suite = EvalSuite("Prod")
suite.add_cases(load("cases.jsonl"))
suite.add_evaluators(
    Faithfulness(),
    Relevance(),
    ToolCallAccuracy(),
)

report = suite.run(model)

✓ 48/50 passed
faithfulness   0.91
relevance      0.96
tool_accuracy  0.88
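The eval.py snippet calls load("cases.jsonl") without importing or defining it. One plausible reading is a small JSONL reader like the sketch below. This load is a hypothetical helper, not a documented part of the SDK, and whether add_cases accepts plain dicts or needs EvalCase objects is not stated on this page.

```python
import json
from pathlib import Path


def load(path: str) -> list[dict]:
    """Read one JSON object per non-blank line of a JSONL file."""
    cases = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between cases
            try:
                cases.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a bad case is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
    return cases
```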

Why teams choose OmniEval

Built from hard lessons shipping AI to production.

QAG scoring

We generate yes/no questions about the output instead of asking a judge to rate 1-10. Binary answers are more reliable, fully auditable, and cheaper to run.
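A minimal sketch of the scoring idea, assuming the yes/no questions have already been generated: the score is the fraction of questions answered "yes", and the per-question verdicts form the audit trail. All names here (qag_score, Judge, keyword_judge) are illustrative and not the SDK's API. The judge is a toy keyword matcher standing in for an LLM call so the example runs offline.

```python
from typing import Callable

# A judge answers one yes/no question about an output. In practice this would
# wrap an LLM call; here it is any callable with this shape.
Judge = Callable[[str, str], bool]


def qag_score(output: str, questions: list[str], judge: Judge) -> tuple[float, dict[str, bool]]:
    """Return the fraction of questions answered "yes" plus the per-question verdicts."""
    if not questions:
        raise ValueError("at least one question is required")
    verdicts = {question: judge(question, output) for question in questions}
    score = sum(verdicts.values()) / len(verdicts)
    return score, verdicts


def keyword_judge(question: str, output: str) -> bool:
    """Toy stand-in for an LLM judge: "yes" when the quoted keyword appears in the output."""
    keyword = question.split("'")[1]
    return keyword.lower() in output.lower()


score, verdicts = qag_score(
    "Paris is the capital of France.",
    ["Does the answer mention 'Paris'?", "Does the answer mention 'Berlin'?"],
    keyword_judge,
)
```

Because each verdict is tied to a specific question, a low score can be traced to the exact questions that failed rather than to an opaque number.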

Agent-native

Purpose-built evaluators for tool call accuracy, multi-step plan quality, and step faithfulness. Most eval tools were built for text — we were built for agents.
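To make "tool call accuracy" concrete, here is one simple way such a metric could be defined: compare the expected and actual tool calls position by position and penalize missing or extra calls. This is an assumed definition for illustration. The ToolCall class and tool_call_accuracy function are hypothetical, and the SDK's ToolCallAccuracy evaluator may score differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation from an agent trace (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of positions where name and arguments both match.

    Dividing by the longer list penalizes both missing and extra calls.
    """
    if not expected:
        return 1.0 if not actual else 0.0
    matched = sum(1 for want, got in zip(expected, actual) if want == got)
    return matched / max(len(expected), len(actual))


expected_calls = [ToolCall("search", {"q": "weather"}), ToolCall("send_email", {"to": "a"})]
actual_calls = [ToolCall("search", {"q": "weather"}), ToolCall("send_email", {"to": "b"})]
accuracy = tool_call_accuracy(expected_calls, actual_calls)
```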

Mix tiers freely

Run fast deterministic checks first, then LLM judges only where it matters. Pay for compute where it makes the difference, not everywhere.
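The tiering idea can be sketched as a short-circuit: run the cheap checks first and call the expensive judge only when all of them pass. The names below (evaluate_tiered, not_empty, under_500_chars, fake_judge) are hypothetical, and the judge is a stub that records its calls in place of a real LLM request.

```python
from typing import Callable

Check = Callable[[str], bool]


def evaluate_tiered(output: str, cheap_checks: list[Check], expensive_judge: Check) -> dict:
    """Run cheap checks first; call the expensive judge only if all of them pass."""
    for check in cheap_checks:
        if not check(output):
            return {"passed": False, "failed_check": check.__name__, "judge_called": False}
    return {"passed": expensive_judge(output), "failed_check": None, "judge_called": True}


def not_empty(output: str) -> bool:
    return bool(output.strip())


def under_500_chars(output: str) -> bool:
    return len(output) <= 500


judge_calls: list[str] = []


def fake_judge(output: str) -> bool:
    """Stand-in for an LLM judge; records each call so the saving is visible."""
    judge_calls.append(output)
    return True


results = [
    evaluate_tiered(text, [not_empty, under_500_chars], fake_judge)
    for text in ["", "A short answer."]
]
```

In this run the empty output fails the first cheap check, so the judge is called once for two cases.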

CI/CD first

One line blocks deployment when pass rate drops below your threshold. Evals that don't run in CI catch nothing.

Up and running in minutes

No infrastructure to manage. No account required.

Install and define cases

pip install multivon-eval

# cases.jsonl
{"input": "Summarize this", "context": "..."}

Choose your evaluators

suite.add_evaluators(
    NotEmpty(),
    Faithfulness(),
    ToolCallAccuracy(),
)

Run in CI, block regressions

report = suite.run(
    model_fn,
    fail_threshold=0.85,
)  # exits 1 if < 85%
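For readers wondering what the threshold gate amounts to, the sketch below shows the general pattern of turning a pass rate into a process exit code that a CI runner treats as failure. It is an assumption about the behavior described by the comment above, not the SDK's implementation, and pass_rate and exit_code are hypothetical names.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0 so it cannot pass silently."""
    return sum(results) / len(results) if results else 0.0


def exit_code(results: list[bool], fail_threshold: float = 0.85) -> int:
    """Return 1 when the pass rate is below the threshold, else 0."""
    return 1 if pass_rate(results) < fail_threshold else 0


# 48 of 50 cases passing, as in the sample report on this page.
outcomes = [True] * 48 + [False] * 2
code = exit_code(outcomes)

# In a CI script, the last line would typically be:
#     sys.exit(exit_code(outcomes))
```

Any non-zero exit code fails the CI step, which is what blocks the deployment.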

Why We Exist

“Shipping AI without evals is flying blind. We're building the instruments.”

Most teams building with AI have no reliable way to know if their model is getting better or worse across releases. Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. Multivon exists to fix that — with evaluation tooling that is fast to adopt, cheap to run, and honest about what it measures.

Latest insights

Practical thinking on AI evaluation, model quality, and shipping with confidence.


Start evaluating today

Open source and free to use. Need enterprise support, custom evaluators, or a hosted dashboard? Talk to us.

Get started free · hello@multivon.ai →