Evaluations (Evals)
v3.8.1 · Last updated: 2026-05-13
Source of truth: , ,
Last updated: 2026-05-13, v3.8.0
, captures latency
and outputs, and persists the run.
and one or more cases. Suites come from two sources:
| at boot | ||
+ |
):
| targeting | |
β sent to |
|
| β scoring rubric (see below) | |
, , |
in :
| field | ||
| ) | ||
and , the field is required (enforced by Zod
). When is provided, both targets must differ;
the runner persists both runs under the same for A/B comparison.
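The "both targets must differ" rule can be sketched as a small guard. This is only an illustration: the `EvalTarget` shape is modeled on the `target` object in the run request example (`type` plus `id`), and `isSameTarget` and `assertDistinctTargets` are hypothetical helper names, not functions from the codebase. The real check is enforced by the Zod schema and may compare targets differently.

```typescript
// Hypothetical target shape, modeled on the `target` object used when
// starting a run. The real schema may carry additional fields.
interface EvalTarget {
  type: string;
  id: string;
}

// Two targets count as identical when both type and id match.
function isSameTarget(a: EvalTarget, b: EvalTarget): boolean {
  return a.type === b.type && a.id === b.id;
}

// Throws when an A/B comparison would pit a target against itself.
// A missing second target means "no comparison" and always passes.
function assertDistinctTargets(primary: EvalTarget, secondary?: EvalTarget): void {
  if (secondary !== undefined && isSameTarget(primary, secondary)) {
    throw new Error("comparison target must differ from the primary target");
  }
}
```

Calling `assertDistinctTargets` with the same `type` and `id` on both sides throws; any other combination returns without error.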
(evalRunner.ts):
| is truthy | |
| returns truthy (built-in only) |
Note: Custom-function scoring is reserved for code-defined (built-in)
suites because functions cannot be serialized through the API. The
only accepts for
user-created suites.
.
and
):
, , |
|
, , |
|
, , |
not stored in the DB. They live in memory and are
re-registered every time is imported.
); they are not part of the public proxy surface.
curl -X POST http://localhost:20128/api/evals \
-H "Cookie: auth_token=..." \
-H "Content-Type: application/json" \
-d '{
"suiteId": "golden-set",
"target": { "type": "combo", "id": "my-combo" },
"apiKeyId": "optional-api-key-uuid"
}'
of pre-computed outputs. When provided,
the runner skips dispatch and only scores the cached outputs (useful for
offline evaluation).
- : second target to run in parallel; both runs share a
generated
for head-to-head viewing.
- : internal API key used to authenticate the dispatched
calls. Required when is enabled.
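For scripted use, the same run request can be assembled in TypeScript. The sketch below mirrors only the fields visible in the curl example (`suiteId`, `target`, `apiKeyId`) and the `auth_token` cookie. `buildRunEvalRequest` and the interface names are hypothetical, and the optional fields described in the list above are left out because their exact names are not shown here.

```typescript
// Hypothetical types mirroring the curl example; not the project's own types.
interface EvalTarget {
  type: string;
  id: string;
}

interface RunEvalBody {
  suiteId: string;
  target: EvalTarget;
  apiKeyId?: string; // optional, as in the curl example
}

// Builds the URL and request options for POST /api/evals without sending anything.
function buildRunEvalRequest(baseUrl: string, authToken: string, body: RunEvalBody) {
  return {
    url: `${baseUrl}/api/evals`,
    init: {
      method: "POST",
      headers: {
        Cookie: `auth_token=${authToken}`,
        "Content-Type": "application/json",
      },
      // JSON.stringify drops keys whose value is undefined, so an unset
      // apiKeyId is simply absent from the payload.
      body: JSON.stringify(body),
    },
  };
}
```

The returned `url` and `init` can be passed to `fetch(url, init)` in a runtime that provides it. Keeping the builder separate from the network call makes the payload easy to inspect in tests.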
curl -X POST http://localhost:20128/api/evals/suites \
-H "Cookie: auth_token=..." \
-H "Content-Type: application/json" \
-d '{
"name": "Production smoke",
"description": "Quick sanity check before deploy",
"cases": [
{
"name": "JSON shape",
"model": "gpt-4o",
"input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] },
"expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" }
}
]
}'
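To see how the `regex` expectation in this suite would behave, here is a minimal scoring sketch. It assumes the pattern is compiled with `new RegExp(value)` with no flags and tested against the raw output text. `scoreRegex` and `RegexExpectation` are hypothetical names, and the actual logic in the runner may differ.

```typescript
// Hypothetical shape of the `expected` object from the suite example.
interface RegexExpectation {
  strategy: "regex";
  value: string;
}

// Returns true when the model output matches the expected pattern.
// Assumption: no regex flags, and a match anywhere in the output passes.
function scoreRegex(output: string, expected: RegexExpectation): boolean {
  return new RegExp(expected.value).test(output);
}

// Same pattern as the JSON payload above, after JSON string unescaping:
//   "ok"\s*:\s*true
const jsonShapeExpectation: RegexExpectation = {
  strategy: "regex",
  value: '"ok"\\s*:\\s*true',
};

const passes = scoreRegex('{"ok": true}', jsonShapeExpectation);
const fails = scoreRegex('{"ok": false}', jsonShapeExpectation);
```

Under these assumptions `passes` is `true` and `fails` is `false`. Because the pattern is unanchored, an output that wraps the JSON in extra prose would still pass.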
():
with the case's
, the resolved , , and
(or the case override).
payload.
sequentially. There is no concurrency flag today.
(). From there you
can:
(see also AUTO-COMBO.md for the live scoring engine). That subsystem targets the Auto Combo engine: it automatically scores providers and models so combos can self-heal when upstreams fail. It uses its own runner, its own categorizer, and its own scoring logic.
broader, general-purpose testing surface. Prefer it for arbitrary regression suites, A/B comparisons, and per-release smoke tests. Use the Auto-Assessment subsystem when you need real-time provider health to influence routing decisions.
npm script today. Two paths if you want to gate releases on eval results:
with a known
+ , and assert in the
response.
- In-process path: import
from
from a script, run against a test DB, and check the
returned .
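Either path ends with a pass/fail decision that a CI job can act on. The sketch below shows one way to make that decision. The `CaseResult` shape (`name`, `passed`) is purely hypothetical, since the runner's real return type is not reproduced here, and `summarizeForGate` is an illustrative name.

```typescript
// Hypothetical per-case result; the runner's real return type may differ.
interface CaseResult {
  name: string;
  passed: boolean;
}

interface GateSummary {
  total: number;
  failed: string[];
  ok: boolean;
}

// Collapses per-case results into a single gate decision.
// An empty result set is treated as a failure so that a misconfigured
// suite cannot silently pass the gate.
function summarizeForGate(results: readonly CaseResult[]): GateSummary {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return {
    total: results.length,
    failed,
    ok: results.length > 0 && failed.length === 0,
  };
}
```

A release script could exit non-zero whenever `ok` is `false` and print `failed` to show which cases regressed.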
and .
block in () and widen in
plus in .
- New built-in suite: define a suite object and call
at
the bottom of . It will be auto-discovered by .
- Run with concurrency: change the sequential
loop in
to a bounded (no concurrency
control exists today).
- Stream/tool-call cases: currently the runner forces
.
Streaming or tool-aware evaluation would require changes in
(capture and aggregate SSE chunks before scoring).
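For the concurrency item above, one possible shape for a bounded replacement of the sequential loop is a small worker-pool helper. This is a generic sketch, not code from the runner: `mapWithConcurrency` is a hypothetical name, and the `worker` callback stands in for whatever dispatches and scores a single case.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next index until none remain.
  // Claiming is safe without locks because JavaScript runs the
  // synchronous `next++` step on a single thread.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  // If any worker rejects, the returned promise rejects with that error.
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With `limit` set to 1 this behaves like the current sequential loop, so the limit can start conservative and be raised once upstream rate limits are understood.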