`mandu ai eval`
Non-interactive prompt evaluator — runs one prompt across one or more providers and prints JSON to stdout. Exit 0 if every provider succeeded; exit 1 if any errored. Ideal for CI diff tests and cross-provider comparison.
# One provider
mandu ai eval --provider=local --prompt="hello"
# Fan out across providers
mandu ai eval \
  --prompt="Summarize Mandu in one paragraph." \
  --providers=local,claude,openai
# Via stdin
cat prompt.md | mandu ai eval --providers=claude,gemini
# In CI: snapshot compare
mandu ai eval --provider=local --prompt="known input" > out.json
diff out.json expected.json
Output shape
{
  "prompt": "Summarize Mandu in one paragraph.",
  "providers": [
    {
      "provider": "local",
      "model": "echo",
      "ok": true,
      "text": "echo: Summarize Mandu in one paragraph.",
      "duration_ms": 3
    },
    {
      "provider": "claude",
      "model": "claude-sonnet-4-0",
      "ok": true,
      "text": "Mandu is a Bun-native meta-framework...",
      "duration_ms": 1247,
      "usage": { "input_tokens": 12, "output_tokens": 84 }
    }
  ],
  "exit_code": 0
}
When any provider fails, its entry carries `ok: false` and an `error` object:
{
  "provider": "openai",
  "model": "gpt-4o",
  "ok": false,
  "error": {
    "code": "CLI_E301",
    "message": "stream failed: 503 Service Unavailable"
  },
  "duration_ms": 412
}
The root-level `exit_code` is `0` if every provider succeeded, `1` if any failed.
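Because the process exit code mirrors `exit_code` in the JSON, a CI step can gate on either. A minimal sketch (the provider list and prompt are illustrative):
# Gate on the process exit code; on failure, print the failing entries
if ! mandu ai eval --providers=claude,openai --prompt="ping" > eval.json; then
  jq '.providers[] | select(.ok == false) | {provider, error}' eval.json >&2
  exit 1
fi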
Flags
| Flag | Default | Notes |
|---|---|---|
| `--provider` | — | Single provider (shortcut for `--providers=<one>`) |
| `--providers` | `local` | Comma-separated list: `claude,openai,gemini,local` |
| `--model` | provider-specific | Override model for every provider in `--providers` |
| `--prompt` | — | Prompt text. If omitted, read from stdin. |
| `--system` | — | Absolute or CWD-relative file path as the system prompt |
| `--preset` | — | Name of a `docs/prompts/<name>.md` file |
| `--timeout` | `60000` ms | Per-stream wall-clock budget |
| `--json` | `true` | JSON output (the default) |
| `--help` | off | Prints help without touching the network |
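A quick sketch combining flags — pinning one provider's model and tightening the budget to 10 s (values are illustrative):
# One provider, explicit model, 10-second per-stream budget
mandu ai eval \
  --provider=claude \
  --model=claude-sonnet-4-0 \
  --timeout=10000 \
  --prompt="ping"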
Per-provider model overrides (when you want different models per provider in the same eval):
# Map each provider → specific model
mandu ai eval \
  --prompt="Rank these" \
  --providers=claude:claude-sonnet-4-0,openai:gpt-4o
CI patterns
Smoke test — local only
# .github/workflows/ai-smoke.yml
- run: mandu ai eval --provider=local --prompt="hello" > ai-smoke.json
- run: |
    test "$(jq -r '.providers[0].ok' ai-smoke.json)" = "true"
Determinism check
Because the local provider is deterministic, you can snapshot the
output and diff:
- run: mandu ai eval --provider=local --prompt="known input" > actual.json
- run: diff -u expected.json actual.json
Cross-provider sanity
When you want to confirm a production-critical prompt still works across the three cloud providers:
- run: |
    mandu ai eval \
      --preset=mandu-conventions \
      --prompt="Where does guard.preset=cqrs live?" \
      --providers=claude,openai,gemini > eval.json
- run: |
    jq -e '.exit_code == 0' eval.json
    jq '.providers | map({provider, ok}) | .[]' eval.json
Using presets and system prompts
# A project-local preset
mandu ai eval \
  --preset=mandu-conventions \
  --prompt="Where does the guard preset cqrs live?" \
  --providers=claude
# Arbitrary system prompt file
mandu ai eval \
  --system=./prompts/reviewer.md \
  --prompt="$(cat pr-description.md)" \
  --provider=openai
The `--preset` loader looks for `docs/prompts/<name>.md`. Presets live in your project; Mandu ships a few reference presets you can import as starting points.
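As a sketch, a project-local preset is just a markdown file at the expected path; the preset name, contents, and prompt below are hypothetical:
# Hypothetical preset: any markdown file under docs/prompts/ works
mkdir -p docs/prompts
cat > docs/prompts/reviewer.md <<'EOF'
You are a strict code reviewer for this project.
Answer with file paths and line references where possible.
EOF
mandu ai eval --preset=reviewer --prompt="Review src/guard.ts" --provider=local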
Offline eval
The default local provider works with no API key:
mandu ai eval --prompt="hello"
# → uses --provider=local by default
This makes mandu ai eval --provider=local a reliable CI canary —
it exercises the same code path as real providers (prompt plumbing,
JSON shape, timeout handling) without any external dependency.
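One pattern that follows from this: run the local canary before the cloud fan-out, so pipeline breakage is distinguishable from provider flakiness. A sketch (the prompts are illustrative):
# Pre-flight: if even the local provider fails, the eval pipeline itself
# is broken, so skip the (paid) cloud fan-out entirely
mandu ai eval --provider=local --prompt="canary" > /dev/null || exit 1
mandu ai eval --providers=claude,openai,gemini --prompt="real prompt" > eval.json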
Exit codes
| Code | Meaning |
|---|---|
| `0` | Every provider in `--providers` returned `ok: true` |
| `1` | At least one provider returned `ok: false`, or a stream failed |
| `2` | Usage error (unknown provider, malformed `--providers`) |
Common errors
- `CLI_E300: API key missing for provider 'claude'` — export `MANDU_CLAUDE_API_KEY`, or use `--providers=local` for offline runs.
- `CLI_E307: timeout` — a provider exceeded `--timeout` / `MANDU_AI_TIMEOUT_MS`. The offender is marked `ok: false`; other providers still run to completion.
- JSON output interleaved with stderr logs — stderr is free-form (human-readable). Pipe only stdout to `jq` (`mandu ai eval ... | jq`), or add `2>/dev/null` if you need to suppress stderr entirely.
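For example (redirections are illustrative; adjust to your shell):
# stdout (JSON) goes to jq; stderr (logs) goes to a file for debugging
mandu ai eval --provider=local --prompt="hello" 2> eval.log | jq '.exit_code'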
🤖 Agent Prompt
Apply the guidance from the Mandu docs page at https://mandujs.com/docs/ai/eval to my project.
Summary of the page:
`mandu ai eval` is the batch / non-interactive counterpart to `mandu ai chat`. Accepts `--providers=a,b,c` to fan out, `--prompt=...` or stdin, streams to JSON on stdout. Use with `jq` or snapshot tests in CI. Exit 0 = all providers OK, exit 1 = any provider failed.
Required invariants — must hold after your changes:
- Output is JSON to stdout — `jq` friendly, snapshot-friendly
- Exit 0 if every provider in `--providers` succeeded; exit 1 if any returned an error
- Prompts can be passed via `--prompt` OR piped over stdin
- Each provider call honors `MANDU_AI_TIMEOUT_MS` — slow ones fail the whole eval
- No history — every invocation is a fresh context
Then:
1. Make the change in my codebase consistent with the page.
2. Run `bun run guard` and `bun run check` to verify nothing
in src/ or app/ breaks Mandu's invariants.
3. Show me the diff and any guard violations.
Related
- AI — Chat — interactive REPL counterpart.
- AI — Prompts — prompt template system.
- AI — MCP tools — the tool surface exposed to agents.
For Agents
{
  "schema": "mandu.ai.eval/v0.25",
  "command": "mandu ai eval",
  "output": "JSON on stdout",
  "providers_flag": "--providers=<comma-separated list>",
  "prompt_sources": ["--prompt=<text>", "stdin"],
  "history": "none — each invocation is a fresh context",
  "exit_codes": {
    "0": "every provider ok",
    "1": "at least one provider failed",
    "2": "usage error"
  },
  "rules": [
    "Use `--provider=local` for offline CI — deterministic, no API key",
    "Pipe stdout only to `jq` — stderr is free-form log output",
    "Per-provider model overrides: `--providers=a:model,b:model`"
  ]
}