
`mandu ai eval`

Non-interactive prompt evaluator — runs one prompt across one or more providers and prints JSON to stdout. Exit 0 if every provider succeeded; exit 1 if any errored. Ideal for CI diff tests.

Since v0.25

# One provider
mandu ai eval --provider=local --prompt="hello"

# Fan out across providers
mandu ai eval \
  --prompt="Summarize Mandu in one paragraph." \
  --providers=local,claude,openai

# Via stdin
cat prompt.md | mandu ai eval --providers=claude,gemini

# In CI: snapshot compare
mandu ai eval --provider=local --prompt="known input" > out.json
diff out.json expected.json

Output shape

{
  "prompt": "Summarize Mandu in one paragraph.",
  "providers": [
    {
      "provider": "local",
      "model": "echo",
      "ok": true,
      "text": "echo: Summarize Mandu in one paragraph.",
      "duration_ms": 3
    },
    {
      "provider": "claude",
      "model": "claude-sonnet-4-0",
      "ok": true,
      "text": "Mandu is a Bun-native meta-framework...",
      "duration_ms": 1247,
      "usage": { "input_tokens": 12, "output_tokens": 84 }
    }
  ],
  "exit_code": 0
}

When any provider fails, its entry carries `ok: false` and an `error` object:

{
  "provider": "openai",
  "model": "gpt-4o",
  "ok": false,
  "error": {
    "code": "CLI_E301",
    "message": "stream failed: 503 Service Unavailable"
  },
  "duration_ms": 412
}

The root-level `exit_code` is 0 if every provider succeeded, and 1 if any failed.
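The exit-code rule is a simple fold over the `providers` array. As an illustrative sketch (the helper name `derive_exit_code` is mine, not part of the CLI), assuming the JSON shape shown above:

```python
import json

def derive_exit_code(report: dict) -> int:
    """Return 0 only when every provider entry reports ok: true."""
    return 0 if all(p["ok"] for p in report["providers"]) else 1

report = json.loads("""
{
  "providers": [
    {"provider": "local", "model": "echo", "ok": true, "duration_ms": 3},
    {"provider": "openai", "model": "gpt-4o", "ok": false,
     "error": {"code": "CLI_E301", "message": "stream failed: 503 Service Unavailable"},
     "duration_ms": 412}
  ]
}
""")

# Collect the providers that failed, mirroring what you'd do with jq in CI.
failed = [p["provider"] for p in report["providers"] if not p["ok"]]
print(derive_exit_code(report))  # → 1
print(failed)                    # → ['openai']
```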

Flags

| Flag | Default | Notes |
| --- | --- | --- |
| `--provider` | (none) | Single provider (shortcut for `--providers=<one>`) |
| `--providers` | `local` | Comma-separated list: `claude,openai,gemini,local` |
| `--model` | provider-specific | Override the model for every provider in `--providers` |
| `--prompt` | (stdin) | Prompt text. If omitted, read from stdin. |
| `--system` | (none) | Absolute or CWD-relative file path used as the system prompt |
| `--preset` | (none) | Name of a `docs/prompts/<name>.md` file |
| `--timeout` | `60000` (ms) | Per-stream wall-clock budget |
| `--json` | `true` | JSON output (the default) |
| `--help` | off | Prints help without touching the network |

Per-provider model overrides (when you want different models per provider in the same eval):

# Map each provider → specific model
mandu ai eval \
  --prompt="Rank these" \
  --providers=claude:claude-sonnet-4-0,openai:gpt-4o
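The `--providers=a:model,b:model` syntax splits the list on commas, then each entry on an optional `:`. A sketch of equivalent parsing logic (the function name is mine; this mirrors the format above, not the CLI's actual source):

```python
def parse_providers(spec: str) -> dict:
    """Map each provider name to an explicit model, or None for the provider default."""
    mapping = {}
    for entry in spec.split(","):
        # "claude:claude-sonnet-4-0" → ("claude", "claude-sonnet-4-0");
        # "local" (no colon) → ("local", None).
        provider, _, model = entry.strip().partition(":")
        mapping[provider] = model or None
    return mapping

print(parse_providers("claude:claude-sonnet-4-0,openai:gpt-4o"))
# → {'claude': 'claude-sonnet-4-0', 'openai': 'gpt-4o'}
print(parse_providers("local,claude"))
# → {'local': None, 'claude': None}
```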

CI patterns

Smoke test — local only

# .github/workflows/ai-smoke.yml
- run: mandu ai eval --provider=local --prompt="hello" > ai-smoke.json
- run: |
    test "$(jq -r '.providers[0].ok' ai-smoke.json)" = "true"

Determinism check

Because the local provider is deterministic, you can snapshot the output and diff:

- run: mandu ai eval --provider=local --prompt="known input" > actual.json
- run: diff -u expected.json actual.json
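One caveat: `duration_ms` is wall-clock timing and can vary between runs even for the deterministic local provider, so a raw `diff` only passes if the snapshot happens to match on timing. A small normalization step (an illustrative script, not part of the CLI) strips volatile fields before comparing:

```python
# Fields that legitimately change between runs.
VOLATILE = {"duration_ms", "usage"}

def normalize(report: dict) -> dict:
    """Drop per-run fields so snapshots compare cleanly."""
    return {
        **report,
        "providers": [
            {k: v for k, v in p.items() if k not in VOLATILE}
            for p in report["providers"]
        ],
    }

actual = {"prompt": "known input", "exit_code": 0,
          "providers": [{"provider": "local", "model": "echo", "ok": True,
                         "text": "echo: known input", "duration_ms": 3}]}
expected = {"prompt": "known input", "exit_code": 0,
            "providers": [{"provider": "local", "model": "echo", "ok": True,
                           "text": "echo: known input", "duration_ms": 7}]}

print(normalize(actual) == normalize(expected))  # → True
```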

Cross-provider sanity

When you want to confirm a production-critical prompt still works across the three cloud providers:

- run: |
    mandu ai eval \
      --preset=mandu-conventions \
      --prompt="Where does guard.preset=cqrs live?" \
      --providers=claude,openai,gemini > eval.json
- run: |
    jq -e '.exit_code == 0' eval.json
    jq '.providers | map({provider, ok}) | .[]' eval.json

Using presets and system prompts

# A project-local preset
mandu ai eval \
  --preset=mandu-conventions \
  --prompt="Where does the guard preset cqrs live?" \
  --providers=claude

# Arbitrary system prompt file
mandu ai eval \
  --system=./prompts/reviewer.md \
  --prompt="$(cat pr-description.md)" \
  --provider=openai

The `--preset` loader looks for `docs/prompts/<name>.md`. Presets live in your project; Mandu ships a few reference presets you can import as starting points.
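Preset resolution amounts to a path lookup under the project root. A minimal sketch of that lookup (the helper name is mine), assuming the `docs/prompts/<name>.md` convention above:

```python
from pathlib import Path
import tempfile

def preset_path(name: str, root: Path) -> Path:
    """Resolve a preset name to its expected file under docs/prompts/."""
    path = root / "docs" / "prompts" / f"{name}.md"
    if not path.is_file():
        raise FileNotFoundError(f"no preset named {name!r} at {path}")
    return path

# Demo against a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs" / "prompts").mkdir(parents=True)
    (root / "docs" / "prompts" / "mandu-conventions.md").write_text("You are a Mandu expert...")
    resolved = preset_path("mandu-conventions", root)
    print(resolved.name)  # → mandu-conventions.md
```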

Offline eval

The default local provider works with no API key:

mandu ai eval --prompt="hello"
# → uses --provider=local by default

This makes mandu ai eval --provider=local a reliable CI canary — it exercises the same code path as real providers (prompt plumbing, JSON shape, timeout handling) without any external dependency.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Every provider in `--providers` returned `ok: true` |
| 1 | At least one provider returned `ok: false`, or a stream failed |
| 2 | Usage error (unknown provider, malformed `--providers`) |

Common errors

`CLI_E300: API key missing for provider 'claude'` — export `MANDU_CLAUDE_API_KEY`, or use `--providers=local` for offline runs.

`CLI_E307: timeout` — a provider exceeded `--timeout` / `MANDU_AI_TIMEOUT_MS`. The offender is marked with `ok: false`; other providers still run to completion.

JSON output interleaved with stderr logs — stderr is free-form (human-readable). Pipe only stdout to `jq` (`mandu ai eval ... | jq`), or add `2>/dev/null` if you need to suppress stderr entirely.
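The stdout/stderr split is easy to rely on from a script as well. This demo uses a stand-in command (a `python -c` one-liner, not `mandu`, so it runs anywhere) to show that capturing the two streams separately leaves stdout as clean JSON:

```python
import json, subprocess, sys

# Stand-in for `mandu ai eval ...`: emits JSON on stdout and a log line on stderr.
proc = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('log: streaming...\\n'); "
     "print('{\"exit_code\": 0, \"providers\": []}')"],
    capture_output=True, text=True,
)

report = json.loads(proc.stdout)  # stdout parses as JSON, jq-style
print(report["exit_code"])        # → 0
print("log:" in proc.stderr)      # → True
```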

🤖 Agent Prompt — `mandu ai eval`
Apply the guidance from the Mandu docs page at https://mandujs.com/docs/ai/eval to my project.

Summary of the page:
`mandu ai eval` is the batch / non-interactive counterpart to `mandu ai chat`. Accepts `--providers=a,b,c` to fan out, `--prompt=...` or stdin, streams to JSON on stdout. Use with `jq` or snapshot tests in CI. Exit 0 = all providers OK, exit 1 = any provider failed.

Required invariants — must hold after your changes:
- Output is JSON to stdout — `jq` friendly, snapshot-friendly
- Exit 0 if every provider in `--providers` succeeded; exit 1 if any returned an error
- Prompts can be passed via `--prompt` OR piped over stdin
- Each provider call honors `MANDU_AI_TIMEOUT_MS` — slow ones fail the whole eval
- No history — every invocation is a fresh context

Then:
1. Make the change in my codebase consistent with the page.
2. Run `bun run guard` and `bun run check` to verify nothing
   in src/ or app/ breaks Mandu's invariants.
3. Show me the diff and any guard violations.

For Agents

{
  "schema": "mandu.ai.eval/v0.25",
  "command": "mandu ai eval",
  "output": "JSON on stdout",
  "providers_flag": "--providers=<comma-separated list>",
  "prompt_sources": ["--prompt=<text>", "stdin"],
  "history": "none — each invocation is a fresh context",
  "exit_codes": {
    "0": "every provider ok",
    "1": "at least one provider failed",
    "2": "usage error"
  },
  "rules": [
    "Use `--provider=local` for offline CI — deterministic, no API key",
    "Pipe stdout only to `jq` — stderr is free-form log output",
    "Per-provider model overrides: `--providers=a:model,b:model`"
  ]
}

Guard scope: `ai`