Cost Benchmark
Runs scripts/bench.mjs against the structural+adversarial corpus and writes per-case + summary results to docs/benchmarks/runs/. This is the verification gate that backs every measurable claim in cost-booster-edit / cost-booster-route.
When to use
- Before publishing a release — verify booster win rate didn't regress.
- After expanding
bench/booster-corpus.json— confirm new cases route correctly. - When auditing a "claimed upstream" tag — flip it to "verified" once the bench supports it.
- On a cost question ("is Sonnet 4.6 cheaper than Opus 4.7 for these tasks?") — re-run with
BENCH_ANTHROPIC=1.
Steps
Run the bench from
v3/(whereagent-boosterresolves):( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # booster only — free, ~85 ms ( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Gemini 2.0 Flash (cheap) ( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \ node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Sonnet 4.6 + Opus 4.7Inspect the markdown summary printed to stdout. The gate metric is
winRate(Tier 1 cases). Adversarial cases are tracked separately asescalationRate.Persisted output lands at:
docs/benchmarks/runs/latest.json— pointer to the most recent rundocs/benchmarks/runs/<ISO-timestamp>.json— historical record
Read it back in subsequent skills (e.g.
cost-reportstep 2 readslatest.jsonfor live tier-spend numbers).
Smoke gates
winRate ≥ 0.80on Tier 1 cases (smoke step 23). Lower the threshold by editingscripts/smoke.sh.escalationRateis reported but ungated — adversarial cases are diagnostic.
Env overrides
| Env var | Default | Purpose |
|---|---|---|
BENCH_LLM_BASELINE |
unset | =1 runs the OpenAI-compat baseline |
BENCH_LLM_MODEL |
models/gemini-2.0-flash |
Override the OpenAI-compat model |
BENCH_LLM_BASE_URL |
Gemini OpenAI shim | Override endpoint |
BENCH_ANTHROPIC |
unset | =1 runs Anthropic baseline (Sonnet 4.6 + Opus 4.7) |
BENCH_ANTHROPIC_MODELS |
claude-sonnet-4-6,claude-opus-4-7 |
Comma-separated Claude IDs |
BENCH_OUT |
timestamped file | Override output path |
BENCH_QUIET=1 |
unset | Suppress markdown summary |
API keys auto-pulled from gcloud secrets (GOOGLE_AI_API_KEY, ANTHROPIC_API_KEY); override with BENCH_LLM_API_KEY / BENCH_ANTHROPIC_API_KEY.
Cross-references
ADR-0002 §"Decision 1" / §"Riskiest assumption" · cost-booster-edit/SKILL.md (verification table consumes this skill's output) · cost-report/SKILL.md step 2 (reads runs/latest.json).