GAIA Debugging Skill
When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.
When to use
- A specific
task_idreturns the wrong answer or times out - Pass-rate dropped between two runs and you need to find the regression
- You want to understand why a particular question class is consistently failing
Failure mode taxonomy
| Code | Mode | Symptom | Fix direction |
|---|---|---|---|
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but FINAL_ANSWER: regex fails |
Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (re-asks same tool call) and hits turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
Diagnostic workflow
Step 1 — Load the question trace
# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
const q = r.results.find(x => x.task_id === '$TASK_ID');
console.log(JSON.stringify(q, null, 2));
"
Step 2 — Classify the failure
Look at the trace output:
- No tools called at all → RM or configuration issue
- Tool called but returned error → TG or AT
- Tool returned data, wrong answer → RM or EB
- Correct answer in trace but marked wrong → EB
- max-turns hit → LI or question too hard for current model
Step 3 — Re-run with extended logging
node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
--level 1 --limit 1 \
--task-id $TASK_ID \
--models claude-sonnet-4-6 \
--max-turns 20 \
--output json
Step 4 — Apply targeted fix
| Failure | Action |
|---|---|
| TG — missing web_browse | Verify gaia-tools/index.ts exports web_browse; check tool registration |
| TG — missing image OCR | Add image_describe tool call; verify GOOGLE_AI_API_KEY |
| RM — reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB — extraction | Test the FINAL_ANSWER_RE regex against the trace manually |
| LI — loop | Add a tool-call deduplication guard in gaia-agent.ts |
| AT — timeout | Set DEFAULT_PER_TURN_TIMEOUT_MS higher or use --max-turns flag |
Step 5 — Verify fix and store pattern
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json
# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
--value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"
Quick reference: tool catalogue check
node -e "
const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
const cat = createDefaultToolCatalogue({});
console.log('Tools registered:', cat.definitions.map(t => t.name));
"
Expected: web_search, file_read, web_browse, image_describe, python_exec
Pattern storage
After resolving a debugging session, store the finding:
npx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "session-$(date +%Y%m%d-%H%M)" \
--value '{"task_id":"$TASK_ID","failure_mode":"$CODE","fix":"$FIX","verified":true}'
Search for similar past failures:
npx @claude-flow/cli@latest memory search \
--namespace gaia-debug-patterns \
--query "extraction bug final answer regex"