Skip to main content
AI/MLruvnet

gaia-debugging

Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix

Stars
56,726
Source
ruvnet/claude-flow
Updated
2026-05-31
Slug
ruvnet--claude-flow--gaia-debugging
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/ruvnet/claude-flow/HEAD/plugins/ruflo-workflows/skills/gaia-debugging/SKILL.md -o .claude/skills/gaia-debugging.md

Drops the SKILL.md into .claude/skills/gaia-debugging.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

GAIA Debugging Skill

When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.

When to use

  • A specific task_id returns the wrong answer or times out
  • Pass-rate dropped between two runs and you need to find the regression
  • You want to understand why a particular question class is consistently failing

Failure mode taxonomy

Code Mode Symptom Fix direction
TG Tool Gap Agent lacks a required tool (no image OCR, no PDF reader) Add tool to catalogue
RM Reasoning Miss Agent has the right data but draws wrong conclusion Improve system prompt, add CoT instruction
EB Extraction Bug Answer is in the trace but FINAL_ANSWER: regex fails Fix answer extraction pattern
LI Loop Issue Agent loops (re-asks same tool call) and hits turn limit Increase max-turns or add loop-detection
DS Dataset Shift Ground truth differs from what web currently shows Flag for HAL dataset audit
AT API Timeout Tool call times out; agent never gets the result Increase per-turn timeout

Diagnostic workflow

Step 1 — Load the question trace

# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
  const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
  const q = r.results.find(x => x.task_id === '$TASK_ID');
  console.log(JSON.stringify(q, null, 2));
"

Step 2 — Classify the failure

Look at the trace output:

  1. No tools called at all → RM or configuration issue
  2. Tool called but returned error → TG or AT
  3. Tool returned data, wrong answer → RM or EB
  4. Correct answer in trace but marked wrong → EB
  5. max-turns hit → LI or question too hard for current model

Step 3 — Re-run with extended logging

node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json

Step 4 — Apply targeted fix

Failure Action
TG — missing web_browse Verify gaia-tools/index.ts exports web_browse; check tool registration
TG — missing image OCR Add image_describe tool call; verify GOOGLE_AI_API_KEY
RM — reasoning Add a system prompt instruction: "Before answering, list all facts you have gathered"
EB — extraction Test the FINAL_ANSWER_RE regex against the trace manually
LI — loop Add a tool-call deduplication guard in gaia-agent.ts
AT — timeout Set DEFAULT_PER_TURN_TIMEOUT_MS higher or use --max-turns flag

Step 5 — Verify fix and store pattern

# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"

Quick reference: tool catalogue check

node -e "
  const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
  const cat = createDefaultToolCatalogue({});
  console.log('Tools registered:', cat.definitions.map(t => t.name));
"

Expected: web_search, file_read, web_browse, image_describe, python_exec

Pattern storage

After resolving a debugging session, store the finding:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value '{"task_id":"$TASK_ID","failure_mode":"$CODE","fix":"$FIX","verified":true}'

Search for similar past failures:

npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"