Skip to main content
AI/MLjmagly

eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

Stars
141
Source
jmagly/aiwg
Updated
2026-05-31
Slug
jmagly--aiwg--eval-agent
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/jmagly/aiwg/HEAD/agentic/code/addons/aiwg-evals/skills/eval-agent/SKILL.md -o .claude/skills/eval-agent.md

Drops the SKILL.md into .claude/skills/eval-agent.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Agent Evaluation

Run automated evaluation tests against an agent.

Research Foundation

  • REF-001: BP-9 - Continuous evaluation of agent performance
  • REF-002: KAMI benchmark methodology for failure archetype detection

Usage

/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose

Arguments

Argument Required Description
agent-name Yes Agent to evaluate

Options

Option Default Description
--category all Test category: archetype, performance, quality
--scenario all Specific scenario to run
--verbose false Show detailed test output
--output stdout Output file for results
--strict false Fail on any test failure

Test Categories

archetype

Tests for Roig (2025) failure archetypes:

  • grounding-test - Archetype 1: Premature action
  • substitution-test - Archetype 2: Over-helpfulness
  • distractor-test - Archetype 3: Context pollution
  • recovery-test - Archetype 4: Fragile execution

performance

  • latency-test - Response time benchmarks
  • token-test - Token efficiency
  • parallel-test - Concurrent execution correctness

quality

  • output-format - Output structure validation
  • tool-usage - Appropriate tool selection
  • scope-adherence - Stays within defined scope

Process

  1. Load Agent: Read agent definition
  2. Select Scenarios: Based on --category or --scenario
  3. Setup Environment: Create test workspace
  4. Execute Tests: Run agent against each scenario
  5. Validate Results: Check assertions
  6. Generate Report: Output results

Output Format

{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}

Examples

# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict

Success Criteria

Metric Target
Grounding (A1) >90%
Substitution (A2) >85%
Distractor (A3) >80%
Recovery (A4) ≥80%
Overall ≥85%

Related Commands

  • /eval-workflow - Test multi-agent workflows
  • /eval-report - Generate quality report
  • aiwg lint agents - Static validation

Evaluate agent: $ARGUMENTS

References

  • @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
  • @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
  • @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents