Agent Evaluation

Run automated evaluation tests against an agent.

Research Foundation

REF-001: BP-9 - Continuous evaluation of agent performance
REF-002: KAMI benchmark methodology for failure archetype detection

Usage

/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose

Arguments

Argument	Required	Description
agent-name	Yes	Agent to evaluate

Options

Option	Default	Description
--category	all	Test category: archetype, performance, quality
--scenario	all	Specific scenario to run
--verbose	false	Show detailed test output
--output	stdout	Output file for results
--strict	false	Fail on any test failure

Test Categories

archetype

Tests for Roig (2025) failure archetypes:

grounding-test - Archetype 1: Premature action
substitution-test - Archetype 2: Over-helpfulness
distractor-test - Archetype 3: Context pollution
recovery-test - Archetype 4: Fragile execution

performance

latency-test - Response time benchmarks
token-test - Token efficiency
parallel-test - Concurrent execution correctness

quality

output-format - Output structure validation
tool-usage - Appropriate tool selection
scope-adherence - Stays within defined scope

Process

Load Agent: Read agent definition
Select Scenarios: Based on --category or --scenario
Setup Environment: Create test workspace
Execute Tests: Run agent against each scenario
Validate Results: Check assertions
Generate Report: Output results

Output Format

{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}

Examples

# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict

Success Criteria

Metric	Target
Grounding (A1)	>90%
Substitution (A2)	>85%
Distractor (A3)	>80%
Recovery (A4)	≥80%
Overall	≥85%

Related Commands

/eval-workflow - Test multi-agent workflows
/eval-report - Generate quality report
aiwg lint agents - Static validation

Evaluate agent: $ARGUMENTS

References

@$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
@$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents