Workflow Evaluation
Run automated evaluation tests against a multi-agent workflow.
Research Foundation
- REF-001: BP-9 - Continuous evaluation of agent performance
- REF-002: KAMI benchmark methodology for real agentic task evaluation
Usage
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
Arguments
| Argument |
Required |
Description |
| workflow-name |
Yes |
Workflow (flow command) to evaluate |
Options
| Option |
Default |
Description |
| --scenario |
all |
Specific scenario to run |
| --verbose |
false |
Show detailed test output |
| --output |
stdout |
Output file for results |
| --strict |
false |
Fail on any test failure |
| --timeout |
300 |
Maximum seconds per scenario |
What Gets Evaluated
Orchestration Quality
- Agent coordination: Parallel agents launched correctly in single message
- Handoff fidelity: Artifacts pass correctly between phases
- Gate enforcement: Phase gates checked before transition
Archetype Resistance
grounding-test — Archetype 1: Premature action without reading state
distractor-test — Archetype 3: Context pollution from irrelevant artifacts
recovery-test — Archetype 4: Fragile execution when subagent fails
Output Validation
- Required artifacts created in correct
.aiwg/ paths
- Document structure matches templates
- Traceability links intact
Process
- Load Workflow: Read flow command definition
- Select Scenarios: Based on --scenario flag or all applicable
- Setup Workspace: Create isolated
.aiwg/working/ test space
- Execute Flow: Run workflow against each scenario
- Validate Outputs: Check artifact presence, structure, and content
- Generate Report: Output results with pass/fail per assertion
- Cleanup: Remove test workspace
Output Format
{
"workflow": "flow-security-review-cycle",
"timestamp": "2026-04-01T10:30:00Z",
"scenarios": {
"grounding-test": {
"passed": true,
"score": 1.0,
"assertions": [
{"name": "threat-model-created", "passed": true},
{"name": "security-gate-run", "passed": true}
],
"duration_ms": 45000
},
"distractor-test": {
"passed": false,
"score": 0.7,
"assertions": [
{"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
],
"duration_ms": 38000
}
},
"summary": {
"passed": 4,
"failed": 1,
"total": 5,
"score": 0.80
}
}
Examples
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle
# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose
# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
Success Criteria
| Metric |
Target |
| Artifact creation |
100% |
| Grounding compliance |
>90% |
| Distractor resistance |
>80% |
| Recovery success |
≥80% |
| Overall |
≥85% |
Related Commands
/eval-agent - Test individual agents
/eval-report - Generate aggregate quality report
aiwg lint agents - Static validation
Evaluate workflow: $ARGUMENTS
References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands