Run Judges Skill
Purpose
Execute specialized judge agents in parallel to evaluate implementation plan quality (16 judges, 4 batches), code quality (11 judges, 3 batches), PRD quality (5 judges, 2 batches), or Feature quality (3 judges, 1 batch). All batches respect the Task tool's 4-concurrent-agent limit. Aggregates results into $CLOSEDLOOP_WORKDIR/plan-judges.json (plan), $CLOSEDLOOP_WORKDIR/code-judges.json (code), $CLOSEDLOOP_WORKDIR/prd-judges.json (prd), or $CLOSEDLOOP_WORKDIR/feature-judges.json (feature) with validated output format.
Parameters
--workdir: Path to the working directory containing judge artifacts (optional)
- Resolved in order: --workdir argument → $CLOSEDLOOP_WORKDIR environment variable → .closedloop-ai/judges (default, relative to current working directory)
- The directory is created automatically if it does not exist
- All output files (plan-judges.json, code-judges.json, prd-judges.json, feature-judges.json, judge-input.json, perf.jsonl, etc.) are written to this resolved directory
--artifact-type: Artifact category to evaluate (plan | code | prd | feature), default: plan
- plan (default): Evaluate implementation plan with 16 judges, 4 batches, output to plan-judges.json
- code: Evaluate implemented code with 11 judges, 3 batches, output to code-judges.json
- prd: Evaluate PRD document with 5 judges across 2 sequential batches (3 + 2, max 4 concurrent per batch), output to prd-judges.json
- feature: Evaluate Feature artifact with 3 judges, 1 batch, output to feature-judges.json
Judge Input Contract (judge-input.json)
The judge input contract is maintained in:
skills/run-judges/references/judge-input-contract.md (resolve to an absolute path at runtime via Glob)
This keeps orchestration flow readable while preserving a single source of truth for contract fields and semantics.
run-judges is the producer chokepoint for judge-input.json. After mode-specific context preparation and before launching any judge agent, invoke the deterministic mapper:
uv run "${CLAUDE_PLUGIN_ROOT}/skills/run-judges/scripts/judge_input_mapping.py" \
  --workdir "$CLOSEDLOOP_WORKDIR" \
  --artifact-type "$ARTIFACT_TYPE" \
  --schema "${CLAUDE_PLUGIN_ROOT}/schemas/judge-input.schema.json"
The mapper builds from the runtime workdir contract: primary artifacts under <runDir>, supporting context under <runDir>/.closedloop-ai/context, and attachments under <runDir>/.closedloop-ai/work/attachments. It validates the generated envelope against schemas/judge-input.schema.json before judge launch. If mapping fails, emit a clear warning and use the documented one-run legacy fallback paths (prd.md, plan.md, or existing compatibility artifacts) only for that run.
Task Context
You are orchestrating quality evaluation for a ClosedLoop artifact (implementation plan, code, PRD, or Feature). Your responsibilities:
For plan artifacts (default):
- Launch context-manager-for-judges agent to prepare compressed plan context
- Build judge-input.json with plan task/context mapping
- Launch all 16 judge agents in parallel batches
- Aggregate their CaseScore outputs into a valid EvaluationReport
- Write the report to $CLOSEDLOOP_WORKDIR/plan-judges.json
- Validate output structure and completeness
For code artifacts (--artifact-type code):
- Launch context-manager-for-judges agent to prepare compressed context
- Build judge-input.json with code task/context mapping
- Launch 11 judge agents in parallel batches
- Aggregate their CaseScore outputs into a valid EvaluationReport
- Write the report to $CLOSEDLOOP_WORKDIR/code-judges.json
- Validate output structure and completeness
For PRD artifacts (--artifact-type prd):
- Check $CLOSEDLOOP_WORKDIR/prd.md exists (graceful exit if missing)
- Build and schema-validate judge-input.json by invoking scripts/judge_input_mapping.py
- Launch the 5 PRD judges in 2 sequential batches (3 + 2, max 4 concurrent per batch)
- Aggregate all 5 CaseScores into a valid EvaluationReport
- Write the report to $CLOSEDLOOP_WORKDIR/prd-judges.json
- Validate output structure and completeness
For Feature artifacts (--artifact-type feature):
- Check $CLOSEDLOOP_WORKDIR/feature.md exists, or $CLOSEDLOOP_WORKDIR/prd.md exists for legacy Feature inputs (graceful exit code 0 if both are missing)
- Build and schema-validate judge-input.json by invoking scripts/judge_input_mapping.py
- Launch 3 judges in 1 batch (feature-completeness-judge + prd-testability-judge + prd-dependency-judge)
- Aggregate 3 CaseScores into a valid EvaluationReport
- Write the report to $CLOSEDLOOP_WORKDIR/feature-judges.json
- Validate output structure and completeness
Feature mode judge selection rationale:
- prd-auditor is excluded because it assumes US-###/AC-#.# numbering and multi-story traceability, which Feature artifacts do not follow
- prd-scope-judge is excluded because it assumes In/Out-of-Scope sections that are not present in Feature artifacts
Feature mode preamble: Feature mode uses the dedicated feature_preamble.md so judges receive a Feature-shaped contract (evaluation_type=feature, lightweight structure, no PRD-only sections). Do NOT substitute prd_preamble.md — it would frame the input as a full PRD and contradict the envelope's evaluation_type.
Success criteria:
- All judges executed (or error CaseScores generated for failures)
- Valid JSON written to appropriate output file
- Validation script passes with zero errors
Threshold Overrides
The run-judges skill supports per-artifact-type threshold customization via JSON configuration files. This allows you to adjust evaluation strictness for different artifact types (e.g., applying a lower threshold for test-judge when evaluating code vs plan).
Configuration Schema
Threshold overrides are defined in a JSON file with the following structure:
{
  "overrides": {
    "artifact_type:judge_name": <threshold_float>
  }
}
Where:
- Key format: "artifact_type:judge_name" (e.g., "code:test-judge", "plan:technical-accuracy-judge")
- Value: Threshold as a float in range [0.0, 1.0]
Example configuration:
{
  "overrides": {
    "code:test-judge": 0.75,
    "plan:technical-accuracy-judge": 0.85
  }
}
Loading Precedence
The skill checks the following locations in order, using the first valid configuration found:
Run-specific overrides (highest precedence):
- Path: $CLOSEDLOOP_WORKDIR/.closedloop-ai/settings/threshold-overrides.json
- Use case: Override thresholds for a specific ClosedLoop run
Repo-level defaults (fallback):
- Path: <project-root>/.closedloop-ai/settings/threshold-overrides.json
- Use case: Set project-wide threshold defaults
Hardcoded defaults (graceful degradation):
- If no configuration file exists at any location, use built-in defaults
- No error is raised for missing configuration files
Default Overrides
The following default overrides apply when evaluating code artifacts:
| Judge | Code Threshold | Plan Threshold | Rationale |
|---|---|---|---|
| test-judge | 0.75 | 0.8 | Code may have tests written separately from implementation; the lower threshold accounts for incremental test development |
All other judges use the same threshold (typically 0.8) across artifact types.
Validation and Error Handling
When loading threshold overrides, the skill applies the following validation rules:
Schema Validation:
- Configuration must contain an "overrides" key
- Each key must match the pattern artifact_type:judge_name
- Each value must be a float in range [0.0, 1.0]
- Keys must reference valid artifact types (plan, code, prd) and judge names
Error Behavior:
- Malformed JSON: Log warning and continue with hardcoded defaults
  Warning: Invalid threshold-overrides.json, skipping overrides: {error}
- Invalid schema: Log warning and continue with hardcoded defaults
- File not found: Silently use defaults (no warning logged)
Error recovery ensures the skill always completes judge execution, even if threshold configuration is incorrect.
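The loading precedence and validation rules above can be sketched in Python. This is a minimal illustration, not the skill's actual loader: the function names, the judge-name character set in KEY_PATTERN, and the choice to fall through to the repo-level file when the run-specific file is invalid are all assumptions. It also does not check judge names against a registry.

```python
import json
import re
from pathlib import Path
from typing import Dict, Optional

# Relative location shared by the run-specific and repo-level files.
OVERRIDES_RELPATH = Path(".closedloop-ai") / "settings" / "threshold-overrides.json"
# Assumed key shape; the allowed judge-name characters are a guess.
KEY_PATTERN = re.compile(r"^(plan|code|prd):[a-z0-9-]+$")


def parse_overrides(path: Path) -> Optional[Dict[str, float]]:
    """Return validated overrides, or None when the file is missing or invalid."""
    if not path.is_file():
        return None  # File not found: stay silent, no warning.
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as error:
        print(f"Warning: Invalid threshold-overrides.json, skipping overrides: {error}")
        return None
    overrides = data.get("overrides") if isinstance(data, dict) else None
    if not isinstance(overrides, dict):
        print('Warning: Invalid threshold-overrides.json, skipping overrides: no "overrides" object')
        return None
    for key, value in overrides.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not KEY_PATTERN.match(key) or not is_number or not 0.0 <= value <= 1.0:
            print(f"Warning: Invalid threshold-overrides.json, skipping overrides: bad entry {key!r}")
            return None
    return {key: float(value) for key, value in overrides.items()}


def load_threshold_overrides(workdir: Path, project_root: Path) -> Dict[str, float]:
    """Use the first valid file: run-specific, then repo-level, else no overrides."""
    for base in (workdir, project_root):
        overrides = parse_overrides(base / OVERRIDES_RELPATH)
        if overrides is not None:
            return overrides
    return {}  # Caller falls back to hardcoded defaults.
```

Returning an empty dict rather than raising keeps the caller on the hardcoded defaults, which matches the requirement that a bad configuration never blocks judge execution.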
Integration with Judge Execution
When executing judges:
- Before launching judge batches: Load threshold overrides from the precedence chain
- Merge with defaults: Loaded overrides take precedence over hardcoded defaults
- Apply per-judge: Each judge receives its artifact-type-specific threshold via the evaluation context
- CaseScore validation: Thresholds are used to determine final_status (pass/fail) based on metric scores
When artifact type is code:
- Load threshold overrides before executing judge batches
- Apply code-specific thresholds to each judge's evaluation criteria
- Merge loaded overrides with defaults (loaded values take precedence)
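As a rough sketch of the merge-and-apply step, the snippet below resolves one judge's threshold and turns a metric score into a pass/fail status. The names are hypothetical, the 0.8 default reflects the "typically 0.8" figure from the defaults table, and the per-metric comparison is an assumption about how final_status is derived.

```python
from typing import Dict

DEFAULT_THRESHOLD = 0.8  # "typically 0.8" per the defaults table
BUILTIN_OVERRIDES = {"code:test-judge": 0.75}  # the documented default override
PASS, FAIL = 1, 2  # status codes used in CaseScore.final_status


def resolve_threshold(artifact_type: str, judge_name: str, loaded: Dict[str, float]) -> float:
    """Loaded overrides win over built-in overrides, which win over the default."""
    merged = {**BUILTIN_OVERRIDES, **loaded}
    return merged.get(f"{artifact_type}:{judge_name}", DEFAULT_THRESHOLD)


def status_for(score: float, threshold: float) -> int:
    """Pass when the score meets or exceeds the threshold, otherwise fail."""
    return PASS if score >= threshold else FAIL
```

For example, `resolve_threshold("code", "test-judge", {})` yields 0.75, while the same judge under the plan artifact type falls back to 0.8.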
Performance Instrumentation (Mandatory)
You MUST emit a pipeline_step event to $CLOSEDLOOP_WORKDIR/perf.jsonl at the end of each phase below. This keeps perf telemetry in the canonical schema and adds nested metadata for judge/sub-agent work.
Context: CLOSEDLOOP_WORKDIR, CLOSEDLOOP_RUN_ID, and CLOSEDLOOP_ITERATION are set by the run-loop. CLOSEDLOOP_PARENT_STEP and CLOSEDLOOP_PARENT_STEP_NAME are set as env vars on the claude invocation by run-loop; they are inherited by all Bash tool calls — no sourcing needed.
Use sub_step as numeric phase order and optional sub_step_name to capture the judge/sub-agent name when applicable (for batch-level phases where many judges run, use the batch label).
Sub-step numbering:
| Artifact | sub_step | sub_step_name |
|---|---|---|
| plan | 0 | context_manager |
| plan | 1–4 | batch_1 … batch_4 |
| plan | 5 | aggregate |
| plan | 6 | validate |
| code | 0 | context_manager |
| code | 1–3 | batch_1 … batch_3 |
| code | 4 | aggregate |
| code | 5 | validate |
| prd | 0 | context_prep (skipped — prd mode does not use context-manager-for-judges) |
| prd | 1–2 | batch_1, batch_2 |
| prd | 3 | aggregate |
| prd | 4 | validate |
| feature | 0 | context_prep (skipped — feature mode does not use context-manager-for-judges) |
| feature | 1 | batch_1 |
| feature | 2 | aggregate |
| feature | 3 | validate |
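The numbering in the table follows one pattern: a context phase at 0, then the batches, then aggregate and validate. A small sketch that derives the same sequence per artifact type may help when checking emitted events; the function and constant names are hypothetical.

```python
from typing import List, Tuple

BATCH_COUNTS = {"plan": 4, "code": 3, "prd": 2, "feature": 1}
# Only plan and code run context-manager-for-judges; prd and feature emit a
# skipped context_prep event at sub_step 0 instead.
USES_CONTEXT_MANAGER = {"plan", "code"}


def sub_steps(artifact_type: str) -> List[Tuple[int, str]]:
    """Return (sub_step, sub_step_name) pairs in execution order."""
    first = "context_manager" if artifact_type in USES_CONTEXT_MANAGER else "context_prep"
    batches = [f"batch_{i}" for i in range(1, BATCH_COUNTS[artifact_type] + 1)]
    names = [first, *batches, "aggregate", "validate"]
    return list(enumerate(names))
```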
Start of phase (run Bash once at the beginning of each phase): Set the two sub-step variables at the top for the current phase, then run the block. It writes start time to a temp file so the end-of-phase Bash can compute duration. CLOSEDLOOP_PARENT_STEP and CLOSEDLOOP_PARENT_STEP_NAME are already in the environment (set by run-loop on the claude invocation).
# Set these two values for the current phase:
SUB_STEP_NUM=0
SUB_STEP_LABEL="context_manager" # context_manager | batch_1 … | aggregate | validate
mkdir -p "$CLOSEDLOOP_WORKDIR/.closedloop-ai"
{
  echo "SUB_STEP=${SUB_STEP_NUM}"
  echo "SUB_STEP_NAME=${SUB_STEP_LABEL}"
  echo "PARENT_STEP=${CLOSEDLOOP_PARENT_STEP:-0}"
  echo "PARENT_STEP_NAME=${CLOSEDLOOP_PARENT_STEP_NAME:-unknown}"
  echo "STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "START_EPOCH=$(date +%s)"
} > "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
End of phase (run Bash once at the end of each phase, after the phase work is done): Read start time, compute duration, append one line to perf.jsonl, then remove the temp file.
source "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
END_EPOCH=$(date +%s)
ENDED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
DURATION=$((END_EPOCH - START_EPOCH))
jq -n -c \
  --arg event "pipeline_step" \
  --arg run_id "${CLOSEDLOOP_RUN_ID:-unknown}" \
  --argjson iteration "${CLOSEDLOOP_ITERATION:-0}" \
  --argjson step "$PARENT_STEP" \
  --arg step_name "$PARENT_STEP_NAME" \
  --argjson sub_step "$SUB_STEP" \
  --arg sub_step_name "$SUB_STEP_NAME" \
  --arg started_at "$STARTED_AT" \
  --arg ended_at "$ENDED_AT" \
  --argjson duration_s "$DURATION" \
  --argjson exit_code 0 \
  --argjson skipped false \
  '{event:$event,run_id:$run_id,iteration:$iteration,step:$step,step_name:$step_name,sub_step:$sub_step,sub_step_name:$sub_step_name,started_at:$started_at,ended_at:$ended_at,duration_s:$duration_s,exit_code:$exit_code,skipped:$skipped}' >> "$CLOSEDLOOP_WORKDIR/perf.jsonl"
rm -f "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
Order of operations per phase: Run the "start of phase" Bash first (set SUB_STEP_NUM and SUB_STEP_LABEL at the top, then run the block), then perform the phase work, then run the "end of phase" Bash.
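To make the record shape explicit, here is a hedged Python rendering of the same pipeline_step event. It is illustrative only, since the skill emits events through the Bash blocks above. The helper names are hypothetical, and the skipped parameter shows how a skipped phase (such as context_prep in prd and feature modes) would differ from the hardcoded false in the Bash template.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Mapping, Optional

STAMP = "%Y-%m-%dT%H:%M:%SZ"  # same UTC format as the date -u calls


def build_pipeline_step_event(sub_step: int, sub_step_name: str,
                              started_at: datetime, ended_at: datetime,
                              skipped: bool = False,
                              env: Optional[Mapping[str, str]] = None) -> dict:
    """Build one perf.jsonl record; expects timezone-aware datetimes."""
    env = os.environ if env is None else env
    return {
        "event": "pipeline_step",
        "run_id": env.get("CLOSEDLOOP_RUN_ID") or "unknown",
        "iteration": int(env.get("CLOSEDLOOP_ITERATION") or 0),
        "step": int(env.get("CLOSEDLOOP_PARENT_STEP") or 0),
        "step_name": env.get("CLOSEDLOOP_PARENT_STEP_NAME") or "unknown",
        "sub_step": sub_step,
        "sub_step_name": sub_step_name,
        "started_at": started_at.astimezone(timezone.utc).strftime(STAMP),
        "ended_at": ended_at.astimezone(timezone.utc).strftime(STAMP),
        "duration_s": int((ended_at - started_at).total_seconds()),
        "exit_code": 0,
        "skipped": skipped,
    }


def append_event(workdir: Path, event: dict) -> None:
    """Append the event as one compact JSON line."""
    with (workdir / "perf.jsonl").open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, separators=(",", ":")) + "\n")
```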
Execution Workflow
Working Directory Resolution
Before any other step, resolve the working directory and export it as CLOSEDLOOP_WORKDIR:
# Resolve working directory (precedence: --workdir arg > env var > default)
if [ -n "$ARG_WORKDIR" ]; then
  WORKDIR="$ARG_WORKDIR"
elif [ -n "$CLOSEDLOOP_WORKDIR" ]; then
  WORKDIR="$CLOSEDLOOP_WORKDIR"
else
  WORKDIR="$(pwd)/.closedloop-ai/judges"
fi
mkdir -p "$WORKDIR"
export CLOSEDLOOP_WORKDIR="$WORKDIR"
Where $ARG_WORKDIR is the value passed via --workdir in the invocation prompt. All subsequent references to $CLOSEDLOOP_WORKDIR use this resolved value.
Agents Snapshot (Pre-Step)
Before any judge execution, ensure a snapshot of judge agent definitions exists in $CLOSEDLOOP_WORKDIR/agents-snapshot/. This preserves the exact agent versions used for each evaluation run.
Action: Run the snapshot script via Bash:
bash "${CLAUDE_PLUGIN_ROOT}/skills/run-judges/scripts/ensure_agents_snapshot.sh" "$CLOSEDLOOP_WORKDIR"
The script is idempotent — it skips if manifest.json already exists.
Error handling: If the script fails or is not found, log a warning and continue — snapshot failure must not block judge execution.
Agent Registry Validation (Pre-Flight Check)
Before any judge execution, validate the agent registry to ensure all judge agents required for the current artifact type are resolvable. This prevents launching batches only to discover agents are missing mid-run.
Action: Run validate_agent_registry.py via Bash:
uv run "${CLAUDE_PLUGIN_ROOT}/tools/python/validate_agent_registry.py" \
  --artifact-type "$ARTIFACT_TYPE" \
  --workdir "$CLOSEDLOOP_WORKDIR"
Exit behavior:
- Exit code 0: all required agents are registered; proceed with judge execution
- Exit code non-zero: one or more required agents are missing or unresolvable; abort immediately and do NOT proceed to judge batches
On failure:
- Log the validation error output in full
- Exit the skill with a non-zero status code
- Do NOT generate partial error CaseScores for this failure mode (the workflow should not proceed at all)
Step 0: Mandatory Contract Pre-Read
Before any prerequisite checks or judge launches:
- Resolve the contract file path using Glob with: **/skills/run-judges/references/judge-input-contract.md
- Read the resolved judge-input-contract.md file in full.
- Apply the contract requirements when constructing $CLOSEDLOOP_WORKDIR/judge-input.json.
- If the file is missing, ambiguous (multiple matches), or unreadable, fail fast with a clear error (do not proceed with judge execution).
Prerequisites Check
Performance: At the start of this phase run the "start of phase" Bash with SUB_STEP_NUM=0 and SUB_STEP_LABEL=context_manager for both plan and code modes. For prd and feature modes, emit sub_step=0 with SUB_STEP_LABEL=context_prep and skipped=true immediately (no context manager runs). At the end of the phase run the "end of phase" Bash.
Before starting, verify required inputs exist:
For plan artifacts (default):
# Validate input files exist
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping judges."
  exit 0 # Graceful skip - do not fail workflow
fi
if [ ! -f "$CLOSEDLOOP_WORKDIR/plan.json" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/plan.json not found. Skipping judges."
  exit 0
fi
Investigation log resolution (plan mode):
After validating prd.md and plan.json, resolve supporting context for plan judges:
Use existing file first
- If $CLOSEDLOOP_WORKDIR/investigation-log.md exists, use it as-is.
Check @code:pre-explorer availability before invoking
- Perform an explicit capability probe for @code:pre-explorer in the active Claude/plugin environment.
- Treat "unknown agent", "agent not found", or plugin resolution errors as pre-explorer unavailable.
- Recommended probe pattern:
  - Attempt a minimal Task() call targeting @code:pre-explorer.
  - If the platform rejects the agent type before execution, classify as unavailable and continue to internal fallback.
If available, invoke pre-explorer
- Launch @code:pre-explorer with WORKDIR=$CLOSEDLOOP_WORKDIR to generate missing pre-exploration artifacts.
- Re-check for $CLOSEDLOOP_WORKDIR/investigation-log.md after completion.
If unavailable or invocation failed, run internal fallback
- Generate investigation-log.md with a lightweight local-only investigation.
- Keep it fast and deterministic (no external web research).
- Internal fallback should:
  - Read prd.md and extract top entities/actions as search seeds.
  - Run targeted Glob/Grep against the local repository for likely implementation files.
  - Record top relevant files and short rationale under Files Discovered / Key Findings.
  - Add requirement-to-code evidence links under Requirements Mapping.
- Use the canonical sections:
  - ## Search Strategy
  - ## Files Discovered
  - ## Key Findings
  - ## Requirements Mapping
  - ## Uncertainties
Never block plan context preparation on investigation context
- If log generation still fails, emit a warning and continue.
Prepare plan-context.json via context-manager-for-judges
- Launch @judges:context-manager-for-judges with artifact_type=plan.
- Verify $CLOSEDLOOP_WORKDIR/plan-context.json exists.
- If missing after invocation, log warning and activate compatibility mode for this run:
  - Compatibility mode allows one emergency fallback to raw plan.json + prd.md.
  - Use compatibility mode only when context generation fails.
Plan-mode source-of-truth policy
- Normal mode: plan-context.json is primary and required.
- Compatibility mode: plan.json + prd.md may be used for this run only.
Build plan-mode judge-input.json
- Invoke scripts/judge_input_mapping.py with --artifact-type plan.
- The mapper sets evaluation_type, task, primary_artifact, supporting_artifacts, source_of_truth, fallback_mode, and metadata from the runtime workdir contract.
- In compatibility mode, allow the mapper to produce a schema-valid fallback envelope for existing plan compatibility artifacts and include prd.md as supporting evidence when available.
- If the mapper exits non-zero, log the error and use the one-run legacy fallback only if prd.md plus plan.md or the existing compatibility artifact is readable.
For code artifacts (--artifact-type code):
# Resolve investigation context for code judges (best effort)
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
  echo "INFO: investigation-log.md missing. Attempting best-effort generation via @code:pre-explorer..."
  # Launch @code:pre-explorer with WORKDIR=$CLOSEDLOOP_WORKDIR
  # If unavailable/fails, continue with warning (non-blocking for code judges)
fi
# Launch context-manager-for-judges agent to prepare compressed context
# This agent reads code artifacts (git diff, changed-files.json, etc.)
# and produces .closedloop-ai/context/code-context.json with token-budgeted compression
# investigation-log.md is optional secondary context for code judging
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
  echo "WARNING: investigation-log.md unavailable. Continuing code judges with canonical code context only."
fi
# Verify canonical code context exists after context manager completes. The root
# code-context.json path is fallback-only for old runs.
if [ ! -f "$CLOSEDLOOP_WORKDIR/.closedloop-ai/context/code-context.json" ] && [ ! -f "$CLOSEDLOOP_WORKDIR/code-context.json" ]; then
  echo "ERROR: Context preparation failed - .closedloop-ai/context/code-context.json not found"
  # Abort with error CaseScore for all judges
  # Generate error report with final_status=3, justification="Context preparation failed"
  exit 1
fi
# Build and validate code-mode judge-input.json with scripts/judge_input_mapping.py.
# The mapper prefers .closedloop-ai/context/code-context.json as primary and
# preserves root code-context.json as a one-run legacy fallback when needed.
For PRD artifacts (--artifact-type prd):
PRD mode does NOT use context-manager-for-judges. Context preparation is lightweight: verify the PRD document exists, then build judge-input.json directly from it.
# PRD mode context prep: check prd.md exists
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping PRD judges."
  exit 0 # Graceful exit: do not fail parent workflow
fi
# Build and validate prd-mode judge-input.json with scripts/judge_input_mapping.py.
# The mapper sets primary_artifact to primary_prd and includes mapped context,
# prompt, repo metadata, prior summaries, and attachments in source_of_truth order.
PRD context prep notes:
- Missing prd.md results in a WARNING and graceful exit (code 0), not an error
- No context manager is launched; judge-input.json is built by scripts/judge_input_mapping.py and validated against schemas/judge-input.schema.json
- Performance: emit sub_step=0 (context_prep, skipped=true) perf event immediately, then proceed to sub_step=1 (batch_1) and sub_step=2 (batch_2)
For Feature artifacts (--artifact-type feature):
Feature mode does NOT use context-manager-for-judges. Context preparation is lightweight: verify feature.md exists, or prd.md exists for legacy Feature inputs, then build judge-input.json from the mapper.
# Feature mode context prep: check feature.md or legacy prd.md exists
if [ ! -f "$CLOSEDLOOP_WORKDIR/feature.md" ] && [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
  echo "WARNING: neither $CLOSEDLOOP_WORKDIR/feature.md nor legacy $CLOSEDLOOP_WORKDIR/prd.md found. Skipping Feature judges."
  exit 0 # Graceful exit: do not fail parent workflow
fi
# Build and validate feature-mode judge-input.json with scripts/judge_input_mapping.py.
# The mapper prefers feature.md and marks fallback_mode.active=true when it must
# use the legacy prd.md Feature path.
Feature context prep notes:
- Missing both feature.md and legacy prd.md results in a WARNING and graceful exit (code 0), not an error
- No context manager is launched; judge-input.json is built by scripts/judge_input_mapping.py with evaluation_type="feature"
- Performance: emit sub_step=0 (context_prep, skipped=true) perf event immediately, then proceed to sub_step=1 (batch_1), sub_step=2 (aggregate), sub_step=3 (validate)
- Preamble: use feature_preamble.md for all 3 feature judges
If required files are missing:
- Plan mode: Exit gracefully with code 0 (do not fail parent workflow)
- Code mode: Exit with error if context preparation fails
- PRD mode: Exit gracefully with code 0 if prd.md is not found
- Feature mode: Exit gracefully with code 0 if neither feature.md nor legacy prd.md is found
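The per-mode rules above reduce to a small decision function. The sketch below is a hypothetical helper, not part of the skill: it returns None to proceed, 0 for a graceful skip, and 1 for the code-mode hard failure. For code mode it assumes the check runs after the context manager has finished.

```python
from pathlib import Path
from typing import Optional


def prerequisite_exit_code(artifact_type: str, workdir: Path) -> Optional[int]:
    """Return None to proceed, 0 to skip gracefully, 1 to fail."""

    def has(*parts: str) -> bool:
        return workdir.joinpath(*parts).is_file()

    if artifact_type == "plan":
        return None if has("prd.md") and has("plan.json") else 0
    if artifact_type == "code":
        # Canonical context path first; the root file is the legacy fallback.
        canonical = has(".closedloop-ai", "context", "code-context.json")
        legacy = has("code-context.json")
        return None if canonical or legacy else 1
    if artifact_type == "prd":
        return None if has("prd.md") else 0
    if artifact_type == "feature":
        return None if has("feature.md") or has("prd.md") else 0
    raise ValueError(f"unknown artifact type: {artifact_type}")
```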
Artifact Type Configuration
The run-judges skill supports four artifact types with different judge configurations:
Plan Artifacts (Default)
- Judges: 16 total
- Batches: 4 sequential batches (max 4 concurrent per batch)
- Output: plan-judges.json
- Report ID: {RUN_ID}-plan-judges
- Validation: --category plan (16 judges expected)
Code Artifacts (--artifact-type code)
- Judges: 11 total (the plan set minus goal-alignment-judge, verbosity-judge, brownfield-accuracy-judge, codebase-grounding-judge, and convention-adherence-judge)
- Batches: 3 sequential batches (max 4 concurrent per batch)
- Output: code-judges.json
- Report ID: {RUN_ID}-code-judges
- Validation: --category code (11 judges expected)
Code Judge Batches:
Batch 1: Core Principles (4 judges)
- judges:dry-judge
- judges:ssot-judge
- judges:kiss-judge
- judges:code-organization-judge
Batch 2: Best Practices + SOLID Principles (4 judges)
- judges:custom-best-practices-judge
- judges:readability-judge
- judges:solid-isp-dip-judge
- judges:solid-liskov-substitution-judge
Batch 3: Technical Quality + Testing (3 judges)
- judges:solid-open-closed-judge
- judges:technical-accuracy-judge
- judges:test-judge
PRD Artifacts (--artifact-type prd)
- Judges: 5 total
- Batches: 2 sequential batches (max 4 concurrent per batch)
- Output: prd-judges.json
- Report ID: {RUN_ID}-prd-judges
- Validation: --category prd (5 judges expected)
- Canonical input: $CLOSEDLOOP_WORKDIR/judge-input.json produced by scripts/judge_input_mapping.py, with primary_prd normally pointing to $CLOSEDLOOP_WORKDIR/prd.md
Feature Artifacts (--artifact-type feature)
- Judges: 3 total (feature-completeness-judge, prd-testability-judge, prd-dependency-judge)
- Batches: 1 batch (max 4 concurrent per batch)
- Output: feature-judges.json
- Report ID: {RUN_ID}-feature-judges
- Validation: --category feature (3 judges expected)
- Canonical input: $CLOSEDLOOP_WORKDIR/judge-input.json produced by scripts/judge_input_mapping.py, with primary_feature normally pointing to feature.md and legacy fallback to prd.md
- Preamble: use feature_preamble.md (Feature-shaped contract; do NOT substitute prd_preamble.md)
Feature Mode Execution:
Batch 1: Feature Quality (sub_step=1)
- judges:feature-completeness-judge: evaluates Feature request completeness and clarity
- judges:prd-testability-judge: evaluates requirement testability
- judges:prd-dependency-judge: evaluates dependency clarity and completeness
PRD Mode Execution:
Batch 1: Structure & Completeness (sub_step=1)
- judges:feature-completeness-judge: evaluates Feature request completeness and clarity
- judges:prd-auditor: structural completeness audit of the PRD
- judges:prd-scope-judge: evaluates scope definition and boundary clarity
Batch 2: Quality Gates (sub_step=2)
- judges:prd-dependency-judge: evaluates dependency clarity and completeness
- judges:prd-testability-judge: evaluates requirement testability
Step 1: Launch Judge Agents in Parallel
Performance: For each batch/phase, run "start of phase" Bash before launching the batch and "end of phase" Bash after the batch completes. Plan: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3, batch_4=sub_step 4. Code: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3. PRD: batch_1=sub_step 1, batch_2=sub_step 2. Feature: batch_1=sub_step 1.
Constraint: The Task tool supports maximum 4 concurrent agents per batch.
Action: Launch judges in sequential batches based on artifact type.
Plan Artifact Judge Batches (16 judges, 4 batches)
Batch 1: Core Principles (DRY/SSOT/KISS + Organization)
| Agent Type | Evaluates |
|---|---|
| judges:dry-judge | Don't Repeat Yourself violations |
| judges:ssot-judge | Single Source of Truth violations |
| judges:kiss-judge | Keep It Simple violations |
| judges:code-organization-judge | File and folder structure organization |
Batch 2: Best Practices + Response Quality
| Agent Type | Evaluates |
|---|---|
| judges:custom-best-practices-judge | Adherence to custom best practices documents |
| judges:goal-alignment-judge | Alignment with stated health goals |
| judges:readability-judge | Plan readability, clarity, structure, template adherence |
| judges:verbosity-judge | Verbosity calibration to problem complexity |
Batch 3: SOLID Principles
| Agent Type | Evaluates |
|---|---|
| judges:solid-isp-dip-judge | Interface Segregation & Dependency Inversion Principles |
| judges:solid-liskov-substitution-judge | Liskov Substitution Principle adherence |
| judges:solid-open-closed-judge | Open/Closed Principle adherence |
| judges:technical-accuracy-judge | Technical accuracy (API usage, algorithms) |
Batch 4: Plan Grounding + Testing
| Agent Type | Evaluates |
|---|---|
| judges:test-judge | Test coverage, assertions, structure, best practices |
| judges:brownfield-accuracy-judge | Reuse vs reimplementation, integration-point accuracy, scope accuracy against investigation findings |
| judges:codebase-grounding-judge | File-path/module-reference accuracy and existing-code awareness grounded in investigation findings |
| judges:convention-adherence-judge | Alignment with established naming, structural, and tooling conventions in the codebase |
PRD Artifact Judge Batches (5 judges, 2 batches)
Batch 1: Structure & Completeness (sub_step=1)
| Agent Type | Evaluates |
|---|---|
| judges:feature-completeness-judge | Feature request completeness and clarity |
| judges:prd-auditor | Structural completeness, section coverage, clarity |
| judges:prd-scope-judge | Scope definition and boundary clarity |
Batch 2: Quality Gates (sub_step=2)
| Agent Type | Evaluates |
|---|---|
| judges:prd-dependency-judge | Dependency clarity and completeness |
| judges:prd-testability-judge | Requirement testability and measurability |
Feature Artifact Judge Batches (3 judges, 1 batch)
Batch 1: Feature Quality (sub_step=1)
| Agent Type | Evaluates |
|---|---|
| judges:feature-completeness-judge | Feature request completeness and clarity |
| judges:prd-testability-judge | Requirement testability and measurability |
| judges:prd-dependency-judge | Dependency clarity and completeness |
Excluded judges (feature mode):
- judges:prd-auditor: excluded because it assumes US-###/AC-#.# numbering and multi-story traceability that Feature artifacts do not follow
- judges:prd-scope-judge: excluded because it assumes In/Out-of-Scope sections that are not present in Feature artifacts
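The batch tables for all four artifact types can be captured as data and checked against the stated judge counts and the 4-agent limit. The sketch below is one hypothetical way to do that; the constant and function names are illustrative, while the judge names and groupings are copied from the batches listed in this document.

```python
from typing import Dict, List

MAX_CONCURRENT = 4  # Task tool limit per batch
EXPECTED_COUNTS = {"plan": 16, "code": 11, "prd": 5, "feature": 3}

JUDGE_BATCHES: Dict[str, List[List[str]]] = {
    "plan": [
        ["dry-judge", "ssot-judge", "kiss-judge", "code-organization-judge"],
        ["custom-best-practices-judge", "goal-alignment-judge", "readability-judge", "verbosity-judge"],
        ["solid-isp-dip-judge", "solid-liskov-substitution-judge", "solid-open-closed-judge", "technical-accuracy-judge"],
        ["test-judge", "brownfield-accuracy-judge", "codebase-grounding-judge", "convention-adherence-judge"],
    ],
    "code": [
        ["dry-judge", "ssot-judge", "kiss-judge", "code-organization-judge"],
        ["custom-best-practices-judge", "readability-judge", "solid-isp-dip-judge", "solid-liskov-substitution-judge"],
        ["solid-open-closed-judge", "technical-accuracy-judge", "test-judge"],
    ],
    "prd": [
        ["feature-completeness-judge", "prd-auditor", "prd-scope-judge"],
        ["prd-dependency-judge", "prd-testability-judge"],
    ],
    "feature": [
        ["feature-completeness-judge", "prd-testability-judge", "prd-dependency-judge"],
    ],
}


def check_batches(batches: Dict[str, List[List[str]]], expected: Dict[str, int]) -> None:
    """Raise ValueError if a batch is too large or a judge count is off."""
    for artifact_type, groups in batches.items():
        names = [name for group in groups for name in group]
        if any(len(group) > MAX_CONCURRENT for group in groups):
            raise ValueError(f"{artifact_type}: batch exceeds {MAX_CONCURRENT} agents")
        if len(names) != expected[artifact_type] or len(set(names)) != len(names):
            raise ValueError(f"{artifact_type}: expected {expected[artifact_type]} unique judges")


check_batches(JUDGE_BATCHES, EXPECTED_COUNTS)
```

Keeping the batches as data makes the 16/11/5/3 totals mechanically checkable whenever a judge is added or moved.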
Preamble Injection
Before invoking each judge, prepend the common and artifact-specific preambles:
Locate preamble files:
- skills/artifact-type-tailored-context/preambles/common_input_preamble.md
- skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md
- Use Glob tool to find: **/artifact-type-tailored-context/preambles/*.md
- Validate both files exist (fail with error CaseScore if either is missing)
Read preamble content:
- Read common_input_preamble.md
- Read {artifact_type}_preamble.md
- Validate combined preamble size is reasonable for judge context (target: < 8000 characters)
Concatenate:
- common_input_preamble + "\n\n---\n\n" + artifact_preamble + "\n\n---\n\n" + judge_prompt
- common_input_preamble.md is the only runtime source of judge input-loading contract text; judge-specific agent files should not duplicate that contract.
Pass to judge: Use concatenated prompt as judge's full prompt
If either preamble file is missing:
- Generate error CaseScore with final_status=3, justification="Preamble file not found: {path}"
- Continue with other judges
NOTE (Feature mode): When --artifact-type feature is used, resolve {artifact_type}_preamble.md as feature_preamble.md (not prd_preamble.md). The Feature preamble frames the input as a Feature artifact (evaluation_type=feature, lightweight structure, no PRD-only sections such as US-###/AC-#.# numbering or In/Out-of-Scope) and aligns with the envelope built by feature mode. Substituting prd_preamble.md would inject contradictory contract instructions and may cause judges to error or evaluate against PRD-only expectations.
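The concatenation rule can be sketched as a small helper. The function names are hypothetical, and two details are assumptions: the size check counts only the two preambles (not the separators), and the result is reported as a flag rather than an error, since the text does not say what happens when the 8000-character target is exceeded.

```python
from typing import Tuple

SEPARATOR = "\n\n---\n\n"
PREAMBLE_TARGET_CHARS = 8000  # target: combined preambles stay below this


def artifact_preamble_name(artifact_type: str) -> str:
    """Map an artifact type to its preamble file, e.g. feature -> feature_preamble.md."""
    return f"{artifact_type}_preamble.md"


def build_judge_prompt(common_preamble: str, artifact_preamble: str,
                       judge_prompt: str) -> Tuple[str, bool]:
    """Return the full prompt and whether the preambles are within the size target."""
    within_target = len(common_preamble) + len(artifact_preamble) < PREAMBLE_TARGET_CHARS
    prompt = SEPARATOR.join([common_preamble, artifact_preamble, judge_prompt])
    return prompt, within_target
```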
Prompt Templates
For plan artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
If `fallback_mode.active=true`, use fallback artifacts specified in the envelope.
For code artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
Apply your {judge_name} criteria to assess code quality.
For PRD artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as the authoritative PRD document and load supporting descriptors as source-of-truth evidence.
Apply your {judge_name} criteria to assess PRD quality.
For Feature artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as the authoritative Feature document and load supporting descriptors as source-of-truth evidence.
Apply your {judge_name} criteria to assess Feature quality.
Expected Output Format
{
  "type": "case_score",
  "case_id": "dry-judge",
  "final_status": 1,
  "metrics": [
    {
      "metric_name": "dry_score",
      "threshold": 0.8,
      "score": 0.85,
      "justification": "Plan follows DRY principles..."
    }
  ]
}
Status Code Semantics:
| Code | Meaning | When to Use |
|---|---|---|
| 1 | Pass | Score meets or exceeds threshold |
| 2 | Fail | Score below threshold |
| 3 | Error | Judge execution failed |
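As a sketch of how the pass/fail rows map to code, the helper below derives final_status for a judge that ran successfully. The function name is hypothetical, and two rules are assumptions not stated in this document: a multi-metric judge passes only when every metric passes, and a metric without a threshold cannot fail the judge. Status 3 is assigned by the orchestrator when the judge did not run, so it is not computed here.

```python
PASS, FAIL, ERROR = 1, 2, 3  # ERROR is set by the orchestrator, not derived from scores


def derive_final_status(metrics: list[dict]) -> int:
    """Return PASS when every thresholded metric meets its threshold, else FAIL."""
    for metric in metrics:
        threshold = metric.get("threshold")
        # Assumption: a metric with no threshold cannot fail the judge.
        if threshold is not None and metric["score"] < threshold:
            return FAIL
    return PASS
```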
Error Handling Protocol
CRITICAL REQUIREMENT: If a judge Task call fails, you MUST construct an error CaseScore.
Error CaseScore Template:
{
  "type": "case_score",
  "case_id": "{judge-name}",
  "final_status": 3,
  "error_reason": "Brief human-readable description of what failed",
  "metrics": [
    {
      "metric_name": "{metric}_score",
      "threshold": 0.8,
      "score": 0.0,
      "justification": "Judge execution failed: {error message}"
    }
  ]
}
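The template can be filled programmatically. The following is a minimal sketch; make_error_case_score and its parameters are hypothetical names, and the caller is assumed to know the metric name for the failed judge.

```python
def make_error_case_score(
    judge_name: str,
    metric_name: str,
    error_message: str,
    threshold: float = 0.8,
) -> dict:
    """Build an error CaseScore dict that mirrors the template fields."""
    return {
        "type": "case_score",
        "case_id": judge_name,
        "final_status": 3,  # 3 = error
        "error_reason": error_message,
        "metrics": [
            {
                "metric_name": metric_name,
                "threshold": threshold,
                "score": 0.0,
                "justification": f"Judge execution failed: {error_message}",
            }
        ],
    }
```

For example, make_error_case_score("dry-judge", "dry_score", "Task tool error: agent not found") yields one complete error entry for the stats array.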
error_reason field guidance:
- When to set it: Set error_reason whenever final_status=3. Common cases include:
  - Tool failures (e.g., Task tool returned an error, agent invocation rejected)
  - Parse errors (e.g., judge output could not be parsed as valid CaseScore JSON)
  - Timeouts (e.g., judge agent did not respond within the allowed time)
  - Preamble file not found (e.g., required {artifact_type}_preamble.md missing)
  - Context preparation failures passed down to individual judge error scores
- What to put in it: A brief, human-readable string describing the specific failure. Examples:
  - "Task tool error: agent not found"
  - "Parse error: response was not valid JSON"
  - "Timeout: judge did not complete within 5 minutes"
  - "Preamble file not found: plan_preamble.md"
- Effect on aggregation: CaseScores with final_status=3 are excluded by compute_average_excluding_errors, which then averages MetricStatistics.score across every metric of every remaining (non-errored) CaseScore. error_reason is informational and does not control exclusion (see field docstring at validate_judge_report.py:46). Errored judges do not drag down the aggregate score for judges that did execute successfully.
Aggregation rules when errors are present:
- If SOME judges have final_status=3, compute_average_excluding_errors returns the average of MetricStatistics.score across only the non-errored judges (return type Optional[float]). Callers rendering this for humans should annotate the value as "avg of N/M judges" by separately computing N (non-errored CaseScore count) and M (total CaseScore count) from the input list; the function itself does not return the annotation.
- If ALL judges have final_status=3, or no non-errored judge contributes any metric, compute_average_excluding_errors returns None, because no meaningful average can be computed.
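The aggregation behavior described above can be sketched against plain dicts. This is a stand-in written from the prose description only; the real compute_average_excluding_errors lives in validate_judge_report.py and may differ in signature and input types. The names average_excluding_errors and judge_counts are hypothetical.

```python
from typing import Optional


def average_excluding_errors(case_scores: list[dict]) -> Optional[float]:
    """Average every metric score of every non-errored CaseScore, or None."""
    scores = [
        metric["score"]
        for case in case_scores
        if case["final_status"] != 3  # exclusion keys off final_status only
        for metric in case["metrics"]
    ]
    return sum(scores) / len(scores) if scores else None


def judge_counts(case_scores: list[dict]) -> tuple[int, int]:
    """Return (N, M): non-errored judge count and total judge count."""
    total = len(case_scores)
    ok = sum(1 for case in case_scores if case["final_status"] != 3)
    return ok, total
```

A caller can combine the two results to render the "avg of N/M judges" annotation.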
Continue-on-failure semantics:
- Even if ALL judges fail, you MUST aggregate error CaseScores
- Always produce a complete report with 16 CaseScore entries (plan), 11 CaseScore entries (code), 5 CaseScore entries (prd), or 3 CaseScore entries (feature)
- Never abort the workflow due to judge failures
Summary Table Formatting
When displaying the evaluation results summary (e.g., in the final output or any human-readable report), follow these conventions for errored scores:
Errored score display:
- Use the ERR marker in place of a numeric score for any judge whose CaseScore has final_status=3. error_reason, when present, can be displayed in a hover/tooltip or separate column but does not control whether ERR is shown.
Example summary table:
| Judge | Score | Status |
|---|---|---|
| dry-judge | 0.92 | PASS |
| ssot-judge | ERR | ERROR |
| kiss-judge | 0.75 | FAIL |
| readability-judge | ERR | ERROR |
Average annotation:
- When some judges are excluded due to errors, annotate the aggregate average as "avg of N/M judges", where N is the number of non-errored judges and M is the total number of judges.
- Example: avg of 14/16 judges
Footer line:
- When one or more judges are excluded, add a footer line to the summary: X of Y judges excluded due to errors, where X is the count of errored judges and Y is the total expected judge count.
- Example: 2 of 16 judges excluded due to errors
When ALL judges errored:
- Display ERR for every judge row
- Display N/A (rather than a number) for the aggregate average; do not attempt to compute or display an average
- Footer: Y of Y judges excluded due to errors
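The display conventions in this section can be sketched as one rendering function. The name render_summary, the "Average:" label, and the choice to show the mean of a judge's metric scores in its row are assumptions for illustration; the ERR marker, the N/M annotation, the N/A case, and the footer wording follow the rules above. The sketch uses the number of CaseScores passed in as the total, which equals the expected judge count when the report is complete.

```python
STATUS_LABELS = {1: "PASS", 2: "FAIL", 3: "ERROR"}


def render_summary(case_scores: list[dict]) -> str:
    """Render the summary table, average line, and optional footer."""
    lines = ["| Judge | Score | Status |", "|---|---|---|"]
    ok_scores: list[float] = []
    for case in case_scores:
        if case["final_status"] == 3:
            shown = "ERR"
        else:
            # Assumption: a row shows the mean of that judge's metric scores.
            per_judge = [m["score"] for m in case["metrics"]]
            ok_scores.extend(per_judge)
            shown = f"{sum(per_judge) / len(per_judge):.2f}"
        label = STATUS_LABELS[case["final_status"]]
        lines.append(f"| {case['case_id']} | {shown} | {label} |")

    total = len(case_scores)
    errored = sum(1 for c in case_scores if c["final_status"] == 3)
    if ok_scores:
        average = f"{sum(ok_scores) / len(ok_scores):.2f}"
        if errored:
            average += f" (avg of {total - errored}/{total} judges)"
    else:
        average = "N/A"  # every judge errored: no average is computed
    lines.append(f"Average: {average}")
    if errored:
        lines.append(f"{errored} of {total} judges excluded due to errors")
    return "\n".join(lines)
```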
Step 2: Aggregate Results into EvaluationReport
Performance: Run "start of phase" with sub_step 5 (plan), 4 (code), 3 (prd), or 2 (feature), sub_step_name=aggregate. Emit 'end of phase' after the aggregation step regardless of file write outcome.
Task: Collect all CaseScore outputs and structure them into an EvaluationReport.
Output file logic:
if artifact_type == 'code':
    report_filename = 'code-judges.json'
    report_id = f'{RUN_ID}-code-judges'
elif artifact_type == 'prd':
    report_filename = 'prd-judges.json'
    report_id = f'{RUN_ID}-prd-judges'
elif artifact_type == 'feature':
    report_filename = 'feature-judges.json'
    report_id = f'{RUN_ID}-feature-judges'
else:
    report_filename = 'plan-judges.json'
    report_id = f'{RUN_ID}-plan-judges'
output_path = $CLOSEDLOOP_WORKDIR / report_filename
Plan artifact report structure (plan-judges.json):
{
  "report_id": "{RUN_ID}-plan-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from dry-judge */ },
    { /* CaseScore from ssot-judge */ },
    { /* CaseScore from kiss-judge */ },
    { /* CaseScore from code-organization-judge */ },
    { /* CaseScore from custom-best-practices-judge */ },
    { /* CaseScore from goal-alignment-judge */ },
    { /* CaseScore from readability-judge */ },
    { /* CaseScore from verbosity-judge */ },
    { /* CaseScore from solid-isp-dip-judge */ },
    { /* CaseScore from solid-liskov-substitution-judge */ },
    { /* CaseScore from solid-open-closed-judge */ },
    { /* CaseScore from technical-accuracy-judge */ },
    { /* CaseScore from test-judge */ },
    { /* CaseScore from brownfield-accuracy-judge */ },
    { /* CaseScore from codebase-grounding-judge */ },
    { /* CaseScore from convention-adherence-judge */ }
  ]
}
Code artifact report structure (code-judges.json):
{
  "report_id": "{RUN_ID}-code-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from dry-judge */ },
    { /* CaseScore from ssot-judge */ },
    { /* CaseScore from kiss-judge */ },
    { /* CaseScore from code-organization-judge */ },
    { /* CaseScore from custom-best-practices-judge */ },
    { /* CaseScore from readability-judge */ },
    { /* CaseScore from solid-isp-dip-judge */ },
    { /* CaseScore from solid-liskov-substitution-judge */ },
    { /* CaseScore from solid-open-closed-judge */ },
    { /* CaseScore from technical-accuracy-judge */ },
    { /* CaseScore from test-judge */ }
  ]
}
PRD artifact report structure (prd-judges.json):
{
  "report_id": "{RUN_ID}-prd-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from feature-completeness-judge */ },
    { /* CaseScore from prd-auditor */ },
    { /* CaseScore from prd-dependency-judge */ },
    { /* CaseScore from prd-testability-judge */ },
    { /* CaseScore from prd-scope-judge */ }
  ]
}
Feature artifact report structure (feature-judges.json):
{
  "report_id": "{RUN_ID}-feature-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from feature-completeness-judge */ },
    { /* CaseScore from prd-testability-judge */ },
    { /* CaseScore from prd-dependency-judge */ }
  ]
}
Field requirements:
| Field | Format | How to Derive |
|---|---|---|
| report_id | {RUN_ID}-plan-judges, {RUN_ID}-code-judges, {RUN_ID}-prd-judges, or {RUN_ID}-feature-judges | Extract RUN_ID from $CLOSEDLOOP_WORKDIR directory name, append suffix based on artifact type |
| timestamp | ISO 8601 | Generate with date -u +%Y-%m-%dT%H:%M:%SZ |
| stats | Array[CaseScore] | 16 CaseScore objects for plan, 11 for code, 5 for prd, 3 for feature (one per judge) |
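The field requirements can be assembled in a few lines. The sketch below is illustrative only: build_report is a hypothetical helper, and taking RUN_ID as the final path component of the working directory is an assumption, since this document says only that RUN_ID is extracted from the directory name.

```python
from datetime import datetime, timezone
from pathlib import Path

REPORT_SUFFIX = {
    "plan": "plan-judges",
    "code": "code-judges",
    "prd": "prd-judges",
    "feature": "feature-judges",
}


def build_report(workdir: str, artifact_type: str, case_scores: list[dict]) -> tuple[Path, dict]:
    """Return (output path, EvaluationReport dict) for one artifact type."""
    suffix = REPORT_SUFFIX[artifact_type]
    # Assumption: RUN_ID is the last component of the working directory path.
    run_id = Path(workdir).name
    report = {
        "report_id": f"{run_id}-{suffix}",
        # Same shape as `date -u +%Y-%m-%dT%H:%M:%SZ`.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "stats": case_scores,
    }
    return Path(workdir) / f"{suffix}.json", report
```

The returned dict can be written with json.dump to the returned path before validation runs.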
Step 3: Validate Output (MANDATORY)
Performance: Run "start of phase" with sub_step 6 (plan), 5 (code), 4 (prd), or 3 (feature), sub_step_name=validate. Emit 'end of phase' after each validation attempt regardless of exit code, then apply failure recovery logic.
CRITICAL: You MUST run the validation script after writing the judge report. Do not consider the task complete until validation passes.
Step 3.1: Locate the Validation Script
The script is in this skill's scripts/ directory:
SCRIPT_PATH="scripts/validate_judge_report.py"
Step 3.2: Ensure uv is Installed
if ! command -v uv &> /dev/null; then
  # Install uv; alternatives: brew install uv, pip install uv
  curl -LsSf https://astral.sh/uv/install.sh | sh
fi
Step 3.3: Run Validation
# CRITICAL: Run from script's directory so uv can find inline dependencies
cd "$(dirname "$SCRIPT_PATH")"
# Determine category based on artifact type
CATEGORY="plan" # default
if [ "$ARTIFACT_TYPE" = "code" ]; then
  CATEGORY="code"
elif [ "$ARTIFACT_TYPE" = "prd" ]; then
  CATEGORY="prd"
elif [ "$ARTIFACT_TYPE" = "feature" ]; then
  CATEGORY="feature"
fi
# Run validation with appropriate category.
# After the cd above, a relative SCRIPT_PATH no longer resolves, so invoke the script by file name.
uv run "$(basename "$SCRIPT_PATH")" --workdir "$CLOSEDLOOP_WORKDIR" --category "$CATEGORY"
Argument requirements:
- --workdir must be the absolute path to $CLOSEDLOOP_WORKDIR. This is where plan-judges.json, code-judges.json, prd-judges.json, or feature-judges.json is located
- --category must be plan (16 judges), code (11 judges), prd (5 judges), or feature (3 judges)
Validation Checks
The script validates using strict Pydantic models:
| Check | Requirement |
|---|---|
| JSON syntax | Valid JSON format |
| Required fields | report_id, timestamp, stats array |
| Judge coverage | All expected judges present (16 for plan, 11 for code, 5 for prd, 3 for feature) |
| Status values | final_status ∈ {1, 2, 3} |
| Metric completeness | Each judge has ≥1 metric |
| Report ID format | Ends with '-judges' (plan), '-code-judges' (code), '-prd-judges' (prd), or '-feature-judges' (feature) |
Expected judge case_ids for plan artifacts (16 total):
brownfield-accuracy-judge
code-organization-judge
codebase-grounding-judge
convention-adherence-judge
custom-best-practices-judge
dry-judge
goal-alignment-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
verbosity-judge
Expected judge case_ids for code artifacts (11 total):
code-organization-judge
custom-best-practices-judge
dry-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
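Note: Code artifacts exclude the five plan-only judges: brownfield-accuracy-judge, codebase-grounding-judge, convention-adherence-judge, goal-alignment-judge, verbosity-judge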
Expected judge case_ids for PRD artifacts (5 total):
feature-completeness-judge
prd-auditor
prd-dependency-judge
prd-testability-judge
prd-scope-judge
Note: PRD judges run in 2 sequential batches (3 + 2) to respect the Task tool's 4-concurrent-agent limit.
Expected judge case_ids for Feature artifacts (3 total):
feature-completeness-judge
prd-dependency-judge
prd-testability-judge
Note: Feature judges run in 1 batch. prd-auditor and prd-scope-judge are excluded — see Feature mode judge selection rationale in Task Context section.
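A judge-coverage check over these lists can be sketched as below. The case_ids are copied from the lists above; the names EXPECTED_JUDGES and missing_judges are hypothetical and are not taken from validate_judge_report.py.

```python
CODE_JUDGES = [
    "code-organization-judge", "custom-best-practices-judge", "dry-judge",
    "kiss-judge", "readability-judge", "solid-isp-dip-judge",
    "solid-liskov-substitution-judge", "solid-open-closed-judge",
    "ssot-judge", "technical-accuracy-judge", "test-judge",
]
PLAN_ONLY_JUDGES = [
    "brownfield-accuracy-judge", "codebase-grounding-judge",
    "convention-adherence-judge", "goal-alignment-judge", "verbosity-judge",
]
EXPECTED_JUDGES = {
    "plan": CODE_JUDGES + PLAN_ONLY_JUDGES,
    "code": CODE_JUDGES,
    "prd": [
        "feature-completeness-judge", "prd-auditor", "prd-dependency-judge",
        "prd-testability-judge", "prd-scope-judge",
    ],
    "feature": [
        "feature-completeness-judge", "prd-dependency-judge",
        "prd-testability-judge",
    ],
}


def missing_judges(report: dict, artifact_type: str) -> list[str]:
    """Return expected case_ids that have no CaseScore in report['stats']."""
    present = {case["case_id"] for case in report["stats"]}
    return sorted(set(EXPECTED_JUDGES[artifact_type]) - present)
```

An empty result means every expected judge has an entry; any returned case_id needs a real or error CaseScore before validation can pass.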
Validation Exit Codes
| Code | Meaning | Action |
|---|---|---|
| 0 | Valid | Task complete ✓ |
| 1 | Invalid | Read error, fix report JSON, re-validate |
If Validation Fails
Follow this sequence:
- Read error message - Understand what failed
- Fix report JSON - Correct the specific validation error
- Re-run validation - Repeat until exit code 0
- Never skip validation - Do not mark task complete until validation passes
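The read, fix, and re-run sequence above can be sketched as a small loop. Everything here is illustrative: validate_until_clean and the fix_report callback are hypothetical names, and the max_attempts cap is an added assumption so the sketch cannot loop forever (the sequence above simply repeats until exit code 0).

```python
import subprocess
from typing import Callable, Sequence


def validate_until_clean(
    command: Sequence[str],
    fix_report: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Run the validation command until it exits 0 or attempts run out."""
    for _ in range(max_attempts):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # validation passed; the task may be marked complete
        # Hand the validator's message to the caller-supplied fixer, then retry.
        fix_report(result.stderr or result.stdout)
    return False
```

A False return means validation never passed, so the task must not be marked complete.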
Reference: Pydantic Models
The validation script uses these strict Pydantic models:
from typing import List, Optional

from pydantic import BaseModel


class MetricStatistics(BaseModel):
    """A single metric evaluation result."""

    metric_name: str
    threshold: Optional[float] = None
    score: float
    justification: str


class CaseScore(BaseModel):
    """Score for a single judge evaluation."""

    type: Optional[str] = "case_score"
    case_id: str
    final_status: int  # 1=pass, 2=fail, 3=error
    metrics: List[MetricStatistics]
    # Set when final_status=3; informational only (exclusion from averages keys off final_status)
    error_reason: Optional[str] = None


class EvaluationReport(BaseModel):
    """Top-level report containing all judge evaluations."""

    report_id: str
    timestamp: str
    stats: List[CaseScore]
Model constraints:
- ConfigDict(strict=True) enforces exact type matching
- final_status validator rejects values outside {1, 2, 3}
Success Checklist
Before marking this task complete, verify:
For all artifact types:
- Agents snapshot - agents-snapshot/manifest.json exists in $CLOSEDLOOP_WORKDIR (created if missing, skipped if present)
For plan artifacts (default):
- Input validation - prd.md and plan.json exist (or graceful skip)
- Context preparation - context-manager-for-judges launched with artifact_type=plan
- Plan context validation - plan-context.json exists, or compatibility mode explicitly activated
- Judge input contract - judge-input.json exists with required fields
- Investigation context resolution - investigation-log.md reused, generated via pre-explorer, or best-effort generated internally
- Parallel execution - All 16 judges launched in 4 batches (max 4 per batch)
- Result aggregation - Valid EvaluationReport with 16 CaseScore entries
- File output - plan-judges.json written to $CLOSEDLOOP_WORKDIR
- Validation passed - Script exits with code 0 using --category plan
For code artifacts (--artifact-type code):
- Context preparation - context-manager-for-judges agent launched successfully
- Context validation - canonical .closedloop-ai/context/code-context.json exists at $CLOSEDLOOP_WORKDIR, or root code-context.json fallback is explicitly used for an old run
- Judge input contract - judge-input.json exists with required fields
- Investigation context resolution - investigation-log.md reused or generated best-effort; missing file does not block code judging
- Preamble injection - common_input_preamble.md + code_preamble.md prepended to all judge prompts
- Parallel execution - All 11 judges launched in 3 batches (max 4 per batch)
- Result aggregation - Valid EvaluationReport with 11 CaseScore entries
- File output - code-judges.json written to $CLOSEDLOOP_WORKDIR
- Report ID format - report_id ends with '-code-judges'
- Validation passed - Script exits with code 0 using --category code
For PRD artifacts (--artifact-type prd):
- prd.md existence check - $CLOSEDLOOP_WORKDIR/prd.md found, or graceful exit with WARNING (code 0)
- No context manager - context-manager-for-judges is NOT launched for prd mode
- Judge input contract - scripts/judge_input_mapping.py wrote schema-valid judge-input.json with evaluation_type="prd" and primary_artifact.id="primary_prd"
- Parallel execution - 5 PRD judges launched in 2 sequential batches: batch_1 (sub_step=1, 3 judges) and batch_2 (sub_step=2, 2 judges), max 4 concurrent per batch
- Result aggregation - Valid EvaluationReport with 5 CaseScore entries (sub_step=3)
- File output - prd-judges.json written to $CLOSEDLOOP_WORKDIR
- Report ID format - report_id ends with '-prd-judges'
- Validation passed - Script exits with code 0 using --category prd (sub_step=4)
For Feature artifacts (--artifact-type feature):
- Feature input existence check - $CLOSEDLOOP_WORKDIR/feature.md or legacy $CLOSEDLOOP_WORKDIR/prd.md found; otherwise emit a sub_step=0 (skipped=true) perf event, emit a WARNING, and exit gracefully (code 0)
- No context manager - context-manager-for-judges is NOT launched for feature mode
- Judge input contract - scripts/judge_input_mapping.py wrote schema-valid judge-input.json with evaluation_type="feature" and primary_artifact.id="primary_feature"
- Preamble - feature_preamble.md used for all 3 feature judges (Feature-shaped contract; do NOT substitute prd_preamble.md)
- Parallel execution - 3 feature judges launched in 1 batch (sub_step=1): feature-completeness-judge + prd-testability-judge + prd-dependency-judge
- Result aggregation - Valid EvaluationReport with 3 CaseScore entries (sub_step=2)
- File output - feature-judges.json written to $CLOSEDLOOP_WORKDIR
- Report ID format - report_id ends with '-feature-judges'
- Validation passed - Script exits with code 0 using --category feature (sub_step=3)
Troubleshooting Guide
| Error Message | Root Cause | Solution |
|---|---|---|
| "Report file does not exist" | File not written to correct location | Verify $CLOSEDLOOP_WORKDIR is set; check write path matches artifact type (plan-judges.json, code-judges.json, prd-judges.json, or feature-judges.json) |
| "Invalid JSON" | Syntax error in output file | Run python3 -m json.tool "$CLOSEDLOOP_WORKDIR/{plan,code,prd,feature}-judges.json" to identify syntax error |
| "Missing expected judges" | Incomplete batch execution | Verify all batches launched (4 for plan, 3 for code, 2 for prd, 1 for feature); check error CaseScores for failures; plan expects 16 judges, code expects 11, prd expects 5, feature expects 3 |
| "final_status must be 1, 2, or 3" | Invalid status code | Use only: 1 (pass), 2 (fail), 3 (error) |
| "report_id should end with '-plan-judges'" | Incorrect ID format for plan | Use pattern: {RUN_ID}-plan-judges for plan artifacts |
| "report_id should end with '-code-judges'" | Incorrect ID format for code | Use pattern: {RUN_ID}-code-judges for code artifacts |
| "Judge {name} has no metrics" | Empty metrics array | Each CaseScore must have ≥1 MetricStatistics entry |
| "Context preparation failed" | context-manager-for-judges failed | Check context-manager agent output; verify artifact files exist |
| "judge-input.json missing" | Orchestrator did not generate envelope | Run scripts/judge_input_mapping.py before launching judges |
| "judge-input schema invalid" | Missing required envelope fields | Re-run scripts/judge_input_mapping.py; it validates required fields: evaluation_type, task, primary_artifact, supporting_artifacts, source_of_truth, fallback_mode, metadata |
| "plan-context.json not found" | plan context manager did not produce output | Run @judges:context-manager-for-judges with artifact_type=plan; if still missing, activate one-run compatibility fallback to plan.json + prd.md |
| "Preamble file not found" | Missing common or artifact preamble .md file | Verify both skills/artifact-type-tailored-context/preambles/common_input_preamble.md and skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md exist |
| "pre-explorer unavailable" | @code:pre-explorer not installed/resolvable | Log warning and use internal fallback investigation to create investigation-log.md |
| "investigation-log.md missing after fallback" | Both pre-explorer and internal fallback failed | Log warning and continue; do not block context preparation |
| "investigation-log.md missing in code mode" | pre-explorer unavailable or generation failed during code preflight | Log warning and continue with .closedloop-ai/context/code-context.json only (non-blocking), using root code-context.json only as legacy fallback |
| "Invalid --artifact-type value" | Unsupported artifact type | Use only 'plan', 'code', 'prd', or 'feature' |
| "prd.md not found" | PRD document missing from workdir | Emit WARNING and exit gracefully (code 0); do not fail the parent workflow |
| "report_id should end with '-prd-judges'" | Incorrect ID format for prd | Use pattern: {RUN_ID}-prd-judges for PRD artifacts |
| "report_id should end with '-feature-judges'" | Incorrect ID format for feature | Use pattern: {RUN_ID}-feature-judges for Feature artifacts |
| "feature_preamble.md not found" | feature_preamble.md missing from preambles directory | Verify skills/artifact-type-tailored-context/preambles/feature_preamble.md exists; do NOT fall back to prd_preamble.md (it injects contradictory contract instructions for feature mode) |
| "Missing expected judges (feature)" | Incomplete batch execution for feature mode | Verify batch_1 launched all 3 judges: feature-completeness-judge, prd-testability-judge, prd-dependency-judge |
Error Handling Requirements
Invalid Artifact Type
If --artifact-type value is not 'plan', 'code', 'prd', or 'feature':
- Fail immediately with clear error message
- Do not attempt judge execution
- Exit with non-zero status
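If the arguments were parsed by a Python entry point (an assumption; this skill is normally driven by an orchestrating agent rather than a CLI script), argparse's choices option would give the fail-fast behavior listed above. The build_parser name and the prog value are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser mirroring the documented parameters."""
    parser = argparse.ArgumentParser(prog="run-judges")
    parser.add_argument("--workdir", default=None)
    parser.add_argument(
        "--artifact-type",
        choices=["plan", "code", "prd", "feature"],
        default="plan",  # plan is the documented default
    )
    return parser
```

On an unsupported value, argparse prints an error message naming the valid choices and exits with status 2 before any judge could be launched.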
Context Manager Timeout (Code Mode)
If context-manager-for-judges agent exceeds 5 minutes:
- Abort judge execution
- Generate error CaseScores for all 11 judges
- Each error CaseScore: final_status=3, error_reason="Timeout: context preparation exceeded 5 minutes", justification="Context preparation timeout" (see error_reason guidance above)
- Write complete report with all error CaseScores
Context Manager Timeout (Plan Mode)
If context-manager-for-judges agent exceeds 5 minutes in plan mode:
- Attempt one emergency compatibility fallback to raw plan.json + prd.md
- If fallback files are unavailable, abort plan judge execution and emit clear error
Individual Judge Failures
If a single judge Task call fails during execution:
- Do not abort the entire workflow
- Generate error CaseScore for that judge only, with final_status=3 and a populated error_reason describing the specific failure (e.g. "Task tool error: agent not found", "Parse error: response was not valid JSON") per the error_reason guidance above
- Continue with remaining judges in batch and subsequent batches
- Include error CaseScore in final aggregated report
Plan Mode Execution Flow
When --artifact-type is not specified or equals 'plan':
- Execute standard 16-judge plan logic
- Launch 4 batches with existing judge assignments
- Write to plan-judges.json (not code-judges.json)
- Launch context-manager-for-judges for plan context preparation
- Use plan-context.json as primary input; use one-run compatibility fallback only if context preparation fails
- Build and pass judge-input.json envelope to judges
- Prepend preambles to judge prompts
- Use default validation with --category plan
This is the standard plan mode flow; orchestrators must support context-manager launch, judge-input.json construction, and preamble injection. The compatibility fallback (raw plan.json + prd.md) activates only when context preparation fails (e.g., context-manager timeout), not for orchestrators that have not been updated.
PRD Mode Execution Flow
When --artifact-type prd is specified:
- Check $CLOSEDLOOP_WORKDIR/prd.md exists; emit WARNING and exit gracefully (code 0) if missing
- Do NOT launch context-manager-for-judges
- Build and schema-validate judge-input.json with scripts/judge_input_mapping.py --artifact-type prd
- Launch the 5 PRD judges in 2 sequential batches (sub_step=1: feature-completeness-judge + prd-auditor + prd-scope-judge; sub_step=2: prd-dependency-judge + prd-testability-judge) to respect the 4-concurrent-agent Task limit
- Aggregate all 5 CaseScores (sub_step=3) and write to prd-judges.json
- Validate with --category prd (sub_step=4)
Feature Mode Execution Flow
When --artifact-type feature is specified:
- Check $CLOSEDLOOP_WORKDIR/feature.md exists, or legacy $CLOSEDLOOP_WORKDIR/prd.md exists; emit sub_step=0 (context_prep, skipped=true) perf event, emit WARNING, and exit gracefully (code 0) if both are missing
- Do NOT launch context-manager-for-judges
- Build and schema-validate judge-input.json with scripts/judge_input_mapping.py --artifact-type feature
- Use feature_preamble.md for all 3 feature judges (Feature-shaped contract; do NOT substitute prd_preamble.md)
- Launch the 3 feature judges in 1 batch (sub_step=1: feature-completeness-judge + prd-testability-judge + prd-dependency-judge) to respect the 4-concurrent-agent Task limit
- Aggregate all 3 CaseScores (sub_step=2) and write to feature-judges.json
- Validate with --category feature (sub_step=3)
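The PRD and Feature flows share one numbering pattern: each batch takes the next sub_step, followed by aggregate and then validate. The sketch below encodes the batch assignments listed in the two flows and checks them against the 4-concurrent-agent limit. MODE_BATCHES and sub_steps are hypothetical names, and the tuple layout is chosen for illustration.

```python
MAX_CONCURRENT = 4  # Task tool concurrency limit

MODE_BATCHES = {
    "prd": [
        ["feature-completeness-judge", "prd-auditor", "prd-scope-judge"],
        ["prd-dependency-judge", "prd-testability-judge"],
    ],
    "feature": [
        ["feature-completeness-judge", "prd-testability-judge", "prd-dependency-judge"],
    ],
}


def sub_steps(mode: str) -> list[tuple[int, str, list[str]]]:
    """Return (sub_step, name, judges) tuples for one mode, in execution order."""
    batches = MODE_BATCHES[mode]
    for batch in batches:
        if len(batch) > MAX_CONCURRENT:
            raise ValueError(f"batch exceeds {MAX_CONCURRENT} concurrent judges")
    steps = [(i, f"batch_{i}", batch) for i, batch in enumerate(batches, start=1)]
    steps.append((len(batches) + 1, "aggregate", []))
    steps.append((len(batches) + 2, "validate", []))
    return steps
```

For prd mode this yields sub_steps 1 and 2 for the batches, 3 for aggregate, and 4 for validate; for feature mode it yields 1, 2, and 3.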