Quality Gate Pipeline
Overview
Single quality gate that every agent passes through before output is accepted. Replaces three formerly separate skills (accuracy-validation, verification-before-completion, task-review-correction) with a unified pipeline of three phases that run in sequence.
Constraints
- Do not deliver output containing unverified claims; pause and verify first
- Do not claim completion without fresh verification evidence from this session
- Do not reference test results or tool output from earlier in the conversation — re-run and show current output
- Do not substitute reasoning or explanation for actual evidence
- Max 3 review-correction cycles before escalating to the Orchestrator
- Each correction cycle must reduce total defect count; flat or increasing defects trigger escalation
The Three Phases
Phase 1: Self-Validation (before delivery)
Every agent runs this checklist mentally before presenting output.
Factual Accuracy
- All file paths referenced actually exist (verify with tool, don't assume)
- All function/class/variable names match what's in the codebase
- Version numbers, API signatures, and config values are verified, not recalled from training
- No statistics or citations are fabricated
Instruction Fidelity
- Output addresses what the user actually asked, not a reinterpretation
- All acceptance criteria from the task are met
- No scope creep beyond the request
- Constraints from the agent persona are respected
Internal Consistency
- No contradictions within the output
- Code samples compile/run conceptually (correct syntax, valid imports)
- Referenced earlier decisions are accurately recalled (if unsure, re-read from memory/)
Confidence Assessment
| Confidence | Meaning | Action |
|---|---|---|
| High | Verified via tool output or direct file read | Deliver as-is |
| Medium | Inferred from context, not directly verified | Flag with caveat |
| Low | Recalled from training or guessed | Verify before delivering, or mark as unverified |
Hallucination Detection Signals
Strong signals (likely hallucination):
- Referencing a file, function, or API that was never read in this session
- Quoting specific numbers without a source
- Describing behavior that contradicts tool observations
- Generating imports for packages not in dependencies
When a signal fires: Pause → Verify (use tools) → Correct → Log (hallucination_detected: true in metrics)
Phase 2: Verification Evidence (before completion claims)
Iron Law: No completion claims without fresh verification evidence. Skipping any step is falsification, not verification.
The Gate Function:
- IDENTIFY — What command proves your claim? (e.g.,
npm test,cargo build) - RUN — Execute the command fresh and completely. Not from cache, not from memory.
- READ — Read the complete output and exit code. Don't skim.
- VERIFY — Does the output actually confirm your claim? If not, report actual status.
- ONLY THEN — Make the claim, with supporting evidence pasted.
Required Evidence (all tasks):
- Tests pass: Run the test suite. Paste output with pass/fail counts.
- Build succeeds: If the project has a build step, run it. Paste output.
- Lint clean: If the project has a linter, run it. Paste output.
- No regressions: Test count should not decrease.
Additional Evidence by task type:
| Task Type | Additional Evidence |
|---|---|
| Bug fix | Red-green cycle: failing test → passing test |
| New feature | Feature working via test output or demo command |
| Refactor | Same test count, same pass count |
| Config change | Config loads without error |
| Documentation | Commands/code blocks in doc actually work |
| Agent work | Inspect VCS diff independently — don't trust self-report |
Evidence Format:
## Verification
- Tests: `npm test` → 47 passed, 0 failed (output below)
- Build: `npm run build` → success, 0 warnings
- Lint: `npm run lint` → 0 errors, 0 warnings
Red Flag Language — stop and verify when you catch yourself saying:
- "should work now" / "should be fixed" / "probably" / "I believe"
- Expressing satisfaction before running verification
- Preparing commits without verification output
Phase 3: Review-Correction Loop (post-delivery rework)
Activated when output is returned for rework, during peer review, or when self-reviewing before delivery (complements Phase 1).
Defect Severity:
| Severity | Definition | Required Action |
|---|---|---|
| Critical | Wrong, breaks functionality, contradicts requirements | Immediate correction, block delivery |
| Major | Significant gap in completeness or correctness | Correct before acceptance |
| Minor | Small inaccuracy, suboptimal approach | Correct if time permits |
| Cosmetic | Formatting, naming, style | Bundle with next change |
Correction Scope:
- Isolated fix: Self-contained; fix doesn't affect other outputs
- Cascading fix: Implies other outputs may also be wrong; verify related work
- Rework: Fundamental approach flawed; redo from requirements
Review Checklist:
- Requirements compliance — all acceptance criteria addressed, no silent scope changes
- Correctness — intended result, edge cases handled, no regressions
- Completeness — all files touched, no TODOs remain, integration points addressed
- Consistency — matches project conventions, no contradictions
- Quality — appropriately simple, readable, sufficient documentation
Iteration Rules:
- Max 3 review-correction cycles before escalation
- Each cycle must reduce total defect count
- If defects increase or stay flat after 2 cycles, escalate to Orchestrator
- Exit criteria: all critical and major defects resolved; minor defects logged
Escalation: Summarize defect pattern and attempted corrections → escalate to Orchestrator for re-routing → log with escalation_reason in task metrics.
When to Apply Each Phase
| Situation | Phases to run |
|---|---|
| Initial development, about to deliver | Phase 1 → Phase 2 |
| Claiming task completion | Phase 2 (at minimum) |
| Output returned for rework | Phase 3 → Phase 2 |
| Peer-reviewing another agent's output | Phase 1 (as reviewer) → Phase 3 if defects found |
| Trivial one-line fix | Phase 2 only (verify it works) |
Output
Phase 1: Confidence-scored validation — report failures only. Phase 2: Verification evidence block with tool output. Phase 3: Defect table (severity, scope, status: fixed/deferred/escalated).
Integration
- Performance Metrics: Log
hallucination_detected,rework_cycles,defects_foundon task completion - Context Summarization: When context is high, increase Phase 1 rigor
- Human Oversight Protocol: Escalation from Phase 3 feeds into the approval gate system