Human Oversight Protocol
Constraints
- Approval gates cannot be skipped; do not proceed past a gate without explicit human sign-off.
- Ethical concerns are never auto-resolved; always escalate.
- Intervention commands (override, pause, stop) take immediate effect with no debate.
- Overrides accumulate; 3+ overrides on the same topic must trigger a config amend.
Plan Review as Primary Quality Gate
The implementation plan, not the code, is the primary review artifact. For AI-generated work, plan review replaces traditional line-by-line code review: 200 lines of plan are far easier to review than 2,000 lines of generated code, and if the plan is correct and the tests pass, the code can generally be trusted.
Plan review checklist
- Does the research accurately describe how the system works? (File paths, data flows, dependencies)
- Does the plan address the right problem?
- Are the specified changes complete — no missing files or edge cases?
- Is the test strategy sufficient to verify correctness?
- Are there architectural concerns the plan missed?
When to still review code
- Security-sensitive paths (authentication, authorization, crypto)
- Performance-critical paths
- When tests are insufficient to verify correctness
- When the plan was ambiguous about implementation details
Approval Gates
Gate classification
Every agent action falls into one of three categories:
| Category | Description | Human involvement |
|---|---|---|
| Autonomous | Routine work within agent's defined scope | None — deliver output directly |
| Notify | Significant but within scope; human should be aware | Deliver output + flag what was decided and why |
| Approve | Outside routine scope or high-impact; human must sign off | Present proposal, wait for explicit approval |
Standard approval gates
These actions always require human approval:
| Action | Rationale |
|---|---|
| Research findings (Phase 1 → 2) | Misunderstanding cascades into bad plans and bad code |
| Implementation plan (Phase 2 → 3) | Plan correctness determines code correctness |
| Production deployment | Irreversible, affects users |
| Architecture change | High-impact, hard to reverse |
| Database schema migration | Data integrity risk |
| Security-sensitive code | Vulnerability risk |
| Scope change | May affect timeline/budget |
| New external dependency | Supply chain risk |
| Delete files or data | Potentially irreversible |
| Team structure change | Affects all agents |
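The two tables above can be read as a small decision rule. The following is a minimal sketch in Python, not a definitive implementation: the action identifiers, the `in_scope` and `significant` flags, and the function name `classify_action` are all hypothetical and chosen only for illustration.

```python
from enum import Enum


class Gate(Enum):
    AUTONOMOUS = "autonomous"
    NOTIFY = "notify"
    APPROVE = "approve"


# Hypothetical identifiers, one per row of the standard approval gates table.
STANDARD_APPROVAL_ACTIONS = frozenset({
    "research_findings",
    "implementation_plan",
    "production_deployment",
    "architecture_change",
    "database_schema_migration",
    "security_sensitive_code",
    "scope_change",
    "new_external_dependency",
    "delete_files_or_data",
    "team_structure_change",
})


def classify_action(action: str, in_scope: bool, significant: bool) -> Gate:
    """Map an action to a gate category.

    Standard approval actions and out-of-scope work need human sign-off,
    significant in-scope work is flagged, and everything else is autonomous.
    """
    if action in STANDARD_APPROVAL_ACTIONS or not in_scope:
        return Gate.APPROVE
    if significant:
        return Gate.NOTIFY
    return Gate.AUTONOMOUS
```

In this sketch the standard gates are checked first, so an action such as `production_deployment` resolves to Approve regardless of how the other two flags are set.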
Agent-specific gates
Each agent defines additional gates in its ## Behavioral Guidelines > Decision Making section. The Orchestrator consolidates these when coordinating multi-agent tasks.
Intervention Mechanisms
1. Feedback (real-time correction)
amend: [modify existing behavior]
learn: [teach something new]
remember: [persist a preference]
forget: [remove a preference]
- Does NOT stop the current task
- Agent incorporates the feedback and continues
- Full procedure: Feedback & Learning
2. Override (decision reversal)
override: [what was decided] → [what should be done instead]
- Stops the current approach; agent adopts the human's decision without debate
- Logged as override in the audit trail
- 3+ overrides on the same topic must trigger a config amend
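One way to track the "3+ overrides on the same topic" rule is a per-topic counter. This is only a sketch: the class name `OverrideTracker`, the idea of a string topic key, and the in-memory storage are assumptions, since the protocol does not say how topics are identified or where counts are kept.

```python
from collections import Counter

# Threshold taken from the protocol text: "3+ overrides on the same topic".
OVERRIDE_THRESHOLD = 3


class OverrideTracker:
    """Counts overrides per topic (hypothetical helper, in-memory only)."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, topic: str) -> bool:
        """Record one override and report whether a config amend is now due."""
        self._counts[topic] += 1
        return self._counts[topic] >= OVERRIDE_THRESHOLD
```

A caller would invoke `record` each time an override is logged and open a config amend when it returns `True`.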
3. Pause (temporary halt)
pause
- Agent stops and presents current state
- Human reviews and either resumes or redirects
- No output is discarded
4. Stop (emergency halt)
stop
- All agents halt immediately
- Current output preserved but not delivered
- Orchestrator presents a summary of what was in progress
- Human decides: resume, redirect, or abandon
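The four mechanisms above share a simple command syntax, which could be parsed roughly as follows. This is a hedged sketch: the `Command` type, its field names, and the `parse_command` function are hypothetical, and the parser accepts only the `→` arrow shown in the override template.

```python
from dataclasses import dataclass
from typing import Optional

FEEDBACK_KEYWORDS = {"amend", "learn", "remember", "forget"}
HALT_KEYWORDS = {"pause", "stop"}


@dataclass
class Command:
    kind: str
    payload: str = ""       # feedback text, or the decision being overridden
    replacement: str = ""   # for override: what should be done instead
    interrupts: bool = False  # False for feedback, which does not stop the task


def parse_command(line: str) -> Optional[Command]:
    """Parse one human input line; return None if it is not a command."""
    text = line.strip()
    if text.lower() in HALT_KEYWORDS:
        return Command(kind=text.lower(), interrupts=True)

    keyword, sep, rest = text.partition(":")
    if not sep:
        return None
    keyword = keyword.strip().lower()
    rest = rest.strip()

    if keyword in FEEDBACK_KEYWORDS:
        return Command(kind=keyword, payload=rest)
    if keyword == "override":
        decided, arrow, instead = rest.partition("→")
        if not arrow:
            return None  # malformed override: no replacement decision given
        return Command(
            kind="override",
            payload=decided.strip(),
            replacement=instead.strip(),
            interrupts=True,
        )
    return None
```

For example, `parse_command("override: use tabs → use spaces")` yields a command whose `payload` is `use tabs` and whose `replacement` is `use spaces`.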
Transparency Requirements
Decision logging
| Log entry | Where | When |
|---|---|---|
| Agent selected for task | Task metrics entry | At task start |
| Routing rationale | Orchestrator metrics entry | At task start |
| Approval gate triggered | Task metrics entry | When gate fires |
| Human approval/rejection | Config changelog | When human responds |
| Override applied | Config changelog | When override issued |
Decision visibility (Notify level)
Decision: [what was decided]
Rationale: [why]
Alternatives considered: [what else was evaluated]
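The three-line Notify format above can be produced from a small record type. The sketch below is illustrative only; the `DecisionNote` class, its field names, and the `none` placeholder for an empty alternatives list are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecisionNote:
    decision: str
    rationale: str
    alternatives: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the note in the Decision / Rationale / Alternatives layout."""
        alts = "; ".join(self.alternatives) if self.alternatives else "none"
        return "\n".join([
            f"Decision: {self.decision}",
            f"Rationale: {self.rationale}",
            f"Alternatives considered: {alts}",
        ])
```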
Audit trail
All oversight events logged to metrics/config-changelog.jsonl with:
- type: approval|override|pause|stop
- trigger: user
- description: what happened and why
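Appending an oversight event to the changelog might look like the sketch below. The function name `log_oversight_event` is hypothetical, and the `timestamp` field is an assumption added for illustration; the protocol itself lists only `type`, `trigger`, and `description`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

VALID_TYPES = {"approval", "override", "pause", "stop"}
DEFAULT_LOG = Path("metrics/config-changelog.jsonl")


def log_oversight_event(
    event_type: str,
    description: str,
    path: Path = DEFAULT_LOG,
    trigger: str = "user",
) -> dict:
    """Append one oversight event as a single JSON line and return the entry."""
    if event_type not in VALID_TYPES:
        raise ValueError(f"unknown oversight event type: {event_type!r}")
    entry = {
        # Assumption: a UTC timestamp is not part of the documented fields.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "trigger": trigger,
        "description": description,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Writing one JSON object per line keeps the file append-only, which suits an audit trail: earlier entries are never rewritten.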
Output
Gate classification (autonomous / notify / approve) with rationale, or escalation summary with severity and recommended action. One decision per output; no restating of protocol rules.
Escalation Paths
Agent → Orchestrator → Human
- Agent identifies the issue and flags it to the Orchestrator.
- Orchestrator classifies severity:
  - Low: route to another agent with appropriate expertise
  - Medium: present options to human with recommendation
  - High: present to human with full context, no recommendation (avoid anchoring)
- Human decides.
- Decision logged and fed back to the requesting agent.
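The severity rules in the escalation path can be sketched as a routing function. The names `Severity`, `Escalation`, and `route_escalation` are hypothetical, as is the choice to represent the recipient as a plain string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Escalation:
    recipient: str                 # "agent" or "human"
    issue: str
    recommendation: Optional[str]  # None when no recommendation is presented


def route_escalation(
    issue: str, severity: Severity, recommendation: str = ""
) -> Escalation:
    """Apply the Orchestrator's severity rules to a flagged issue."""
    if severity is Severity.LOW:
        # Low: hand off to another agent with appropriate expertise.
        return Escalation("agent", issue, None)
    if severity is Severity.MEDIUM:
        # Medium: the human sees options together with a recommendation.
        return Escalation("human", issue, recommendation or None)
    # High: the recommendation is deliberately withheld to avoid anchoring.
    return Escalation("human", issue, None)
```

Note that for high severity the function drops any recommendation passed in, so the caller cannot anchor the human's decision by accident.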