Test-Driven Development
Overview
Enforces strict RED-GREEN-REFACTOR discipline with verifiable gates. LLMs are especially prone to skipping tests or writing them after implementation — this skill exists because that tendency produces code that looks tested but isn't actually validated.
Iron Law
No production code without a failing test first. If you didn't watch the test fail, you don't know if it tests the right thing. Code written before tests must be deleted and reimplemented from the test — no exceptions.
Constraints
- Do not write implementation code without a failing test first
- Do not move to the next unit of work until all tests pass
- Do not skip the refactor step — it's where design quality happens
- Do not rationalize exceptions to the cycle (see Rationalization Prevention below)
- Do not use mocks when real code is feasible — mocks test your assumptions, not your code
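As a sketch of the last constraint, a test can exercise real file I/O through a temporary directory instead of mocking `open()`. The function `count_lines` and the sample content are hypothetical, chosen only for illustration. The standard-library `unittest` runner is also an assumption; this document does not prescribe a framework.

```python
import tempfile
import unittest
from pathlib import Path


def count_lines(path: Path) -> int:
    """Hypothetical unit under test: count the non-empty lines in a file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


class CountLinesTest(unittest.TestCase):
    def test_blank_lines_are_ignored(self):
        # A real temporary file exercises the real I/O path.
        # Nothing is mocked, so the test checks the code, not an assumption.
        with tempfile.TemporaryDirectory() as tmp:
            sample = Path(tmp) / "sample.txt"
            sample.write_text("a\n\nb\n", encoding="utf-8")
            self.assertEqual(count_lines(sample), 2)
```

Saved as a module, this can be run with `python -m unittest`. A mocked `open()` would pass even if the encoding or line handling were wrong, because the mock returns whatever the test author assumed.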
The Cycle
Each unit of work follows three phases with hard gates between them:
1. RED — Write a failing test
- Write the smallest test that describes the next behavior
- Use real code rather than mocks whenever feasible
- Run the test suite — the new test must fail
- Hard gate: paste the failing test output. No output = no proceeding.
- Verify the failure is for the expected reason (missing feature, not a typo or import error)
- If the test passes without new code, the behavior already exists — pick a different test
2. GREEN — Make it pass
- Write the minimum implementation to make the failing test pass
- Run the test suite — all tests must pass with no errors or warnings
- Hard gate: paste the passing test output. No output = no proceeding.
- Do not add behavior beyond what the test requires
- Do not refactor yet
3. REFACTOR — Clean up
- Improve structure, naming, duplication — without changing behavior
- Run the test suite — all tests must still pass
- If tests break during refactor, undo and try a smaller change
Then return to RED for the next behavior.
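One pass through the cycle might look like the sketch below, shown in its state at the end of GREEN. The function `slugify` and its single behavior are hypothetical, and the standard-library `unittest` runner is an assumption rather than a requirement of this skill.

```python
import unittest


def slugify(title: str) -> str:
    """Hypothetical unit under test: turn a title into a URL slug."""
    # GREEN: the minimum code that satisfies the one test below.
    # During RED this body was `raise NotImplementedError`, so the test
    # failed because the feature was missing, not because of a typo
    # or an import error.
    return title.lower().replace(" ", "-")


class SlugifyTest(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        # RED: written first and watched failing against the stub.
        self.assertEqual(slugify("Hello World"), "hello-world")
```

Running `python -m unittest` against the stub produces the failing output for the RED gate, and running it against this version produces the passing output for the GREEN gate. Trimming whitespace or stripping punctuation would each need their own failing test first. Any REFACTOR step reruns the same suite unchanged.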
Rationalization Prevention
LLMs generate plausible excuses for skipping TDD. These are the common ones and why they're wrong:
| Excuse | Reality |
|---|---|
| "I'll add tests after the implementation" | You won't. And if you do, you'll write tests that pass by definition — they test what you wrote, not what should work. |
| "This is too simple to test" | Simple code breaks too. Testing takes 30 seconds. The one-line change that caused the most expensive bug looked simple too. |
| "Writing the test first would be slower" | TDD is faster than debugging. It catches errors at the cheapest possible moment. |
| "I need to see the implementation shape first" | That's called a spike. Do the spike, throw it away, then TDD the real implementation. |
| "The test framework isn't set up yet" | Set it up. That's the first task, not a reason to skip testing. |
| "I'm just refactoring, not adding behavior" | Then existing tests should pass throughout. If there are no existing tests, write characterization tests first. |
| "This is glue code / config / boilerplate" | Glue code that breaks takes down the system. If it can break, it needs a test. |
| "I already tested it manually" | Manual testing is neither systematic nor re-runnable. It misses edge cases, and you have to repeat it after every change. |
| "Deleting my existing code is wasteful" | Sunk cost fallacy. Unverified code is technical debt, not an asset. |
| "Let me keep my code as a reference and write tests first" | You'll adapt it instead of TDD-ing. That becomes testing-after with extra steps. |
| "The test is hard to write — I'll come back to it" | Hard-to-test code is hard-to-use code. The test is telling you the design needs work. Listen to it. |
| "TDD slows me down / I'm being pragmatic" | TDD is the pragmatic choice: debugging costs more than writing the test first. |
If you catch yourself composing an excuse not on this list, it's still an excuse. Write the test first.
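For the "I'm just refactoring" row in the table above, a characterization test could be sketched as follows. The function `legacy_shipping_cost` and its pricing rule are hypothetical stand-ins for untested code that already exists. The expected values are assumed to have been recorded by running that code, not taken from a specification.

```python
import unittest


def legacy_shipping_cost(weight_kg: float) -> float:
    """Hypothetical untested legacy function that is about to be refactored."""
    if weight_kg <= 1:
        return 5.0
    return 5.0 + (weight_kg - 1) * 2.5


class ShippingCostCharacterization(unittest.TestCase):
    def test_pins_current_outputs(self):
        # These values document what the code does today, right or wrong.
        # They were observed from the existing implementation.
        observed = {0.5: 5.0, 1: 5.0, 3: 10.0}
        for weight, cost in observed.items():
            with self.subTest(weight=weight):
                self.assertEqual(legacy_shipping_cost(weight), cost)
```

With this test in place and passing, a refactor that changes any pinned output fails immediately. That is the safety net the table row asks for.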
Red Flags Requiring Restart
Stop immediately and restart from RED if you notice:
- Writing implementation code before tests
- Adding tests after implementation
- Tests passing immediately without new implementation (testing existing behavior)
- Inability to explain why a test failed
- Tests deferred to "later"
- Any rationalization beginning with "just this once"
- Claims of manual testing standing in for automated verification
- "Keep as reference" or "adapt existing code" language
- Sunk cost justifications for keeping pre-test code
Response: Delete the code written without tests. Start over with RED.
Verification Checklist
Before completing a unit of work:
- Every new function/method has a test
- Each test was watched failing before implementation
- Each failure occurred for the expected reason (missing feature, not a typo or import error)
- Minimal code written to pass each test
- All tests passing with clean output (no errors, no warnings)
- Tests use real code (mocks only when unavoidable)
- Edge cases and error conditions covered
Any unchecked item means TDD was skipped. Restart from RED.
Exception Permissions
Ask your human partner before skipping TDD for:
- Throwaway prototypes (spike-and-discard)
- Generated code (scaffolding tools, codegen output)
- Configuration files with no behavioral logic
Even with permission, document the exception.
Anti-Pattern: Horizontal Slicing
DO NOT write all tests first, then all implementation. This is "horizontal slicing" — treating RED as "write all tests" and GREEN as "write all code."
This produces bad tests:
- Tests written in bulk test imagined behavior, not actual behavior
- You end up testing the shape of things (data structures, signatures) rather than user-facing behavior
- Tests stop tracking real behavior: they pass when behavior breaks and fail when behavior is fine
- You outrun your headlights, committing to test structure before understanding the implementation
Correct approach: vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle.
WRONG (horizontal):
RED: test1, test2, test3, test4, test5
GREEN: impl1, impl2, impl3, impl4, impl5
RIGHT (vertical / tracer bullet):
RED→GREEN: test1→impl1
RED→GREEN: test2→impl2
RED→GREEN: test3→impl3
The first vertical slice is the tracer bullet — it proves the path works end-to-end before you invest in breadth. If the tracer bullet reveals a bad design assumption, you've wasted one cycle, not five.
Integration with Phases
- Phase 2 (Plan): Test strategy is part of the plan — identify what tests will be written for each unit
- Phase 3 (Implement): Every unit of work follows RED-GREEN-REFACTOR. The inline review checkpoint runs after GREEN, not during RED.
- Acceptance tests: Feature file scenarios (Gherkin) define the outer loop. TDD operates within each scenario's implementation.
Output
Verified RED-GREEN-REFACTOR cycle evidence: failing test output, passing test output, and refactored code with passing tests for each unit of work.