CI Debugging
Overview
CI failures are diagnosed systematically, not by guessing. This skill prevents the common anti-pattern of "just re-run it" by enforcing a hypothesis-first approach.
Constraints
- Never suggest "retry the build" as a first action
- Never add retry logic to "fix" a flaky test
- Never push a speculative fix without reproducing locally first
- Always form a hypothesis before making changes
Process
1. Read the failure
Read the full CI log output. Identify:
- What failed: test, lint, typecheck, build, deploy, or infrastructure
- When it started failing: this commit, recent merge, or intermittent
- Error message: exact error text, not a summary
2. Form a hypothesis
Before touching any code, state a hypothesis: "I believe X is failing because Y, and I can verify by Z."
Good hypotheses are specific and falsifiable:
- "The test fails because the database migration added a NOT NULL column and the test fixture doesn't set it"
- "The build fails because Node 20 dropped support for this API and CI upgraded last week"
Bad hypotheses:
- "Something is wrong with the tests"
- "CI is flaky"
3. Environment delta analysis
Compare local vs CI environments using this checklist:
| Factor | Check |
|---|---|
| Runtime version | Node, Python, Go, JDK, Ruby, .NET — exact version match? |
| OS and architecture | macOS vs Linux, x86 vs ARM? |
| Dependency resolution | Lockfile committed? Registry differences? Private packages accessible? |
| Environment variables | All required vars set? Secrets populated? |
| Parallelism | Tests run in parallel in CI but serial locally? Shared state? |
| Memory/CPU | CI runner constrained? OOM killer? |
| Network | External service dependencies? DNS resolution? Firewall rules? |
| Filesystem | Case sensitivity (macOS vs Linux)? Temp directory permissions? Path length? |
| Timezone/locale | Date formatting tests? Locale-dependent string comparisons? |
| Docker | Image tag latest drifted? Layer cache stale? Build context different? |
4. Reproduce locally
Before fixing, reproduce the failure locally:
- Match the CI environment as closely as possible (same Node version, same env vars)
- Run the exact failing command from the CI log
- If it passes locally, the delta from step 3 is the cause
5. Fix and verify
Fix the root cause. Verify the fix locally. Push and confirm CI passes.
Anti-Patterns
| Anti-pattern | Why it's wrong | What to do instead |
|---|---|---|
| Retry the build | Masks the root cause, wastes CI minutes | Read the logs, form a hypothesis |
| Add retry/sleep to test | Hides race conditions, makes tests slow | Fix the shared state or ordering issue |
| Skip the failing test | Reduces coverage, hides regressions | Fix the test or the code |
| Push a speculative fix | Pollutes git history, might not work | Reproduce locally first |
| Blame "flaky CI" | Normalizes unreliable pipelines | Every failure has a cause — find it |
Integration
- Platform Engineer agent: Collaborates on infrastructure-level CI issues
- QA Engineer agent: Collaborates on test isolation and fixture problems
- Systematic Debugging skill: Uses the same hypothesis-first methodology