Test Design Reviewer

Overview

Evaluates test quality using the 8 properties of good tests as described by Andrea Laforgia, based on Dave Farley's testing principles. Produces a quantitative "Farley Score" that teams can track over time.

Attribution: Andrea Laforgia / Dave Farley — properties of good automated tests.

The 8 Properties

Each property is scored from 1 (worst) to 10 (best):

#	Property	Weight	Description
1	Understandable	1.5	Can a new team member read the test and understand what behavior is verified? Clear names, obvious arrange-act-assert structure, no hidden setup.
2	Maintainable	1.5	Can the test be updated without deep knowledge of the implementation? Minimal coupling to internals, no fragile selectors, uses abstractions (page objects, builders).
3	Repeatable	1.2	Does it produce the same result every time? No time-dependence, no external service calls, no shared mutable state, deterministic data.
4	Atomic	1.0	Does it test exactly one behavior? Single assertion concept (multiple asserts on one object are fine), no test interdependency, independent setup/teardown.
5	Necessary	1.0	Does it verify behavior that matters? Not testing framework code, not duplicating another test, covers a real scenario or edge case.
6	Granular	1.0	Does it fail with a clear, specific message? Pinpoints the failure location, doesn't require debugging to understand what broke.
7	Fast	0.8	Does it run quickly enough for the feedback loop? Unit tests <100ms, integration tests <5s, E2E tests <30s.
8	First	1.0	Was it written before or alongside the implementation (TDD)? Evidence: test commit predates implementation, test names describe behavior not implementation.
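
As one illustration of the Repeatable row, the sketch below shows a test whose result cannot depend on the wall clock, because the current time is passed in as an argument instead of being read inside the function. The function `is_expired` and the test name are hypothetical, invented only for this example. The layout also follows the arrange-act-assert structure named in the Understandable row.

```python
from datetime import datetime, timedelta, timezone


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Hypothetical function under test: the caller supplies the current time."""
    return now >= expires_at


def test_token_is_expired_once_expiry_time_has_passed() -> None:
    # Arrange: fixed timestamps, so the outcome is the same on every run.
    expires_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    one_second_later = expires_at + timedelta(seconds=1)

    # Act
    expired = is_expired(expires_at, now=one_second_later)

    # Assert: a single behavior, with a message that says what broke.
    assert expired, "token should be expired one second after its expiry time"
```

A version of `is_expired` that read the system clock internally would pass or fail depending on when the suite runs, which is the kind of time-dependence the Repeatable property penalizes.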

Scoring

Per-test score

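Farley Score = Σ(property_score × weight) / Σ(weight)

Multiply each property's score by its weight, add the eight products together, then divide by the sum of the weights.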

The weights sum to 9.0. Because every property is scored from 1 to 10, the Farley Score ranges from 1.0 to 10.0.
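
The weighted average can be sketched as follows. The weights are copied from the property table. The names `WEIGHTS`, `farley_score`, and `example_scores`, as well as the sample scores themselves, are assumptions made for illustration.

```python
# Weights copied from the property table.
WEIGHTS = {
    "Understandable": 1.5,
    "Maintainable": 1.5,
    "Repeatable": 1.2,
    "Atomic": 1.0,
    "Necessary": 1.0,
    "Granular": 1.0,
    "Fast": 0.8,
    "First": 1.0,
}


def farley_score(property_scores: dict[str, float]) -> float:
    """Return the weighted average of the eight property scores (each 1-10)."""
    missing = WEIGHTS.keys() - property_scores.keys()
    if missing:
        raise ValueError(f"missing property scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= property_scores[name] <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    weighted_sum = sum(property_scores[name] * weight for name, weight in WEIGHTS.items())
    return weighted_sum / sum(WEIGHTS.values())


# Illustrative scores for a single test (made-up values).
example_scores = {
    "Understandable": 8,
    "Maintainable": 6,
    "Repeatable": 9,
    "Atomic": 7,
    "Necessary": 8,
    "Granular": 7,
    "Fast": 10,
    "First": 5,
}
print(round(farley_score(example_scores), 1))  # 66.8 / 9.0, prints 7.4
```

Worked by hand, the weighted sum for the sample is 12 + 9 + 10.8 + 7 + 8 + 7 + 8 + 5 = 66.8, and 66.8 / 9.0 ≈ 7.4.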

Score interpretation

Range	Rating	Action
9.0 - 10.0	Exemplary	Reference test — share as an example
7.0 - 8.9	Good	Minor improvements possible
5.0 - 6.9	Adequate	Specific improvements recommended
3.0 - 4.9	Poor	Significant rework needed
< 3.0	Critical	Test provides false confidence — fix or delete
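
A minimal sketch of the band lookup. The function name `rating` is hypothetical. It also assumes that scores are rounded to one decimal place before the bands are applied, which is what the one-decimal boundaries in the table suggest but do not state.

```python
def rating(score: float) -> str:
    """Map a Farley Score to a rating band from the interpretation table."""
    # Assumption: the bands apply to the score rounded to one decimal place.
    rounded = round(score, 1)
    if rounded >= 9.0:
        return "Exemplary"
    if rounded >= 7.0:
        return "Good"
    if rounded >= 5.0:
        return "Adequate"
    if rounded >= 3.0:
        return "Poor"
    return "Critical"


print(rating(7.4))  # prints Good
print(rating(4.2))  # prints Poor
```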

Suite-level score

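The suite score is the arithmetic mean of the per-test scores. Report it together with the distribution: the number of tests in each rating band (Exemplary, Good, Adequate, Poor, Critical).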
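
The aggregation could look like the sketch below. `suite_summary`, `rating`, and the sample scores are hypothetical names and values. As an assumption, scores are rounded to one decimal place before a rating band is assigned.

```python
from collections import Counter
from statistics import mean

RATINGS = ["Exemplary", "Good", "Adequate", "Poor", "Critical"]


def rating(score: float) -> str:
    """Map a score to its band (assumes one-decimal rounding first)."""
    rounded = round(score, 1)
    for floor, label in ((9.0, "Exemplary"), (7.0, "Good"), (5.0, "Adequate"), (3.0, "Poor")):
        if rounded >= floor:
            return label
    return "Critical"


def suite_summary(per_test_scores: list[float]) -> tuple[float, dict[str, int]]:
    """Return the mean score and the number of tests in each rating band."""
    if not per_test_scores:
        raise ValueError("no tests were scored")
    counts = Counter(rating(score) for score in per_test_scores)
    # Counter returns 0 for bands with no tests, so every band is reported.
    distribution = {label: counts[label] for label in RATINGS}
    return mean(per_test_scores), distribution


suite_score, distribution = suite_summary([9.2, 7.5, 5.1, 4.2])
print(round(suite_score, 1), rating(suite_score))  # prints 6.5 Adequate
print(distribution)
```

Reporting the distribution next to the mean matters because a single average can hide a few Poor or Critical tests inside an otherwise healthy suite.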

Output Format

## Test Quality Report — Farley Score

**Suite**: `path/to/tests/`
**Tests scored**: 12
**Suite score**: 7.4 (Good)

### Distribution
- Exemplary (9+): 2
- Good (7-8.9): 6
- Adequate (5-6.9): 3
- Poor (3-4.9): 1

### Top Issues
1. **Maintainable** (avg 5.2): 4 tests coupled to implementation details; use behavior-based assertions
2. **Repeatable** (avg 6.0): 2 tests use `Date.now()`; inject the time dependency
3. **First** (avg 6.5): Test names describe implementation ("calls handleSubmit") not behavior ("submits form data")

### Per-Test Scores (lowest first)
| Test | Score | Weakest Property | Suggestion |
|------|-------|-----------------|------------|
| `should call the API` | 4.2 | Understandable (2) | Rename to describe behavior, not mechanism |
| `test edge case` | 5.1 | Necessary (3) | Unclear what edge case — specify the condition |

Integration

test-review agent: Checks coverage and assertion quality; this skill adds quantitative scoring
QA Engineer agent: Uses Farley Score in quality reports
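Mutation testing skill: Farley Score complements mutation score. A high Farley Score combined with a low mutation score suggests the tests are well designed but their assertions are too weak to catch injected faults.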