Experiment Decision Framework: When to A/B Test vs Ship
Quick Start
/experiment-decision
Then provide:
- What you're considering building (feature, change, or experiment)
- Expected impact (metric + estimated improvement)
- Your concern (is this risky? reversible? controversial?)
I'll walk you through the decision tree: reversibility, hypothesis strength, detectable impact, and risk level. You'll get a clear recommendation: A/B test, ship + monitor, or just ship.
Output: Decision documented inline or saved to thoughts/shared/product/decisions/
Time: ~5 min for clear-cut cases, ~15 min for nuanced decisions
When to use: Before building any feature, when stakeholders demand "data-driven" decisions, or when unsure if testing is worth the effort
Framework source: Aakash Gupta's "When to A/B Test vs Just Ship"
The Decision Framework
Use this decision tree:
Question 1: Is it reversible?
If YES → Ship it
- CSS changes
- Messaging tweaks
- UI polish
- Non-destructive features
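Why: Reversible changes are low risk. Ship, monitor, and roll back if needed. High-risk reversible changes (for example, anything touching revenue) are the exception: the Decision Matrix recommends an A/B test or a kill switch for those.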
If NO → Continue to Question 2
Question 2: Do you have a hypothesis with measurable impact?
If NO → Don't test
- Building "nice to haves"
- No clear success metric
- Can't measure the outcome
Why: Testing without a hypothesis is wasteful. Either clarify the hypothesis or don't build it.
If YES → Continue to Question 3
Question 3: Is the expected impact large enough to detect?
Run a power calculation:
Minimum Detectable Effect (MDE) = Effect you need to see to justify the work
If your feature is expected to improve conversion by 0.5%, but you need 10M users to detect it → Don't test, just ship and monitor
If impact is too small to detect → Ship without test
If impact is detectable → Continue to Question 4
Question 4: Is the risk of being wrong high?
High risk scenarios:
- Affects revenue directly (pricing, checkout)
- Impacts core user experience (onboarding, core flows)
- Controversial decision (stakeholder disagreement)
- Large engineering investment
If HIGH risk → A/B test
If LOW risk → Ship without test
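The four questions above can be sketched as a single function. This is a minimal illustration rather than part of the original framework: the name `recommend` and its boolean parameters are hypothetical, and each answer is assumed to have been judged by the team beforehand.

```python
def recommend(reversible: bool, has_hypothesis: bool,
              impact_detectable: bool, high_risk: bool) -> str:
    """Walk the four questions in order and return the first decision reached."""
    if reversible:               # Question 1: reversible changes ship
        return "Ship it"
    if not has_hypothesis:       # Question 2: no hypothesis, nothing to test
        return "Don't test"
    if not impact_detectable:    # Question 3: effect too small to detect
        return "Ship without test"
    # Question 4: only high-risk changes earn a test
    return "A/B test" if high_risk else "Ship without test"


# Example: an irreversible pricing change with a clear, detectable hypothesis.
print(recommend(reversible=False, has_hypothesis=True,
                impact_detectable=True, high_risk=True))  # prints: A/B test
```

Note that the tree returns early on reversibility, while the Decision Matrix treats high-risk reversible changes more cautiously.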
Decision Matrix
| Risk Level | Impact Size | Reversible? | Decision |
|---|---|---|---|
| High | Large | No | A/B Test |
| High | Large | Yes | A/B Test (or ship with kill switch) |
| High | Small | No | Don't build |
| High | Small | Yes | Ship + Monitor |
| Low | Large | No | Ship + Monitor |
| Low | Large | Yes | Just Ship |
| Low | Small | No | Just Ship |
| Low | Small | Yes | Just Ship |
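If you want the matrix in a form a script can query, a plain dictionary lookup is enough. The sketch below copies the eight rows verbatim; the names `DECISION_MATRIX` and `lookup_decision` are hypothetical.

```python
# Keys are (risk, impact, reversible); values are copied from the matrix above.
DECISION_MATRIX = {
    ("high", "large", False): "A/B Test",
    ("high", "large", True): "A/B Test (or ship with kill switch)",
    ("high", "small", False): "Don't build",
    ("high", "small", True): "Ship + Monitor",
    ("low", "large", False): "Ship + Monitor",
    ("low", "large", True): "Just Ship",
    ("low", "small", False): "Just Ship",
    ("low", "small", True): "Just Ship",
}


def lookup_decision(risk: str, impact: str, reversible: bool) -> str:
    """Return the matrix decision; raises KeyError for an unknown combination."""
    return DECISION_MATRIX[(risk.lower(), impact.lower(), reversible)]


print(lookup_decision("High", "Small", True))  # prints: Ship + Monitor
```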
When to A/B Test
✅ Test When:
1. High-stakes decisions
- Pricing changes
- Checkout flow modifications
- Core product changes
- Revenue-impacting features
2. Controversial hypotheses
- Team is divided on approach
- Stakeholders disagree
- User research is conflicting
3. Long-term bets
- Features that are expensive to reverse
- Architectural decisions
- Platform changes
4. Optimization work
- Conversion rate improvements
- Engagement optimization
- Retention experiments
When to Just Ship
✅ Ship When:
1. Fast iteration needed
- Competitive pressure
- Time-sensitive opportunities
- Market windows closing
2. Low risk, high certainty
- Bug fixes
- Obvious improvements
- User-requested features (with clear demand)
3. Qualitative insights are strong
- Clear user pain validated through research
- Competitive parity features
- Accessibility improvements
4. Testing would take too long
- Small user base (can't reach significance)
- Slow conversion cycles (months to convert)
- Complex setup (weeks to build test infrastructure)
The Cost of A/B Testing
Time costs:
- Engineering: 2-4 weeks to build test infrastructure
- Analysis: 1-2 weeks to run experiment + analyze
- Total: 3-6 weeks delay
Engineering costs:
- Feature flagging system
- Analytics instrumentation
- A/A test validation
- Test maintenance
Opportunity costs:
- Could have shipped 3-5 other features
- Delayed value delivery to users
- Competitors may ship first
When testing costs exceed value → Just ship
Real-World Examples
Example 1: Amazon's "Add to Cart" Button Color
Decision: A/B Test
- High traffic (millions of users)
- Direct revenue impact
- Easy to detect small improvements
- Result: +2% conversion = $100M+ annually
Example 2: Slack's Message Threading
Decision: Just Ship
- Highly requested feature
- Strong qualitative signal from users
- Reversible (users can ignore threads)
- Result: Successful launch, became core feature
Example 3: Netflix's "Are you still watching?" prompt
Decision: A/B Test
- Controversial (could annoy users)
- Impact on engagement unclear
- Risk of hurting retention
- Result: Test showed improved engagement (prevented zombie sessions)
Common Mistakes
❌ Testing everything "to be data-driven"
- Problem: Slows down velocity
- Fix: Reserve tests for high-stakes decisions
❌ Shipping without monitoring
- Problem: Bad changes go unnoticed
- Fix: Ship with dashboards and alerts
❌ Running underpowered tests
- Problem: Waste time on inconclusive results
- Fix: Calculate sample size before starting
❌ Testing when qualitative data is clear
- Problem: Delays obvious improvements
- Fix: Trust strong user research signals
Quick Reference Checklist
Before building any feature, ask:
- Is this reversible? (If yes → ship)
- Do I have a clear hypothesis? (If no → clarify it or don't build)
- Can I measure the impact? (If no → don't test)
- Is the expected impact large enough to detect? (Power calculation)
- What's the risk of being wrong? (High risk → test)
- What's the cost of testing vs shipping? (ROI check)
- Do I have strong qualitative data? (If yes → consider shipping)
Statistical Power Guidance
Before committing to an A/B test, estimate whether you have enough traffic to detect a meaningful difference.
Power Calculation Essentials
Three inputs you need:
1. Minimum Detectable Effect (MDE) -- what's the smallest improvement worth detecting?
- For checkout conversion: 1-2% relative change matters (high revenue impact)
- For feature adoption: 5-10% relative change is typical MDE
- For engagement metrics: 3-5% relative change is reasonable
2. Baseline conversion rate -- what's the current rate you're trying to improve?
- Lower baselines need more users to detect the same relative change (each visitor yields fewer conversions)
- Higher baselines need fewer users for the same relative change
3. Daily traffic to the experiment -- how many users will enter the test per day?
Rule of Thumb
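For a two-variant test at 80% power and 95% confidence (two-sided), you need approximately 16 × p × (1 - p) / d² users per variant, where p is the baseline rate and d is the absolute change you want to detect (baseline × relative MDE). At low baseline rates that works out to roughly 1,600 conversions per variant for a 10% relative change and roughly 6,400 for a 5% relative change.
| Baseline Rate | MDE (Relative) | Users Needed Per Variant | Conversions Per Variant | Days needed at 1K daily visitors (split across both variants) |
|---|---|---|---|---|
| 50% | 5% | ~6,400 | ~3,200 | ~13 days |
| 20% | 5% | ~25,600 | ~5,100 | ~51 days |
| 5% | 10% | ~30,400 | ~1,500 | ~61 days |
| 2% | 10% | ~78,400 | ~1,600 | ~157 days |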
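The baseline rate and MDE feed a standard normal-approximation formula for comparing two proportions. The sketch below uses only the Python standard library. The function name `users_per_variant` is hypothetical, and the formula is the textbook approximation, not necessarily the one used to build the table above.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline: float, relative_mde: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate if the MDE is achieved
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 50% baseline, 5% relative MDE: roughly 6,300 users per variant.
n = users_per_variant(baseline=0.50, relative_mde=0.05)
print(n, "users per variant,", round(n * 0.50), "baseline conversions")
```

Multiply the result by the number of variants and divide by daily traffic to estimate runtime.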
When Traffic Is Too Low
If your power calculation shows the test would take longer than 4-6 weeks:
- Accept a larger MDE -- only test if you expect a big swing (15%+ improvement)
- Use a composite metric -- combine multiple success signals into one metric for higher sensitivity
- Run a qualitative test -- 5-10 user tests instead of a statistical A/B test
- Just ship and monitor -- launch with clear success criteria, compare before/after with caveats
- Use Bayesian methods -- they report the probability that a variant is better instead of a p-value, which can be easier to act on with small samples (they do not add statistical power)
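The 4-6 week threshold above can be checked with simple arithmetic once you have a per-variant sample size. This is a minimal sketch: the function names are hypothetical, traffic is assumed to split evenly across variants, and 42 days stands in for the 6-week upper bound.

```python
from math import ceil


def days_to_run(users_per_variant: int, daily_visitors: int,
                variants: int = 2) -> int:
    """Days needed when daily traffic is split evenly across all variants."""
    return ceil(users_per_variant * variants / daily_visitors)


def fits_window(users_per_variant: int, daily_visitors: int,
                max_days: int = 42, variants: int = 2) -> bool:
    """True if the test finishes within max_days (42 days = 6 weeks)."""
    return days_to_run(users_per_variant, daily_visitors, variants) <= max_days


# Hypothetical test needing 30,000 users per variant at 1,000 visitors a day.
print(days_to_run(30_000, 1_000))  # prints: 60
print(fits_window(30_000, 1_000))  # prints: False
```

A `False` result is the signal to reach for one of the alternatives listed above.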
Common Pitfalls
- Peeking at results early -- checking before reaching sample size inflates false positive rate. Commit to a runtime upfront.
- Stopping at first significant result -- random fluctuations can look significant early. Use sequential testing if you must peek.
- Testing too many variants -- each variant divides your traffic. Stick to 2-3 variants max.
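To see why peeking matters, you can simulate A/A tests (both arms identical) and compare how often "stop at the first significant peek" raises a false alarm versus a single look at the planned end. The sketch below is illustrative only: the function names, the 10% conversion rate, the ten evenly spaced looks, and the pooled z-test are all assumptions chosen for the demonstration.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold, about 1.96


def significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return se > 0 and abs(conv_b - conv_a) / n / se > Z_CRIT


def simulate(n_sims=1000, n_users=1000, looks=10, rate=0.10, seed=0):
    """Run A/A tests (no true difference) and count false positives two ways."""
    rng = random.Random(seed)
    step = n_users // looks
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        a = b = 0
        peeked = False
        for i in range(1, n_users + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            if i % step == 0 and significant(a, b, i):
                peeked = True  # a peeker would have stopped and declared a win
        peek_hits += peeked
        final_hits += significant(a, b, n_users)
    return peek_hits / n_sims, final_hits / n_sims


peeking_rate, fixed_rate = simulate()
print(f"stop at first significant peek: {peeking_rate:.1%}")
print(f"single look at the end:         {fixed_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate typically lands well above it, because every extra look is another chance for noise to cross the threshold.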
When to Skip the Framework
Some decisions don't need the full decision tree:
1. Regulatory/Compliance Requirement
Action: Just ship it. You don't have a choice. But: Document the change, set up monitoring, track any user impact.
2. Bug Fix
Action: Just fix it. No one A/B tests bug fixes. But: If the "bug fix" changes user behavior significantly, monitor post-fix metrics.
3. CEO/Board Mandate
Action: Document the decision and ship. Set up measurement so you can report on impact. But: Frame your measurement as "proving the impact" rather than "testing whether to do it." This builds credibility for future data-driven decisions.
4. Competitive Response
Action: If a competitor just shipped a similar feature and your users are asking for it, speed matters more than experimentation. Ship fast, measure after. But: Don't use "competitive pressure" as an excuse for every feature. Reserve this for genuine market urgency.
5. Sunset/Deprecation
Action: If you're removing a feature that <1% of users touch, just remove it with advance notice. But: If the feature has any paying customers relying on it, communicate early and provide alternatives.
Output Quality Self-Check
Before delivering the experiment decision, verify:
- Decision is clear -- the recommendation is explicitly "A/B test," "Ship + Monitor," or "Just Ship"
- Reversibility is assessed with specific reasoning (not just "yes/no")
- Hypothesis is stated in If/Then/Because format
- Power calculation is included if recommending a test (MDE, baseline, sample size, duration)
- Risk level is justified with specific stakes (revenue impact, user count affected)
- Cost of testing is weighed against cost of being wrong
- Edge cases are checked (compliance, bug fix, mandate, competitive response)
- Stakeholder consensus is noted -- does the team agree on the approach?
- Monitoring plan exists regardless of decision (even "just ship" needs dashboards)
- Next step is clear -- if testing, what metrics? If shipping, what success criteria?
- Connected to past decisions -- have we made similar decisions before? What happened?
Related Skills
- /experiment-metrics - Choose the right metrics to measure
- /activation-analysis - Test activation improvements
- /metrics-framework - Understand leading vs lagging metrics
- /define-north-star - Align tests to North Star
Framework credit: Adapted from Aakash Gupta's experiment decision frameworks. Read: https://www.news.aakashg.com/p/when-to-ab-test
Context Routing Strategy
When the PM uses /experiment-decision, I automatically:
1. Check Historical Reversibility Precedent
Source: thoughts/shared/product/decisions/, past decisions
- What I look for: Similar decisions, how reversibility was judged
- How I use it: Ensure consistent reversibility assessment
- Example: "Last time we shipped CSS changes without testing; this is similar"
2. Extract Success Metrics Framework
Source: thoughts/shared/pm/metrics/, active PRDs
- What I look for: What metrics you typically measure, variance patterns
- How I use it: Calculate minimum detectable effect (MDE) more accurately
- Example: "Based on your metrics history, conversion rate variance is 3%, so MDE = 2%"
3. Route to Experiment Metrics if Testing
Source: Connection to /experiment-metrics skill
- What I look for: Whether decision routes to testing
- How I use it: If decision is "test", auto-suggest next step with /experiment-metrics
- Example: "Now that you've decided to test, let's pick the right metrics using STEDII"
4. Check Stakeholder Consensus on Risk
Source: thoughts/shared/pm/context/stakeholder-template.md, recent discussions
- What I look for: Stakeholder risk tolerance, veto power
- How I use it: Surface if high-risk decision needs executive approval
- Example: "CEO is risk-averse, so even medium-risk decisions should be tested"
5. Calculate Cost of Testing vs Shipping
Source: Team capacity, past experiment timelines
- What I look for: How long experiments take, engineering cost
- How I use it: ROI calculation in the framework
- Example: "Last experiment took 3 weeks; if we ship in 1 week and monitor, ROI favors shipping"