Experiment Decision Framework: When to A/B Test vs Ship

Quick Start

/experiment-decision

Then provide:

What you're considering building (feature, change, or experiment)
Expected impact (metric + estimated improvement)
Your concern (is this risky? reversible? controversial?)

I'll walk you through the decision tree: reversibility, hypothesis strength, detectable impact, and risk level. You'll get a clear recommendation: A/B test, ship + monitor, or just ship.

Output: Decision documented inline or saved to thoughts/shared/product/decisions/

Time: ~5 min for clear-cut cases, ~15 min for nuanced decisions

When to use: Before building any feature, when stakeholders demand "data-driven" decisions, or when unsure if testing is worth the effort

Framework source: Aakash Gupta's "When to A/B Test vs Just Ship"

The Decision Framework

Use this decision tree:

Question 1: Is it reversible?

If YES → Ship it

CSS changes
Messaging tweaks
UI polish
Non-destructive features

Why: Reversible changes have low risk. Ship, monitor, and roll back if needed.

If NO → Continue to Question 2

Question 2: Do you have a hypothesis with measurable impact?

If NO → Don't test

Building "nice to haves"
No clear success metric
Can't measure the outcome

Why: Testing without a hypothesis is wasteful. Either clarify the hypothesis or don't build it.

If YES → Continue to Question 3

Question 3: Is the expected impact large enough to detect?

Run a power calculation:

Minimum Detectable Effect (MDE) = Effect you need to see to justify the work

If your feature is expected to improve conversion by 0.5%, but you need 10M users to detect it → Don't test, just ship and monitor

If impact is too small to detect → Ship without test

If impact is detectable → Continue to Question 4
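Whether an impact is "detectable" usually comes down to how long the test would have to run. A minimal sketch of that check, assuming traffic is split evenly across variants; the function names are hypothetical, and the 42-day default is taken from the 4-6 week limit mentioned under "When Traffic Is Too Low":

```python
import math


def days_to_run(visitors_per_variant: int, daily_visitors: int, variants: int = 2) -> int:
    """Days needed when daily traffic is split evenly across all variants."""
    per_variant_per_day = daily_visitors / variants
    return math.ceil(visitors_per_variant / per_variant_per_day)


def detectable_in_time(
    visitors_per_variant: int,
    daily_visitors: int,
    variants: int = 2,
    max_days: int = 42,  # assumption: upper end of the 4-6 week limit
) -> bool:
    """True if the test can reach its sample size within max_days."""
    return days_to_run(visitors_per_variant, daily_visitors, variants) <= max_days


# 10M users in total (5M per variant) at 50K visitors per day
print(days_to_run(5_000_000, 50_000))         # 200
print(detectable_in_time(5_000_000, 50_000))  # False
```

If the answer is False, the impact is too small to detect at your traffic level, and the tree says to ship without a test.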

Question 4: Is the risk of being wrong high?

High risk scenarios:

Affects revenue directly (pricing, checkout)
Impacts core user experience (onboarding, core flows)
Controversial decision (stakeholder disagreement)
Large engineering investment

If HIGH risk → A/B test

If LOW risk → Ship without test
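The four questions can be sketched as a small function. This is only an illustration of the tree as written above; the class, field, and function names are hypothetical, and the answers to each question still have to come from your own judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP_AND_MONITOR = "Ship, monitor, roll back if needed"
    CLARIFY_OR_DROP = "Don't test: clarify the hypothesis or don't build"
    SHIP_WITHOUT_TEST = "Ship without test"
    AB_TEST = "A/B test"


@dataclass
class Change:
    reversible: bool                 # Question 1
    has_measurable_hypothesis: bool  # Question 2
    impact_detectable: bool          # Question 3
    high_risk: bool                  # Question 4


def decide(change: Change) -> Decision:
    """Walk the questions in order and stop at the first exit."""
    if change.reversible:
        return Decision.SHIP_AND_MONITOR
    if not change.has_measurable_hypothesis:
        return Decision.CLARIFY_OR_DROP
    if not change.impact_detectable:
        return Decision.SHIP_WITHOUT_TEST
    if change.high_risk:
        return Decision.AB_TEST
    return Decision.SHIP_WITHOUT_TEST


pricing_change = Change(
    reversible=False,
    has_measurable_hypothesis=True,
    impact_detectable=True,
    high_risk=True,
)
print(decide(pricing_change).value)  # A/B test
```

Note that this follows the questions literally, so any reversible change exits at Question 1 regardless of risk.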

Decision Matrix

Risk Level	Impact Size	Reversible?	Decision
High	Large	No	A/B Test
High	Large	Yes	A/B Test (or ship with kill switch)
High	Small	No	Don't build
High	Small	Yes	Ship + Monitor
Low	Large	No	Ship + Monitor
Low	Large	Yes	Just Ship
Low	Small	No	Just Ship
Low	Small	Yes	Just Ship
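If you want the matrix in a script or notebook, it can be kept as a plain lookup table. The sketch below copies the eight rows as they appear above; the names DECISION_MATRIX and lookup_decision are hypothetical.

```python
# (risk, impact, reversible) -> decision, copied row by row from the matrix
DECISION_MATRIX = {
    ("high", "large", False): "A/B Test",
    ("high", "large", True): "A/B Test (or ship with kill switch)",
    ("high", "small", False): "Don't build",
    ("high", "small", True): "Ship + Monitor",
    ("low", "large", False): "Ship + Monitor",
    ("low", "large", True): "Just Ship",
    ("low", "small", False): "Just Ship",
    ("low", "small", True): "Just Ship",
}


def lookup_decision(risk: str, impact: str, reversible: bool) -> str:
    """Return the matrix decision for a risk level, impact size, and reversibility."""
    key = (risk.strip().lower(), impact.strip().lower(), reversible)
    if key not in DECISION_MATRIX:
        raise ValueError(f"No matrix row for risk={risk!r}, impact={impact!r}")
    return DECISION_MATRIX[key]


print(lookup_decision("High", "Small", reversible=False))  # Don't build
```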

When to A/B Test

✅ Test When:

1. High-stakes decisions

Pricing changes
Checkout flow modifications
Core product changes
Revenue-impacting features

2. Controversial hypotheses

Team is divided on approach
Stakeholders disagree
User research is conflicting

3. Long-term bets

Features that are expensive to reverse
Architectural decisions
Platform changes

4. Optimization work

Conversion rate improvements
Engagement optimization
Retention experiments

When to Just Ship

✅ Ship When:

1. Fast iteration needed

Competitive pressure
Time-sensitive opportunities
Market windows closing

2. Low risk, high certainty

Bug fixes
Obvious improvements
User-requested features (with clear demand)

3. Qualitative insights are strong

Clear user pain validated through research
Competitive parity features
Accessibility improvements

4. Testing would take too long

Small user base (can't reach significance)
Slow conversion cycles (months to convert)
Complex setup (weeks to build test infrastructure)

The Cost of A/B Testing

Time costs:

Engineering: 2-4 weeks to build test infrastructure
Analysis: 1-2 weeks to run experiment + analyze
Total: 3-6 weeks delay

Engineering costs:

Feature flagging system
Analytics instrumentation
A/A test validation
Test maintenance

Opportunity costs:

Could have shipped 3-5 other features
Delayed value delivery to users
Competitors may ship first

When testing costs exceed value → Just ship

Real-World Examples

Example 1: Amazon's "Add to Cart" Button Color

Decision: A/B Test

High traffic (millions of users)
Direct revenue impact
Easy to detect small improvements
Result: +2% conversion = $100M+ annually

Example 2: Slack's Message Threading

Decision: Just Ship

Highly requested feature
Strong qualitative signal from users
Reversible (users can ignore threads)
Result: Successful launch, became core feature

Example 3: Netflix's "Are you still watching?" prompt

Decision: A/B Test

Controversial (could annoy users)
Impact on engagement unclear
Risk of hurting retention
Result: Test showed improved engagement (prevented zombie sessions)

Common Mistakes

❌ Testing everything "to be data-driven"

Problem: Slows down velocity
Fix: Reserve tests for high-stakes decisions

❌ Shipping without monitoring

Problem: Bad changes go unnoticed
Fix: Ship with dashboards and alerts

❌ Running underpowered tests

Problem: Wastes time on inconclusive results
Fix: Calculate sample size before starting

❌ Testing when qualitative data is clear

Problem: Delays obvious improvements
Fix: Trust strong user research signals

Quick Reference Checklist

Before building any feature, ask:

Is this reversible? (If yes → ship)
Do I have a clear hypothesis? (If no → don't build)
Can I measure the impact? (If no → don't test)
Is the expected impact large enough to detect? (Power calculation)
What's the risk of being wrong? (High risk → test)
What's the cost of testing vs shipping? (ROI check)
Do I have strong qualitative data? (If yes → consider shipping)

Statistical Power Guidance

Before committing to an A/B test, estimate whether you have enough traffic to detect a meaningful difference.

Power Calculation Essentials

Three inputs you need:

Minimum Detectable Effect (MDE) -- what's the smallest improvement worth detecting?
- For checkout conversion: 1-2% relative change matters (high revenue impact)
- For feature adoption: 5-10% relative change is typical MDE
- For engagement metrics: 3-5% relative change is reasonable
Baseline conversion rate -- what's the current rate you're trying to improve?
- For the same relative MDE, lower baselines need more visitors per variant (there are fewer conversions to compare)
- Higher baselines reach the required sample faster at the same relative MDE
Daily traffic to the experiment -- how many users will enter the test per day?

Rule of Thumb

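For a two-variant test at 80% power and 95% confidence (two-sided), visitors needed per variant ≈ 16 × p × (1 − p) / d², where p is the baseline rate and d is the absolute difference to detect (baseline × relative MDE).

Baseline Rate	MDE (Relative)	Visitors Needed Per Variant	Conversions Per Variant (at baseline)	Days needed at 1K daily visitors (50/50 split)
50%	5%	~6,300	~3,100	~13 days
20%	5%	~25,600	~5,100	~51 days
5%	10%	~31,200	~1,600	~62 days
2%	10%	~80,700	~1,600	~161 days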
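To run the numbers for your own baseline and MDE, the standard normal-approximation formula for comparing two proportions can be sketched with the Python standard library. This is a planning estimate under stated assumptions (two-sided test, equal-sized variants, unpooled variance); the function name is hypothetical, and a dedicated power calculator may give slightly different figures.

```python
import math
from statistics import NormalDist


def visitors_per_variant(
    baseline: float,
    relative_mde: float,
    alpha: float = 0.05,  # 95% confidence, two-sided
    power: float = 0.80,
) -> int:
    """Approximate visitors needed in each variant of a two-variant test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


for baseline, mde in [(0.50, 0.05), (0.20, 0.05), (0.05, 0.10), (0.02, 0.10)]:
    n = visitors_per_variant(baseline, mde)
    print(f"baseline={baseline:.0%} mde={mde:.0%} -> {n:,} visitors per variant")
```

For a 50% baseline and a 5% relative MDE this comes out at roughly 6,300 visitors per variant. Divide by your daily visitors per variant to get the runtime in days.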

When Traffic Is Too Low

If your power calculation shows the test would take longer than 4-6 weeks:

Accept a larger MDE -- only test if you expect a big swing (15%+ improvement)
Use a composite metric -- combine multiple success signals into one metric for higher sensitivity
Run a qualitative test -- 5-10 user tests instead of a statistical A/B test
Just ship and monitor -- launch with clear success criteria, compare before/after with caveats
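Use Bayesian methods -- report the probability that the variant beats control, with a credible interval, instead of a p-value; this is often easier to act on with small samples, though the uncertainty stays wide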
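As an illustration of the Bayesian option, the probability that variant B beats variant A can be estimated with a Beta-Binomial model and Monte Carlo sampling. This is a minimal sketch, assuming a uniform Beta(1, 1) prior on each conversion rate; the function name and the example counts are hypothetical.

```python
import random


def prob_b_beats_a(
    conv_a: int,
    n_a: int,
    conv_b: int,
    n_b: int,
    draws: int = 20_000,
    seed: int = 0,
) -> float:
    """Estimate P(rate_B > rate_A) from posterior samples."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior under a uniform prior: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# 40/1000 conversions on A vs 55/1000 on B
print(f"P(B beats A) = {prob_b_beats_a(40, 1000, 55, 1000):.2f}")
```

A result near 0.5 means the data cannot separate the variants; decide upfront what probability you would act on.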

Common Pitfalls

Peeking at results early -- checking before reaching sample size inflates false positive rate. Commit to a runtime upfront.
Stopping at first significant result -- random fluctuations can look significant early. Use sequential testing if you must peek.
Testing too many variants -- each variant divides your traffic. Stick to 2-3 variants max.
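The effect of peeking can be seen in a small A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. This is an illustrative sketch; the rate, sample size, number of peeks, and function names are assumptions.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion z-test for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def false_positive_rates(rate=0.10, n_per_arm=1000, peeks=10, sims=1000, seed=0):
    """Return (rate when stopping at any significant peek, rate at the planned end)."""
    rng = random.Random(seed)
    step = n_per_arm // peeks
    peeking_hits = final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        flagged = False
        for peek in range(1, peeks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = is_significant(conv_a, conv_b, peek * step)
            flagged = flagged or significant
        peeking_hits += flagged
        final_hits += significant  # only the check at the planned sample size
    return peeking_hits / sims, final_hits / sims


peeking, planned = false_positive_rates()
print(f"with peeking: {peeking:.1%}, at planned sample size: {planned:.1%}")
```

The rate at the planned sample size should land near the nominal 5%, while stopping at the first significant peek pushes it well above that.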

When to Skip the Framework

Some decisions don't need the full decision tree:

1. Regulatory/Compliance Requirement

Action: Just ship it. You don't have a choice.
But: Document the change, set up monitoring, track any user impact.

2. Bug Fix

Action: Just fix it. No one A/B tests bug fixes.
But: If the "bug fix" changes user behavior significantly, monitor post-fix metrics.

3. CEO/Board Mandate

Action: Document the decision and ship. Set up measurement so you can report on impact.
But: Frame your measurement as "proving the impact" rather than "testing whether to do it." This builds credibility for future data-driven decisions.

4. Competitive Response

Action: If a competitor just shipped a similar feature and your users are asking for it, speed matters more than experimentation. Ship fast, measure after.
But: Don't use "competitive pressure" as an excuse for every feature. Reserve this for genuine market urgency.

5. Sunset/Deprecation

Action: If you're removing a feature that <1% of users touch, just remove it with advance notice.
But: If the feature has any paying customers relying on it, communicate early and provide alternatives.

Output Quality Self-Check

Before delivering the experiment decision, verify:

Decision is clear -- the recommendation is explicitly "A/B test," "Ship + Monitor," or "Just Ship"
Reversibility is assessed with specific reasoning (not just "yes/no")
Hypothesis is stated in If/Then/Because format
Power calculation is included if recommending a test (MDE, baseline, sample size, duration)
Risk level is justified with specific stakes (revenue impact, user count affected)
Cost of testing is weighed against cost of being wrong
Edge cases are checked (compliance, bug fix, mandate, competitive response)
Stakeholder consensus is noted -- does the team agree on the approach?
Monitoring plan exists regardless of decision (even "just ship" needs dashboards)
Next step is clear -- if testing, what metrics? If shipping, what success criteria?
Connected to past decisions -- have we made similar decisions before? What happened?

Related Skills

/experiment-metrics - Choose the right metrics to measure
/activation-analysis - Test activation improvements
/metrics-framework - Understand leading vs lagging metrics
/define-north-star - Align tests to North Star

Framework credit: Adapted from Aakash Gupta's experiment decision frameworks. Read: https://www.news.aakashg.com/p/when-to-ab-test

Context Routing Strategy

When the PM uses /experiment-decision, I automatically:

1. Check Historical Reversibility Precedent

Source: thoughts/shared/product/decisions/, past decisions

What I look for: Similar decisions, how reversibility was judged
How I use it: Ensure consistent reversibility assessment
Example: "Last time we shipped CSS changes without testing; this is similar"

2. Extract Success Metrics Framework

Source: thoughts/shared/pm/metrics/, active PRDs

What I look for: What metrics you typically measure, variance patterns
How I use it: Calculate minimum detectable effect (MDE) more accurately
Example: "Based on your metrics history, weekly conversion rate fluctuates by about 3%, so an MDE smaller than that will be hard to separate from noise"

3. Route to Experiment Metrics if Testing

Source: Connection to /experiment-metrics skill

What I look for: Whether decision routes to testing
How I use it: If decision is "test", auto-suggest next step with /experiment-metrics
Example: "Now that you've decided to test, let's pick the right metrics using STEDII"

4. Check Stakeholder Consensus on Risk

Source: thoughts/shared/pm/context/stakeholder-template.md, recent discussions

What I look for: Stakeholder risk tolerance, veto power
How I use it: Surface if high-risk decision needs executive approval
Example: "CEO is risk-averse, so even medium-risk decisions should be tested"

5. Calculate Cost of Testing vs Shipping

Source: Team capacity, past experiment timelines

What I look for: How long experiments take, engineering cost
How I use it: ROI calculation in the framework
Example: "Last experiment took 3 weeks; if we ship in 1 week and monitor, ROI favors shipping"