Experiment Metrics Selection: STEDII Framework

When to use: Before launching any experiment, when metrics feel unreliable, or when experiment results are confusing

Framework source: Aakash Gupta's "How to Choose the Right Metrics to Evaluate Experiments"

The STEDII Framework

Choose experiment metrics that are:

Sensitive
Timely
Efficient
Debuggable
Interpretable
Isolated

1. Sensitive (Detects Small But Meaningful Changes)

What it means: The metric moves when your feature actually improves the experience

Bad example:

Metric: Monthly Active Users (MAU)
Problem: Too coarse. A good onboarding improvement might not move MAU for months.

Good example:

Metric: Day 7 activation rate
Why: Sensitive enough to detect onboarding improvements within a week

How to check: Ask: "If this experiment succeeds, will this metric move within the experiment window?"

Common mistake: Using metrics that are too aggregated (MAU, total revenue) when you need something more granular (daily activation, conversion rate by cohort).

2. Timely (Results Available Quickly)

What it means: You get signal fast enough to make decisions

Bad example:

Metric: 90-day retention
Problem: Takes 90 days to know if your experiment worked

Good example:

Metric: Day 7 retention + leading indicators
Why: Faster feedback, correlates with long-term retention

Tradeoff alert: Sometimes you NEED slow metrics (LTV, annual retention). In those cases:

Use leading indicators to get fast signal
Run smaller experiments to validate
Accept longer experiment duration for critical decisions

How to check: Ask: "Can I get actionable results within [1 week / 2 weeks / 1 month]?"
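
A rough way to answer that question is to add the enrollment time needed to reach the required sample to the metric's maturation window (a Day 7 metric cannot be read until 7 days after the last user enrolls). The Python sketch below assumes steady daily traffic; the function name and the example inputs are illustrative, not part of the STEDII framework.

```python
import math


def days_to_readout(required_users: int, eligible_users_per_day: int,
                    maturation_days: int) -> int:
    """Estimate calendar days until the metric is readable for all enrolled users.

    required_users: total users needed across all variants (hypothetical input).
    eligible_users_per_day: users entering the experiment each day.
    maturation_days: how long each user must be observed (7 for a Day 7 metric).
    """
    if eligible_users_per_day <= 0:
        raise ValueError("eligible_users_per_day must be positive")
    enrollment_days = math.ceil(required_users / eligible_users_per_day)
    return enrollment_days + maturation_days


# Same traffic and sample size, different maturation windows.
day7_readout = days_to_readout(20_000, 2_000, 7)
day90_readout = days_to_readout(20_000, 2_000, 90)
print(f"Day 7 metric: {day7_readout} days, 90-day metric: {day90_readout} days")
```

With these assumed inputs the maturation window, not the enrollment period, dominates the wait for the slow metric.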

3. Efficient (High Statistical Power)

What it means: You can detect the effect with reasonable sample size and time

Bad example:

Metric: Revenue per user
Problem: High variance, need massive sample sizes

Good example:

Metric: Conversion rate
Why: Lower variance, reaches significance faster

Statistical power explained:

Power = probability of detecting a real effect when one exists
Higher variance metrics = lower power = longer experiments
Rule of thumb: Sample size needed ∝ Variance / (Expected Effect Size)²

How to check: Run a power calculation:

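Minimum sample size per group = 2 × (Z_α/2 + Z_β)² × (σ² / δ²)
Where:
- Z_α/2 = critical value for the confidence level (1.96 for 95%, two-sided)
- Z_β = critical value for the desired power (0.84 for 80% power)
- σ = standard deviation of metric
- δ = minimum detectable effect (absolute, in the metric's own units)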

Practical tip: If you need >1M users to detect a 5% lift, your metric isn't efficient enough.
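
As a minimal sketch of that power calculation in Python, assume a two-sided test comparing two equal-sized groups, so that n per group = 2 × (Z_α/2 + Z_β)² × σ² / δ². The function name, the 10% baseline, and the 5% relative lift are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH group to detect an absolute difference `delta`.

    sigma: standard deviation of the metric (per user).
    delta: minimum detectable effect, in the metric's own units.
    alpha: two-sided significance level (0.05 corresponds to 95% confidence).
    power: probability of detecting the effect if it is real.
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Assumed example: 10% baseline conversion, detect a 5% relative lift (0.5 points).
baseline = 0.10
sigma = math.sqrt(baseline * (1 - baseline))  # std dev of a 0/1 conversion outcome
delta = baseline * 0.05
print(sample_size_per_group(sigma, delta))    # tens of thousands per group
```

Because the effect size is squared in the denominator, halving the minimum detectable effect roughly quadruples the required sample.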

4. Debuggable (Easy to Diagnose Issues)

What it means: When something goes wrong, you can figure out why

Bad example:

Metric: "Engagement score" (black box formula)
Problem: If it drops, you don't know what broke

Good example:

Metric: Click-through rate (CTR)
Why: Simple, transparent, easy to debug

How to check: Ask: "If this metric tanks, can I quickly understand what happened?"

What makes metrics debuggable:

✅ Simple calculations
✅ Can be broken down by segments
✅ Can view user-level data
✅ Clear numerator and denominator

Red flags:

❌ Proprietary "engagement scores"
❌ Complex weighted formulas
❌ Metrics with 5+ variables
❌ Black box ML model outputs
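
To illustrate why a clear numerator and denominator help, here is a short Python sketch that breaks click-through rate down by segment. The row layout `(segment, impressions, clicks)` and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def ctr_by_segment(rows):
    """Return {segment: clicks / impressions} from (segment, impressions, clicks) rows."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [impressions, clicks]
    for segment, impressions, clicks in rows:
        totals[segment][0] += impressions
        totals[segment][1] += clicks
    # Skip segments with no impressions to avoid dividing by zero.
    return {seg: clicks / imps for seg, (imps, clicks) in totals.items() if imps > 0}


# Hypothetical daily rows: a drop on mobile is visible immediately.
rows = [
    ("mobile", 1000, 50),
    ("desktop", 1000, 80),
    ("mobile", 1000, 30),
]
for segment, ctr in ctr_by_segment(rows).items():
    print(f"{segment}: {ctr:.1%}")
```

A composite score offers no equivalent breakdown: when it drops, there is no single numerator or denominator to inspect.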

5. Interpretable (Easy to Understand and Explain)

What it means: Stakeholders can understand what the metric represents

Bad example:

Metric: "Quality-adjusted sessions per visitor"
Problem: What does "quality-adjusted" mean?

Good example:

Metric: "% of users who complete onboarding"
Why: Crystal clear what it measures

The grandma test: Can you explain this metric to your grandma? If not, it fails interpretability.

How to check:

Can you explain it in one sentence?
Would a new PM understand it immediately?
Can executives grasp it without training?

6. Isolated (Measures Only What You Changed)

What it means: The metric moves because of your experiment, not external factors

Bad example:

Metric: Total signups
Problem: Could move due to marketing campaigns, seasonality, competitor changes

Good example:

Metric: Signup conversion rate (for signup flow experiment)
Why: Isolated to the signup flow you're testing

Common isolation failures:

Network effects (social features affect all users)
Cross-contamination (treatment bleeds to control)
Seasonality (holiday effects)
Marketing campaigns running simultaneously

How to check: Ask: "Could something OTHER than my experiment cause this metric to move?"

How to Use This Framework

Step 1: List Your Candidate Metrics

Use /experiment-metrics

I'm running an experiment to: [describe your experiment]

Help me brainstorm 5-10 candidate metrics we could measure.

Step 2: Score Each Metric Against STEDII

Create a table:

Metric	Sensitive?	Timely?	Efficient?	Debuggable?	Interpretable?	Isolated?	Total Score
Metric 1	2/3	3/3	2/3	3/3	3/3	2/3	15/18
Metric 2	3/3	1/3	3/3	2/3	3/3	3/3	15/18

Scoring:

3 = Excellent
2 = Acceptable
1 = Poor
0 = Fails this criterion
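
The scoring table can also be kept as data and totaled programmatically. The sketch below reuses the two example rows from the table above; the `weak_criteria` helper and its `floor` threshold are illustrative assumptions rather than part of the framework.

```python
CRITERIA = ("sensitive", "timely", "efficient",
            "debuggable", "interpretable", "isolated")
MAX_PER_CRITERION = 3


def stedii_total(scores: dict) -> int:
    """Sum the six STEDII scores (each 0-3), validating the input first."""
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing score for {criterion}")
        if not 0 <= scores[criterion] <= MAX_PER_CRITERION:
            raise ValueError(f"{criterion} must be between 0 and {MAX_PER_CRITERION}")
    return sum(scores[c] for c in CRITERIA)


def weak_criteria(scores: dict, floor: int = 2) -> list:
    """List criteria scored below `floor` (assumed cutoff: below 'Acceptable')."""
    return [c for c in CRITERIA if scores[c] < floor]


candidates = {
    "Metric 1": dict(zip(CRITERIA, (2, 3, 2, 3, 3, 2))),
    "Metric 2": dict(zip(CRITERIA, (3, 1, 3, 2, 3, 3))),
}
for name, scores in candidates.items():
    total = stedii_total(scores)
    print(f"{name}: {total}/18, weak on: {weak_criteria(scores) or 'nothing'}")
```

Both example metrics total 15/18, but listing the weak criteria shows that Metric 2 reaches that total despite a poor Timely score, which the total alone hides.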

Step 3: Select Primary + Guardrail Metrics

Primary metric: The ONE metric your experiment is designed to move

Should score 15+/18 on STEDII
The metric you'll make decisions on

Guardrail metrics (3-5): Metrics you DON'T want to hurt

Revenue (don't tank it)
Core engagement (don't break the product)
Quality metrics (don't hurt user experience)

Example:

Primary: Day 7 activation rate
Guardrails: Revenue per user, Daily active users, Customer satisfaction score, Page load time
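
One way to make the primary-plus-guardrail rule explicit is a small decision function. This Python sketch is illustrative only: the tolerances and observed changes are hypothetical, and it ignores statistical significance, which a real readout would need to check first.

```python
def ship_decision(primary_lift: float, guardrail_changes: dict, tolerances: dict):
    """Return ("ship" or "hold", list of breached guardrails).

    primary_lift: relative change in the primary metric (0.02 means +2%).
    guardrail_changes: relative change per guardrail, signed so that negative
        means "got worse" (flip the sign for metrics such as page load time,
        where an increase is bad).
    tolerances: largest acceptable degradation per guardrail (0.01 means 1%).
    """
    breached = [
        name for name, change in guardrail_changes.items()
        if change < -tolerances[name]
    ]
    if primary_lift > 0 and not breached:
        return "ship", breached
    return "hold", breached


# Hypothetical readout for the example above.
decision, breached = ship_decision(
    primary_lift=0.03,  # Day 7 activation rate up 3%
    guardrail_changes={
        "revenue_per_user": 0.002,
        "daily_active_users": -0.004,
        "customer_satisfaction": -0.03,
        "page_load_time": -0.005,  # already sign-flipped: slightly slower
    },
    tolerances={
        "revenue_per_user": 0.01,
        "daily_active_users": 0.01,
        "customer_satisfaction": 0.01,
        "page_load_time": 0.01,
    },
)
print(decision, breached)
```

Agreeing on the tolerances before launch keeps the ship decision from being renegotiated after the results are in.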

Step 4: Run Pre-Experiment Checks

Before launching:

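A/A Test - Run experiment with no actual change
- Both groups should show no statistically significant difference
- If metrics differ significantly, you have a setup problem
Sample Ratio Check - Verify 50/50 split is actually 50/50
- If the split deviates more than chance allows (for example, 52/48 across thousands of users), investigate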
Metric Stability - Check historical variance
- High variance = longer experiment needed
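
The sample ratio check can be run as a chi-squared goodness-of-fit test. The sketch below uses only the Python standard library and relies on the fact that, with one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)). The function name and the example counts are illustrative assumptions.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """P-value for the observed split under the intended allocation.

    expected_ratio: intended share of users in control (0.5 for a 50/50 split).
    A very small p-value suggests a sample ratio mismatch worth investigating.
    """
    total = n_control + n_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# The same 52/48 split at two different sample sizes.
print(f"52 vs 48 users:     p = {srm_p_value(52, 48):.3f}")
print(f"5200 vs 4800 users: p = {srm_p_value(5200, 4800):.6f}")
```

The same 52/48 split is unremarkable on 100 users and a strong warning sign on 10,000, so the threshold for investigating should be a p-value rather than a fixed ratio.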

Common Metric Selection Mistakes

Mistake #1: Using Only One Metric

Problem: Optimize one thing, break another

Solution: Always have guardrail metrics

Primary: what you're trying to improve
Guardrails: what you don't want to hurt

Mistake #2: Confusing Leading and Lagging Metrics

Lagging metrics:

Slow to respond
Ultimate outcome you care about
Example: LTV, annual retention, NPS

Leading metrics:

Fast signal
Predictive of lagging metrics
Example: Day 7 retention, activation rate

Best practice: Use leading metrics to get fast signal, validate with lagging metrics on a sample.

Mistake #3: Metric Dilution

Problem: Testing a small feature but measuring site-wide metrics

Example:

Test: New checkout button color
Metric: Monthly revenue
Issue: Only 5% of users even see checkout, so the signal is too diluted

Solution: Measure metrics scoped to exposed users

Better metric: Revenue per checkout visitor
Or: Conversion rate (checkout started → completed)
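
To see why scoping helps, compare the users needed to detect the same relative lift on a checkout-scoped conversion rate versus a site-wide one. The Python sketch below makes several assumptions: a two-sided test at α = 0.05 with 80% power, the 5% exposure from the example above, a hypothetical 50% checkout completion baseline, a hypothetical 10% relative lift, and a treatment that does not change who reaches checkout. It treats conversion as a 0/1 outcome; a metric like monthly revenue would carry even more noise.

```python
import math
from statistics import NormalDist


def users_per_group(p_baseline: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per group to detect `relative_lift` on a conversion rate `p_baseline`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = p_baseline * relative_lift         # absolute effect size
    variance = p_baseline * (1 - p_baseline)   # variance of a 0/1 outcome
    return math.ceil(2 * z ** 2 * variance / delta ** 2)


exposure = 0.05        # share of users who reach checkout (from the example)
checkout_rate = 0.50   # hypothetical checkout started -> completed rate
lift = 0.10            # hypothetical relative lift from the new button

# Scoped metric: analyze only checkout visitors, then convert to total traffic.
scoped_visitors = users_per_group(checkout_rate, lift)
scoped_total_users = math.ceil(scoped_visitors / exposure)

# Site-wide metric: every user counts, most of whom never see the change.
sitewide_users = users_per_group(exposure * checkout_rate, lift)

print(f"Scoped:    {scoped_total_users} users per group")
print(f"Site-wide: {sitewide_users} users per group")
```

Under these assumptions the site-wide metric needs roughly twice the traffic to detect the same change.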

Mistake #4: Simpson's Paradox

Problem: Aggregate metric moves one way, segments move the opposite way

Example:

Overall conversion rate: +5% ✅
Mobile conversion: -10% ❌
Desktop conversion: -5% ❌
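Why? The traffic mix differed between groups: treatment had a larger share of the higher-converting segment, which lifted the aggregate even though each segment declined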

Solution: Always segment your metrics (new vs returning, mobile vs desktop, etc.)
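
The reversal is easier to believe with concrete numbers. The Python sketch below uses hypothetical counts, not data from a real experiment, in which both segments convert worse in treatment but the aggregate rate is higher because treatment has a larger share of the higher-converting desktop segment.

```python
# segment -> (users, conversions); all counts are hypothetical
control = {"mobile": (1000, 100), "desktop": (1000, 200)}
treatment = {"mobile": (500, 45), "desktop": (1500, 285)}


def segment_rate(group: dict, segment: str) -> float:
    users, conversions = group[segment]
    return conversions / users


def aggregate_rate(group: dict) -> float:
    total_users = sum(users for users, _ in group.values())
    total_conversions = sum(conversions for _, conversions in group.values())
    return total_conversions / total_users


for segment in control:
    c = segment_rate(control, segment)
    t = segment_rate(treatment, segment)
    print(f"{segment}: control {c:.1%}, treatment {t:.1%}")

print(f"overall: control {aggregate_rate(control):.1%}, "
      f"treatment {aggregate_rate(treatment):.1%}")
```

In a properly randomized experiment the segment mix should match across groups, so a reversal like this is also a signal to check assignment and exposure logging.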

Real-World Examples

Example 1: Netflix Thumbnail Test

Experiment: Testing new thumbnail images

Bad metric: Monthly viewing hours

Not sensitive (too aggregated)
Not timely (takes too long)
Not isolated (affected by content releases)

Good metric: Click-through rate on thumbnails

Sensitive: Directly measures thumbnail appeal
Timely: Results in 1-2 days
Efficient: Lots of impressions = fast significance
Debuggable: Can see which thumbnails work
Interpretable: "% of people who click"
Isolated: Measures only thumbnail change

Example 2: Booking.com Urgency Message Test

Experiment: Showing "Only 2 rooms left!" urgency message

Bad metric: Bookings per visitor

Not efficient (high variance)
Not timely (slow conversion cycle)

Good metrics:

Primary: Booking conversion rate
Guardrail: Customer satisfaction (don't annoy users)
Guardrail: Return visit rate (don't hurt trust)

Result: +2.5% conversion, but -5% satisfaction and -3% return visits

Decision: Don't ship. Guardrails caught a bad long-term tradeoff.

Quick Reference: Metric Selection Checklist

Before you launch an experiment, verify:

Primary metric clearly defined
- What are you measuring?
- How is it calculated?
- What's the minimum detectable effect?
STEDII checklist passed
- Sensitive enough to detect improvements
- Results available within [X] days
- Sample size achievable
- Can be debugged if issues arise
- Stakeholders understand it
- Isolated from external factors
Guardrails defined (3-5 metrics)
- Revenue metrics
- Engagement metrics
- Quality metrics
Statistical plan complete
- Significance level (usually α = 0.05, i.e. 95% confidence)
- Minimum sample size calculated
- Experiment duration estimated
- A/A test passed
Segmentation plan
- How will you break down results?
- New vs returning users
- Mobile vs desktop
- Geographic segments

Related Skills

/experiment-decision - Decide when to A/B test vs ship
/metrics-framework - Understand leading vs lagging metrics
/define-north-star - Choose your North Star Metric
/retention-analysis - Measure long-term impact

Framework credit: Adapted from Aakash Gupta's STEDII framework. Read the full article: https://www.news.aakashg.com/p/metrics-experiments

Context Routing Strategy

When the PM uses /experiment-metrics, I automatically:

1. Pull Metrics from PRDs & Strategy

Source: thoughts/shared/pm/prds/, success metrics defined there

What I look for: Feature's pre-defined success metrics, targets
How I use it: Pre-populate primary and secondary metrics for STEDII evaluation
Example: "Your PRD says success = conversion >60%, let's test if that's STEDII-compliant"

2. Query Analytics MCPs for Historical Data

Source: PostHog (if connected)

What I look for: Variance of potential metrics, time-to-signal data
How I use it: Validate metrics are Sensitive and Timely with real data
Example: "Metric X has 12% variance historically, so needs N=5000 sample size"

3. Check for Metric Conflicts with Guardrails

Source: thoughts/shared/pm/metrics/, company guardrails

What I look for: Metrics that must not decline, company KPIs
How I use it: Ensure secondary metrics include guardrails
Example: "NPS is a company guardrail, must include in secondary metrics"

4. Reference Past Experiments for Benchmarks

Source: thoughts/shared/pm/metrics/, A/B test results

What I look for: What worked in past experiments, surprising metric learnings
How I use it: Suggest metrics that detected real impacts before
Example: "In past experiments, page load time was poorly Sensitive, don't use it"

5. Route to Experiment Decision Framework

Source: Connection to /experiment-decision skill

What I look for: Is testing even the right call?
How I use it: If you should ship without testing, auto-flag before selecting metrics
Example: "CSS changes are reversible, don't need this full STEDII analysis"

Output Quality Self-Check

Before presenting output to the PM, verify:

Context was checked: Reviewed thoughts/shared/pm/metrics/ for existing experiments and baselines, and thoughts/shared/pm/prds/ for pre-defined success metrics
Each metric evaluated against all 6 STEDII dimensions: Every candidate metric has a score (0-3) for Sensitive, Timely, Efficient, Debuggable, Interpretable, and Isolated, with reasoning for each score
Sample size requirements calculated: The output includes a minimum sample size estimate for the primary metric based on expected effect size and variance
Metric sensitivity analysis included: The output states whether the expected change is detectable given current traffic, variance, and experiment duration
Guardrail metrics identified: At least 3 guardrail metrics are defined with acceptable ranges to prevent unintended harm
No vanity metrics without justification: If any metric could be considered a vanity metric (e.g., page views, total signups), the output explains why it is valid for this specific experiment