Yggdrasil

A/B Testing Statistics

Released
methodology · analytics

Design and analyze A/B tests with proper statistical methodology including sample size calculation, randomization, frequentist and Bayesian approaches, and sequential testing. Use this skill when the user needs to set up an experiment, calculate required sample size, interpret test results, or decide between testing methodologies — even if they say 'should we A/B test this', 'how many users do we need', 'is the test result conclusive', or 'can we stop the test early'.

Statistical methodology skill: analysis and application of A/B Testing Statistics.

View on GitHub

Methodology

IRON LAW: Calculate Sample Size BEFORE Running the Test

Running a test without knowing the required sample size leads to two
failures: stopping too early (false positives) or running too long (waste).

Required inputs: baseline conversion rate, minimum detectable effect (MDE),
significance level (α), power (1-β). Calculate BEFORE starting.

Sample Size Formula (Proportions)

n per group ≈ (Z_{α/2} + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁ − p₂)²

Quick reference (α=0.05, power=0.8):

| Baseline Rate | MDE (relative)  | N per Group |
|---------------|-----------------|-------------|
| 5%            | 10% (→ 5.5%)    | ~31,000     |
| 5%            | 20% (→ 6.0%)    | ~8,200      |
| 10%           | 10% (→ 11%)     | ~15,000     |
| 10%           | 20% (→ 12%)     | ~4,000      |
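The formula above can be sketched directly with the standard library; `sample_size_per_group` is an illustrative helper name, and the example numbers below are assumptions for demonstration, not figures from the skill itself:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Required n per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Baseline 10%, detect a 10% relative lift (10% -> 11%): roughly 15,000 per group
print(sample_size_per_group(0.10, 0.11))
```

Note how the required n scales with the inverse square of the absolute difference: halving the MDE roughly quadruples the sample size.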

Testing Approaches

| Approach                    | How It Works                                                  | Best When                                                          |
|-----------------------------|---------------------------------------------------------------|--------------------------------------------------------------------|
| Frequentist (fixed-horizon) | Set sample size, run to completion, then analyze              | Standard practice, well understood                                 |
| Bayesian                    | Update beliefs with data, compute probability of improvement  | You want probability statements ("90% chance B is better")         |
| Sequential testing          | Check results at intervals with adjusted thresholds           | You need to stop early on a clear winner, or limit downside risk   |
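The Bayesian row above can be illustrated with a minimal Monte Carlo sketch: under independent Beta(1, 1) priors, draw posterior conversion rates for each variant and count how often B beats A. The function name and input counts are assumptions for illustration:

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(B > A) under Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws

# 500/10,000 conversions on A vs 560/10,000 on B: roughly 0.97
print(round(prob_b_beats_a(500, 10_000, 560, 10_000), 3))
```

This is the kind of direct probability statement ("97% chance B is better") that frequentist p-values do not provide.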

Experiment Design Checklist

  1. Hypothesis: What do you expect to happen and why?
  2. Primary metric: ONE key metric (conversion, revenue, retention)
  3. Guardrail metrics: Metrics that must NOT degrade (page load time, error rate)
  4. Randomization unit: User, session, or device?
  5. Sample size: Calculated from baseline, MDE, α, power
  6. Duration: Account for weekly cycles (minimum 1-2 full weeks)
  7. Stopping rules: Pre-defined — do NOT peek and stop early without correction

Analysis Steps

  1. Check randomization balance (are groups comparable on pre-treatment metrics?)
  2. Calculate observed difference and confidence interval
  3. Run significance test (z-test for proportions, t-test for continuous)
  4. Check guardrail metrics
  5. Interpret with practical significance in mind
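Steps 2 and 3 can be sketched as a two-proportion z-test with a confidence interval; `two_proportion_test` and its example inputs are assumptions for illustration, using the conventional pooled SE for the test and unpooled SE for the interval:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test and CI for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled SE for the test statistic
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled SE for the confidence interval
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, z, p_value, ci

diff, z, p, ci = two_proportion_test(500, 10_000, 580, 10_000)
print(f"lift = {diff:.3%}, z = {z:.2f}, p = {p:.4f}")
```

A statistically significant result still needs step 5: check whether the lower end of the CI clears the minimum effect that justifies shipping.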

Output Format

# A/B Test Design: {Experiment Name}

Gotchas

  • Peeking inflates false positives: Checking results daily and stopping when p < 0.05 can produce a 30%+ false positive rate. Use sequential testing methods if you need to peek.
  • Novelty effect: New features may show a lift that fades as users get used to them. Run tests long enough (2+ weeks) to stabilize.
  • Simpson's paradox: An overall positive result can be negative in every subgroup (or vice versa). Segment by key dimensions.
  • Network effects / interference: If treatment users interact with control users (social features, marketplace), independence is violated. Use cluster randomization.
  • Statistical significance threshold is arbitrary: α=0.05 is convention, not truth. For high-stakes decisions (pricing, major UX changes), consider α=0.01.
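The peeking inflation in the first gotcha is easy to demonstrate: simulate A/A tests (no true difference), peek daily with an uncorrected z-test, and stop at the first p < 0.05. The function and its parameters are illustrative assumptions, scaled down to run quickly:

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(trials=400, days=14, users_per_day=200,
                                p=0.05, seed=1):
    """A/A simulation: stop at the first daily peek with p < 0.05."""
    rng = random.Random(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(days):
            n += users_per_day
            conv_a += sum(rng.random() < p for _ in range(users_per_day))
            conv_b += sum(rng.random() < p for _ in range(users_per_day))
            p_a, p_b = conv_a / n, conv_b / n
            pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(pool * (1 - pool) * (2 / n))
            if se > 0 and 2 * (1 - norm.cdf(abs(p_b - p_a) / se)) < 0.05:
                false_positives += 1  # "winner" declared with no real effect
                break
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 5%
```

With 14 uncorrected peeks the false positive rate typically lands in the 15-25% range rather than the nominal 5%, which is why pre-defined stopping rules or sequential corrections matter.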

References

  • For Bayesian A/B testing methodology, see references/bayesian-ab.md
  • For multi-armed bandit approach, see references/bandits.md

Tags

statistics · ab-testing · experimentation