A/B testing is the workhorse of product data science, but the statistical traps are subtle and consequential.

Peeking is the most common failure: checking results repeatedly while a test runs and stopping as soon as significance appears. Every check is another opportunity to find a spurious result, so false positive rates inflate dramatically. Sequential testing methods such as always-valid p-values or alpha-spending corrections handle interim looks properly, but most teams still run fixed-horizon tests and peek anyway.

Multiple comparisons compound the problem: testing five independent metrics simultaneously at 5 percent significance gives roughly a 23 percent chance of at least one false positive. Bonferroni correction, false discovery rate control, or a pre-registered primary metric address this, but practitioners often ignore them.

Novelty effects make new variants look better in the first week, as users react to anything different, and then performance regresses. Running tests for at least one to two full business cycles helps, but it lengthens iteration time.

Sample ratio mismatch, when traffic doesn't split in the intended proportions, usually indicates a broken implementation that invalidates the test entirely; always check for it before analyzing results.

Heterogeneous treatment effects mean the average effect can mask important variation across user segments; segment-level analysis often tells a different story than the headline number. Survivorship bias affects retention and engagement metrics: only users who stuck around contribute to later measurements, so apparent improvements may reflect filtering rather than treatment effects. Network effects break the independent-units assumption underlying most standard statistical tests.

Companies like Microsoft, Google, Booking.com, and Airbnb have published extensively on disciplined experimentation. The underlying lesson is that getting A/B testing right requires statistical rigor that most teams underinvest in until they ship a few changes that hurt the business. The sketches below make a few of these checks concrete.
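First, peeking. A minimal Monte Carlo sketch, assuming a two-proportion z-test on synthetic conversion data (the sample size, baseline rate, and number of interim looks are illustrative choices, not figures from the article): both arms share the same true conversion rate, so any "significant" result is a false positive by construction.

# Simulate how peeking at interim results inflates the false positive rate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_per_arm, looks, p_true, alpha = 2000, 10_000, 10, 0.05, 0.05

def z_test_p(c_a, n_a, c_b, n_b):
    # Two-proportion z-test p-value with a pooled standard error.
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_a / n_a - c_b / n_b) / se
    return 2 * (1 - norm.cdf(abs(z)))

fp_fixed = fp_peeking = 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < p_true   # control arm, true rate p_true
    b = rng.random(n_per_arm) < p_true   # treatment arm, same true rate
    checkpoints = np.linspace(n_per_arm / looks, n_per_arm, looks).astype(int)
    p_values = [z_test_p(a[:n].sum(), n, b[:n].sum(), n) for n in checkpoints]
    fp_fixed += p_values[-1] < alpha     # analyze once, at the planned horizon
    fp_peeking += min(p_values) < alpha  # stop at the first "significant" look

print(f"false positive rate, fixed horizon: {fp_fixed / n_sims:.3f}")   # about 0.05
print(f"false positive rate, peeking 10x:   {fp_peeking / n_sims:.3f}")  # well above 0.05

Sequential approaches like alpha spending work by tightening the per-look threshold so the overall error rate stays controlled across all interim analyses.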
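The multiple-comparisons arithmetic is easy to verify directly. This tiny sketch reproduces the roughly 23 percent figure and shows how a Bonferroni correction restores the family-wise error rate, assuming the metrics are independent:

# Family-wise error rate for several metrics tested at alpha = 0.05,
# with and without a Bonferroni correction.
alpha, n_metrics = 0.05, 5

fwer_uncorrected = 1 - (1 - alpha) ** n_metrics
fwer_bonferroni = 1 - (1 - alpha / n_metrics) ** n_metrics

print(f"P(at least one false positive), uncorrected: {fwer_uncorrected:.3f}")  # ~0.226
print(f"P(at least one false positive), Bonferroni:  {fwer_bonferroni:.3f}")   # ~0.049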
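A common way to check for sample ratio mismatch is a chi-square goodness-of-fit test of the observed split against the intended one. The counts and the 0.001 threshold below are illustrative assumptions for a 50/50 experiment, not values from the article:

# Sample ratio mismatch check: run this before reading any metric.
from scipy.stats import chisquare

observed = [50_912, 49_088]      # hypothetical users actually assigned to each arm
intended = [0.5, 0.5]            # the split the experiment was configured for
total = sum(observed)
expected = [ratio * total for ratio in intended]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:              # strict thresholds are typical for SRM checks
    print(f"Possible SRM (p = {p_value:.2e}): investigate assignment and logging first.")
else:
    print(f"No SRM detected (p = {p_value:.3f}).")

With large samples even a fraction-of-a-percent imbalance gets flagged, which is the point: assignment should match the configured split almost exactly, and a mismatch usually means the data pipeline, not the users, produced the difference.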
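Finally, a toy illustration of heterogeneous treatment effects; the segment names, sizes, and conversion rates are invented for the example. The pooled lift looks positive even though the larger segment is slightly harmed:

# Headline lift vs. segment-level effects on made-up data.
segments = {
    # segment: (users_control, rate_control, users_treatment, rate_treatment)
    "new users":       (20_000, 0.040, 20_000, 0.052),
    "returning users": (30_000, 0.100, 30_000, 0.094),
}

def pooled_rate(rows):
    users = sum(u for u, _ in rows)
    conversions = sum(u * r for u, r in rows)
    return conversions / users

control = pooled_rate([(u_c, r_c) for u_c, r_c, _, _ in segments.values()])
treatment = pooled_rate([(u_t, r_t) for _, _, u_t, r_t in segments.values()])
print(f"overall conversion: {control:.4f} -> {treatment:.4f}")  # small positive lift

for name, (u_c, r_c, u_t, r_t) in segments.items():
    print(f"{name}: {r_c:.3f} -> {r_t:.3f} ({(r_t - r_c) / r_c:+.1%})")
# returning users decline even though the headline number improves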
How A/B Testing Goes Wrong: The Most Common Statistical Pitfalls
A/B testing looks simple — show two versions, measure the difference, pick the winner. But peeking at results, multiple comparisons, and ignored novelty effects cause teams to ship changes that don't actually work. Understanding these traps separates real data scientists from dashboard-watchers.
ab-testing, experimentation, statistical-rigor, data-science-practice, analytics