Frameworks
Articles
Pinterest
Airbnb
Netflix
Stripe
Etsy
Uber
LinkedIn
Stitch Fix
Booking.com
Spotify
Papers
Microsoft - almost all available from their excellent website
Best papers
Other good references
LinkedIn
Google
Netflix
Facebook
DoorDash
Other
Good ideas for A/B test experiments from the papers
- Combined experiments: randomize per user, with each user having an x% chance of seeing the treatment (see the bucketing sketch after this list)
- Metrics definitions as discussed in the Microsoft paper
- Learning experiments where you intentionally degrade the experience to see how it affects the baseline
- Digging deeper into adoption and retention
- Surprising results should be replicated
- The risk of focusing only on small changes is incrementalism; pursue some high-ROI incremental changes, but also place big bets on audacious goals
- Changes rarely have a big impact on key metrics; corollary: only ~10% of experiments show any positive result
- Metric improvements measured on a segment should be diluted by that segment's share of overall traffic (mobile -> overall MW); see the worked dilution example after this list
- Treat borderline-significant results as tentative and rerun the experiments to verify them
- If a result looks too good (e.g. 8 standard deviations from the mean), check it again even though it is statistically significant
- Best to verify surprising results yourself, since many explanations for amazing A/B test results turn out to be wrong
- Latency's effect can be quantified by artificially slowing the site down (see the delay-injection sketch after this list)
- Could we run a speed-up experiment by returning the response immediately and lazily loading the social column?
- Reducing abandonment is hard, shifting clicks is easy
- Use the delta method instead of the bootstrap when data sizes are large (see Casella & Berger, Statistical Inference, or Wasserman, All of Statistics); a sketch for ratio metrics follows this list
- Use ANOVA as an alternative to t-tests when comparing the means of more than 2 samples (example after this list)
- Check experiment groups religiously for equal sizes and variances via A/A tests (see the A/A check sketch after this list)
- Check for browser-specific bugs
- Filter out users who never even reached the page the treatment is on; this lowers variance and improves power (aka triggering, sketched after this list)
- Check for server-related caching issues
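
The per-user bucketing idea above can be sketched roughly as follows. This is a minimal illustration, assuming a SHA-256 hash of the user id salted with a hypothetical experiment name and a 10% allocation; none of these specifics come from the papers.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 10.0) -> str:
    """Deterministically assign a user to control or treatment.

    Hashing user_id together with the experiment name gives a stable,
    roughly uniform bucket in [0, 100), so the same user always sees the
    same variant and different experiments are bucketed independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0  # value in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Example: a hypothetical experiment where 10% of users see the treatment.
print(assign_variant("user-42", "new-checkout-flow"))
```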
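
A worked version of the dilution point: a lift measured on a segment only moves the overall metric in proportion to that segment's traffic share. The 5% lift and 20% share are made-up numbers for illustration.

```python
# Hypothetical numbers: a 5% lift measured on mobile users,
# where mobile accounts for 20% of overall traffic.
segment_lift = 0.05
segment_share = 0.20

# The overall metric only improves by the lift diluted to the
# segment's share of traffic: 0.05 * 0.20 = 1%.
overall_lift = segment_lift * segment_share
print(f"Expected overall lift: {overall_lift:.1%}")
```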
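
A rough sketch of how a slowdown experiment might inject artificial latency. The 100 ms delay and the per-request coin flip are assumptions for illustration; a real system would use stable per-user bucketing as in the sketch above.

```python
import random
import time

ARTIFICIAL_DELAY_MS = 100  # hypothetical extra latency for the treatment group

def handle_request(user_id: str, render_page) -> str:
    """Serve a page, artificially delaying it for the latency treatment group.

    Comparing key metrics between the delayed and undelayed groups gives a
    causal estimate of what latency costs, without waiting for a real regression.
    """
    if random.random() < 0.5:  # illustrative 50/50 split per request
        time.sleep(ARTIFICIAL_DELAY_MS / 1000.0)
    return render_page(user_id)
```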
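
A sketch of the delta method for a ratio metric such as clicks per pageview, where per-user sums are the analysis unit. The simulated data is purely illustrative; see the references above for the derivation.

```python
import numpy as np

def delta_method_ratio_var(x: np.ndarray, y: np.ndarray) -> float:
    """Approximate Var(sum(x) / sum(y)) for per-user numerators x and denominators y.

    First-order Taylor (delta method) expansion of the ratio of means; far
    cheaper than bootstrapping when the number of users is large.
    """
    n = len(x)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_x / mean_y**2
            - 2 * mean_x * cov_xy / mean_y**3
            + mean_x**2 * var_y / mean_y**4) / n

# Simulated per-user pageviews and clicks.
rng = np.random.default_rng(0)
pageviews = rng.poisson(10, size=100_000) + 1
clicks = rng.binomial(pageviews, 0.1)
ctr = clicks.sum() / pageviews.sum()
se = np.sqrt(delta_method_ratio_var(clicks, pageviews))
print(f"CTR = {ctr:.4f} +/- {1.96 * se:.4f}")
```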
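
A small ANOVA example for comparing more than two variants at once; the simulated metric values are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
control = rng.normal(1.00, 0.5, size=5_000)
variant_a = rng.normal(1.02, 0.5, size=5_000)
variant_b = rng.normal(1.01, 0.5, size=5_000)

# One-way ANOVA tests whether any group mean differs, avoiding the inflated
# false-positive rate of running every pairwise t-test.
f_stat, p_value = f_oneway(control, variant_a, variant_b)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```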
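
The A/A checks can be partly automated. A sketch assuming a 50/50 split, using a chi-square test for sample ratio mismatch and Levene's test for unequal variances; the counts and simulated metric are made up.

```python
import numpy as np
from scipy.stats import chisquare, levene

# Observed user counts in an A/A test that was supposed to be split 50/50.
observed = np.array([100_480, 99_520])
expected = np.full(2, observed.sum() / 2)

# Sample ratio mismatch: a tiny p-value means the split itself is broken
# (bad bucketing, uneven logging loss, uneven bot filtering, ...).
srm_stat, srm_p = chisquare(observed, expected)
print(f"Sample ratio mismatch p-value: {srm_p:.4f}")

# Variance check on the metric between the two A/A groups.
rng = np.random.default_rng(2)
group_a = rng.normal(1.0, 0.5, size=10_000)
group_b = rng.normal(1.0, 0.5, size=10_000)
_, var_p = levene(group_a, group_b)
print(f"Equal-variance p-value: {var_p:.4f}")
```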
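
Finally, a sketch of triggering: restrict the analysis to users who actually reached the treated page. The column names, the 30% trigger rate, and the 0.05 effect are assumptions for illustration; the point is that the triggered analysis detects the same effect with a much smaller p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n = 50_000
df = pd.DataFrame({
    "variant": rng.choice(["control", "treatment"], size=n),
    "reached_page": rng.random(n) < 0.3,  # only 30% ever see the treated page
    "metric": rng.normal(1.0, 0.5, size=n),
})
# The treatment can only affect users who reached the page.
df.loc[df["reached_page"] & (df["variant"] == "treatment"), "metric"] += 0.05

def p_value(frame: pd.DataFrame) -> float:
    a = frame.loc[frame["variant"] == "control", "metric"]
    b = frame.loc[frame["variant"] == "treatment", "metric"]
    return ttest_ind(a, b).pvalue

print("all users:      p =", p_value(df))                      # diluted effect
print("triggered only: p =", p_value(df[df["reached_page"]]))  # same effect, more power
```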