Inside Facebook has a good post about how to not screw up your A:B testing. It is a useful reminder of how many tests you need to run before you can treat the results as statistically significant.
The author notes:
How many [tests] do we need to declare a statistically significant difference between [a design leading to action success rate] of p1 and one of p2? This is readily calculable:
* number of samples required per cell = 2.7 * (p1*(1-p1) + p2*(1-p2))/(p1-p2)^2
(By the way, the pre-factor of 2.7 has a one-sided confidence level of 95% and power of 50% baked into it. These have to do with the risk of choosing to switch when you shouldn’t and not switching when you should. We’re not running drug trials here so these two choices are fine for our purposes. The above calculation will determine the minimum and also the maximum you need to run.)
Thus, if you ran this number of tests in each cell and found that the difference in action success was at least as large as the gap between p1 and p2, the result would be significant at the one-sided 95% confidence level. In other words, the increase is unlikely to be chance alone, and you would move to the new design as your best practice.
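The quoted formula is simple enough to sketch in a few lines of Python. The function names below (prefactor, samples_per_cell) are my own, not from the original post, and the derivation of the pre-factor assumes the standard normal-approximation formula for comparing two proportions, where the pre-factor is the square of the sum of the z-scores for confidence and power. With 95% one-sided confidence and 50% power, that works out to roughly 2.7, which matches the quote.

```python
import math
from statistics import NormalDist


def prefactor(confidence=0.95, power=0.50):
    """Return (z_confidence + z_power)^2 for a one-sided test.

    Assumption: the quoted 2.7 comes from the usual normal approximation.
    """
    z_conf = NormalDist().inv_cdf(confidence)  # about 1.645 for 95%
    z_power = NormalDist().inv_cdf(power)      # 0 for 50% power
    return (z_conf + z_power) ** 2


def samples_per_cell(p1, p2, factor=2.7):
    """Samples needed in each cell to tell success rate p1 from p2."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(factor * variance_sum / (p1 - p2) ** 2)


print(round(prefactor(), 2))         # close to the quoted 2.7
print(samples_per_cell(0.10, 0.12))  # hypothetical 10% vs 12% rates
```

Raising the power argument above 0.5 makes the pre-factor larger, which is the trade-off the author alludes to when mentioning drug trials.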
The author reminds developers to adhere to A:B testing best practices, including:
# Running the two cells concurrently
# Randomly assigning each individual user to a cell and making sure they stay in that cell during the test
# Scheduling the test to neutralize time-of-day and day-of-week effects
# Serving users from countries that are of interest
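One common way to get assignment that is both random and sticky is to hash the user ID together with an experiment name. This is a minimal sketch of that idea, not something the original post prescribes: the function assign_cell, its parameters, and the experiment name are all hypothetical, and the hash gives a pseudo-random split rather than a truly random one.

```python
import hashlib


def assign_cell(user_id, experiment, test_fraction=0.5):
    """Deterministically map a user to the 'test' or 'control' cell.

    Hashing (experiment, user_id) yields a repeatable bucket, so the same
    user lands in the same cell on every request for this experiment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Reduce the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "control"


# The same user gets the same answer every time.
print(assign_cell("user-42", "new-invite-dialog"))
print(assign_cell("user-42", "new-invite-dialog"))
```

Because the cell is computed per request rather than stored, both cells run concurrently by construction, and including the experiment name in the hash keeps assignments in different experiments independent of each other.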
One thing that immediately emerges from this formula is that you don’t need that many tests to determine if a new design is working. For example, testing a design that anticipates increasing success from 0.5% to 0.575% needs only about 51k tests per cell. For apps and websites operating at scale, this does not take very long.
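To make the arithmetic behind that example concrete, here is the quoted formula applied step by step to the 0.5% and 0.575% rates. The variable names are mine; the only inputs are the two rates and the 2.7 pre-factor from the quote. Plugging them in lands a little above 51,000 per cell, in the neighborhood of the figure in the text.

```python
p1 = 0.005    # current success rate (0.5%)
p2 = 0.00575  # anticipated success rate (0.575%)

# Numerator: sum of the two binomial variance terms.
variance_sum = p1 * (1 - p1) + p2 * (1 - p2)

# Denominator: the squared difference we want to detect.
lift_squared = (p2 - p1) ** 2

n_per_cell = 2.7 * variance_sum / lift_squared

print(f"variance terms: {variance_sum:.6f}")
print(f"squared lift:   {lift_squared:.3e}")
print(f"tests per cell: {n_per_cell:,.0f}")
```

Note that the squared difference sits in the denominator, so halving the lift you hope to detect roughly quadruples the number of tests required.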
The danger is that, because of the overhead of putting up and taking down tests, “bad” test designs stay up for too long, exposing too many users to a worse experience than usual. While some people consider A:B testing to be splitting users into equal groups, there is no such requirement. I’d advise developers to size their test cells to be x% of their total traffic, where x is a little more than is required to hit the minimums calculated above over one week. Running for a full week neutralizes time-of-day and day-of-week effects, minimizes the overhead of test setup, and ensures that not too many users are exposed to bad designs. It also allows multiple, independent tests to be run simultaneously.
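As a rough sketch of that sizing rule, the helper below turns a per-cell minimum and a weekly traffic figure into the share of traffic a test cell should receive. Everything here is an assumption for illustration: the function name cell_fraction, the 10% safety margin standing in for "a little more", and the traffic number of two million weekly users.

```python
def cell_fraction(n_per_cell, weekly_users, margin=1.1):
    """Share of traffic a cell needs to collect n_per_cell in one week.

    margin > 1 adds headroom over the bare minimum (hypothetical default).
    """
    if weekly_users <= 0:
        raise ValueError("weekly_users must be positive")
    fraction = margin * n_per_cell / weekly_users
    if fraction > 0.5:
        # Two cells of this size would not fit into one week of traffic.
        raise ValueError("not enough weekly traffic; run the test longer")
    return fraction


# Hypothetical site with 2,000,000 users per week and a 51,322 minimum.
x = cell_fraction(51_322, 2_000_000)
print(f"test cell size: {x:.1%} of traffic")
```

With numbers like these, each test cell takes only a few percent of traffic, which is what leaves room for several independent tests to run side by side.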