A/B testing is an extremely powerful method to improve your website or product. However, the process of A/B testing is rife with pitfalls that can limit its effectiveness. Most of the common ones are statistical in nature. Was my test statistically significant? How much did this variation change some key metric? While important, these pitfalls are very apparent and relatively easy to avoid. The pitfalls subtly introduced by bias are not as apparent, but can be equally problematic.

Experimenter’s bias is the main enemy. Experimenter’s bias simply means the person running the test has a preferred result. This bias is going to affect every test that every single person runs in your organization. It cannot be eliminated, because it is human nature; the reason people run tests in the first place is that they believe they are going to improve something. However, if not managed properly, this bias can be quite pernicious.

Your organization is only as good as your worst test

Imagine a situation where your organization has run 10 tests. 9 of them were run properly and produced valid results (a definition I’ll get into more later), but 1 of them wasn’t. In that last test, the experimenter accidentally cheated and got a result that your organization wouldn’t have deemed valid. If you make a decision based on this invalid test, you might be reversing the gains of the previous 9 tests, or even worse.

Your product isn’t just one test, it is a series of tests. These tests taken together form a knowledge base about how your customers react to changes and new features. This knowledge base is a historical record of tests that improved a metric, tests that decreased a metric and tests that had no measurable effect. While tests that improve a metric are certainly great, the other two results are still very valuable additions. Sometimes knowing a path not to go down is as important as the alternative.

Knowing nothing is better than knowing the wrong thing

But one invalid test can poison this well. One bad decision based on a single invalid test can send you down a wrong path, which leads to more bad decisions. If you can’t trust the validity of your knowledge base, you can’t use it to make good decisions.

Defining what makes a valid test

There is no absolute truth when determining whether a test was valid or not. What passes as a valid test for one organization might not work for another. But as an organization, you can trade off the risks and benefits of various approaches and pick one that works for you.

There are generally two risks that you are worried about:

The risk that a test shows a significant result when no real effect is present (a false positive). This is quantified by statistical significance.
The risk that a test doesn’t show a significant result when a real effect is present (a false negative). This is quantified by statistical power.

Each of these is independent, but when taken in combination they help provide guidelines to define a valid test. The higher the values for each one, the more certain you can be that the results of your tests accurately reflect the true values. However, this comes at the cost of speed.

Picking a sample size

The speed of the test is based on two things: the traffic to the particular test, and the sample size required. The traffic is generally going to be some known rate, and you should use that to determine what is a maximum possible sample size given time constraints. For example, if a given page gets 1000 hits/day, and you don’t want a test to run longer than 2 weeks, you can’t run any tests on that page that require more than 14,000 samples.
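The time budget above is simple arithmetic, sketched here in Python. The function names, the default of two variations and the even traffic split are assumptions for illustration only; the 1000 hits/day and 14-day figures come from the example above.

```python
def max_total_samples(hits_per_day: int, max_days: int) -> int:
    """Upper bound on the samples a page can collect within the time limit.

    Assumes every hit counts as exactly one sample.
    """
    return hits_per_day * max_days


def fits_time_budget(required_per_variation: int, hits_per_day: int,
                     max_days: int, variations: int = 2) -> bool:
    """True if every variation can reach its required sample size in time.

    Assumes traffic is split evenly across variations (an assumption,
    not something stated in the post).
    """
    needed = required_per_variation * variations
    return needed <= max_total_samples(hits_per_day, max_days)


if __name__ == "__main__":
    # The example from the text: 1000 hits/day for at most 2 weeks.
    print(max_total_samples(1000, 14))
    # Illustrative per-variation requirements for a two-variation test.
    print(fits_time_budget(5000, 1000, 14))
    print(fits_time_budget(8000, 1000, 14))
```

Note that with two variations the 14,000-sample ceiling works out to 7,000 samples per variation, which is the number to compare against a per-variation sample size.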

Given this upper bound of possible samples based on time constraints, you now need to calculate the required sample size to meet your organization’s statistical constraints. This sample size is computed from four values:

The desired statistical significance
The desired statistical power
The current baseline conversion rate
The minimum detectable effect

The first two values you should already know, because as I’ve suggested above, they should be organization-wide values. The current baseline conversion rate is also known from historical data about a given metric.

The minimum detectable effect can be tweaked on a per-test basis. It should be as small as possible given the maximum possible number of samples in your time window. The more samples you have, the smaller the effects you can detect.

To compute the actual number of samples required, I recommend using Evan Miller’s Sample Size Calculator.

Here are a few quick examples:

Significance	Power	Current baseline conversion rate	Minimum detectable effect (relative)	Required Sample size
95%	80%	20%	20%	1,602 samples/variation
99%	90%	20%	20%	3,044 samples/variation
99%	90%	20%	10%	12,046 samples/variation
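For readers who want to see where numbers like these come from, here is a minimal Python sketch. It assumes a two-sided test and the textbook normal-approximation formula for comparing two proportions; it is not taken from the calculator's source, and the function names are hypothetical. Rounded to the nearest whole sample, this formula gives the same three figures as the table above. The second function runs the calculation in reverse, scanning for the smallest relative effect that fits a given per-variation budget.

```python
from math import sqrt
from statistics import NormalDist


def required_sample_size(significance: float, power: float,
                         baseline: float, relative_mde: float) -> int:
    """Samples per variation under a two-sided normal approximation.

    This is an assumed formula for illustration, not the calculator's code.
    """
    alpha = 1 - significance
    delta = baseline * relative_mde            # absolute effect to detect
    treated = baseline + delta                 # conversion rate if the effect is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    sd_null = sqrt(2 * baseline * (1 - baseline))
    sd_alt = sqrt(baseline * (1 - baseline) + treated * (1 - treated))
    n = (z_alpha * sd_null + z_power * sd_alt) ** 2 / delta ** 2
    return round(n)                            # nearest whole sample


def smallest_detectable_effect(budget_per_variation: int, significance: float,
                               power: float, baseline: float):
    """Smallest relative effect, in 1% steps, that fits the sample budget.

    Returns None if no effect on the grid fits. The 1% grid is an
    arbitrary choice made for this sketch.
    """
    pct = 1
    while baseline * (1 + pct / 100) < 1:
        needed = required_sample_size(significance, power, baseline, pct / 100)
        if needed <= budget_per_variation:
            return pct / 100
        pct += 1
    return None


if __name__ == "__main__":
    rows = [(0.95, 0.80, 0.20, 0.20),
            (0.99, 0.90, 0.20, 0.20),
            (0.99, 0.90, 0.20, 0.10)]
    for significance, power, baseline, mde in rows:
        n = required_sample_size(significance, power, baseline, mde)
        print(f"{significance:.0%} {power:.0%} {baseline:.0%} {mde:.0%} -> {n:,}")
    print(smallest_detectable_effect(7000, 0.95, 0.80, 0.20))
```

Under these assumptions, a budget of 7,000 samples per variation at 95% significance, 80% power and a 20% baseline supports a minimum detectable effect of 10% (relative) on the 1% grid. Rounding up instead of to the nearest sample would be the more conservative choice.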

Picking a sample size – in advance

Picking a proper sample size is important, but picking and publishing it in advance is just as important. Once a test is underway, the necessary sample size could be gamed by altering the minimum detectable effect to fit the early results of the test.

While it seems like only a malicious person would do that, when early results combine with the pre-existing experimenter’s bias, tweaking values to end a test early can seem like a sensible thing to do. But ending a test early means the values are not following the risk/benefit trade-offs your organization decided upon.

Putting it all together

Your organization should have the following policy: Valid tests can inform decision making and go into the knowledge base. Invalid tests have no (zip, zilch, nada, zero) value.

A valid test should have the following properties:

The required sample size per variation, along with the 4 values used to compute it (significance, power, baseline conversion rate, minimum detectable effect), was published publicly before the test was started
The test was run until the minimum sample size was reached per variation
The results were analyzed for statistical significance using a trustworthy A/B test calculator, like Evan Miller’s or Thumbtack’s ABBA. This should be binary: either the test showed a statistically significant result or it didn’t.

If these properties were met, regardless of whether the test resulted in a desirable or undesirable outcome, you can feel confident that the inherent bias of the experimenter did not interfere with the results.
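The three properties can be sketched as a small Python check. Everything here is hypothetical: the `ExperimentPlan` structure, the function names, and the choice of a pooled two-proportion z-test, which stands in for a real calculator and is not a reimplementation of Evan Miller’s tool or ABBA. The point is only that the plan is frozen before results arrive and the outcome is one of three fixed labels.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)
class ExperimentPlan:
    """Values fixed and published before the test starts (hypothetical)."""
    significance: float              # e.g. 0.95
    power: float                     # e.g. 0.80
    baseline: float                  # baseline conversion rate
    relative_mde: float              # minimum detectable effect (relative)
    sample_size_per_variation: int   # computed in advance from the four above


def reached_sample_size(plan: ExperimentPlan, n_a: int, n_b: int) -> bool:
    """The test ran until every variation hit the planned sample size."""
    return min(n_a, n_b) >= plan.sample_size_per_variation


def is_significant(plan: ExperimentPlan, conv_a: int, n_a: int,
                   conv_b: int, n_b: int) -> bool:
    """Two-sided pooled two-proportion z-test (an illustrative stand-in)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False                 # no variation at all, nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < 1 - plan.significance


def evaluate(plan: ExperimentPlan, conv_a: int, n_a: int,
             conv_b: int, n_b: int) -> str:
    """Return 'invalid', 'significant' or 'not significant'."""
    if not reached_sample_size(plan, n_a, n_b):
        return "invalid"             # stopped early: no value to the knowledge base
    if is_significant(plan, conv_a, n_a, conv_b, n_b):
        return "significant"
    return "not significant"


if __name__ == "__main__":
    plan = ExperimentPlan(0.95, 0.80, 0.20, 0.20, 1602)
    print(evaluate(plan, 100, 500, 150, 500))     # stopped before 1,602 samples
    print(evaluate(plan, 320, 1602, 400, 1602))
    print(evaluate(plan, 320, 1602, 330, 1602))
```

Because the plan is a frozen dataclass, its values cannot be reassigned after creation, which mirrors the idea that the minimum detectable effect should not be adjusted once results start coming in.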

Is this overkill?

If you are serious about developing a proper culture of testing, then no. Testing is a situation where the short-term incentives for an experimenter (producing “winning” tests, looking smart, getting code into production) can be directly opposed to the long-term incentives for the organization (building a knowledge base of trustworthy test results, actually improving key metrics).

The above policy takes advantage of a key fact: if you create the rules which govern how a test will be run before it starts getting results, you eliminate the most common ways of introducing bias. The more decisions you make after results start coming in, the more traps you are setting for yourself.

So while this policy might seem onerous at first, when accounting for the hidden costs of untrustworthy test results, it is actually quite worthwhile. If you are serious about testing, I recommend you try to incorporate this policy, or aspects of it, into your organization.

Special thanks to Shaggy, Fred & Velma for reading drafts of this post. More thanks to Chris Heiser, Eric Naeseth, Evan Miller and Steve Howard for their contributions to tools mentioned in this post.

Dan Birken