Calculate the statistical power of A/B tests

Contributors

@paul-boag


Business Benefits

Implement revenue-generating changes to your site based on testing.


Establish a minimum sample size and test duration to avoid underpowered tests.

Calculate a sample size large enough to test each variation and segment you want to analyze. Most A/B testing tools can calculate the sample size based on your traffic. Run tests for a minimum of two weeks or one full business cycle, but no more than four weeks, to avoid problems like sample pollution and cookie deletion.
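For a rough idea of what such a calculator does, here is a minimal sketch in Python. It assumes a two-sided two-proportion z-test with a normal approximation, which may differ from the method your testing tool uses. The function names, the 5% baseline conversion rate, the 10% relative MEI, and the 3,000 daily visitors are all hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mei, alpha=0.05, power=0.80):
    """Approximate visitors per variation for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mei)  # conversion rate you hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variation, variations, daily_visitors):
    """Days needed to collect the full sample, assuming an even traffic split."""
    return math.ceil(n_per_variation * variations / daily_visitors)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline, 10% relative MEI, 3,000 visitors per day.
    n = sample_size_per_variation(0.05, 0.10)
    days = estimated_duration_days(n, variations=2, daily_visitors=3000)
    print(f"{n} visitors per variation, about {days} days")
    if days > 28:
        print("Longer than four weeks: consider a larger MEI or fewer variations.")
```

If the estimated duration exceeds four weeks, the test is likely to run into the sample pollution problems mentioned above, so it is worth revisiting the inputs before launching.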

Determine the Minimum Effect of Interest (MEI) for the differences in results you want to detect.

The Minimum Effect of Interest (MEI) is the magnitude, or size, of the difference in results you want to detect. Smaller differences are harder to detect and require a larger sample size to retain the same power. While larger differences can be detected reliably with smaller sample sizes, improvements measured on small samples may be unreliable. Increase your MEI if you can’t increase your sample size.
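To see how quickly the required sample grows as the MEI shrinks, the sketch below tabulates sample sizes for several relative MEIs. It assumes a two-sided two-proportion z-test at a 5% level of significance and 80% power, with a hypothetical 5% baseline conversion rate. The function names are illustrative only.

```python
import math
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.975)  # two-sided 5% level of significance
Z_BETA = NormalDist().inv_cdf(0.80)    # 80% power


def visitors_needed(baseline_rate, relative_mei):
    """Approximate visitors per variation needed to detect the given relative MEI."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mei)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((Z_ALPHA + Z_BETA) ** 2 * variance / (p2 - p1) ** 2)


def mei_table(baseline_rate, relative_meis):
    """Pair each candidate MEI with the sample size it requires."""
    return [(mei, visitors_needed(baseline_rate, mei)) for mei in relative_meis]


if __name__ == "__main__":
    # Hypothetical 5% baseline conversion rate.
    for mei, n in mei_table(0.05, [0.05, 0.10, 0.20, 0.40]):
        print(f"MEI {mei:.0%}: {n} visitors per variation")
```

Under this approximation, halving the MEI roughly quadruples the required sample, because the sample size scales with the inverse square of the difference.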

Determine the level of significance for your test.

For example, if you test with a 95% confidence level, you have a 5% level of significance. Using a 5% level of significance means that you’re willing to accept a 5% probability of a false positive. Lowering your level of significance (that is, increasing your confidence level) decreases the probability of a false positive but increases the probability of a false negative, assuming all else is equal. This, in turn, reduces the power of your test.
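The trade-off can be illustrated by holding the sample size and effect fixed and varying only the level of significance. This is a sketch under a normal approximation for a two-sided two-proportion z-test. The conversion rates (5% versus 5.5%) and the sample size of 31,231 visitors per variation are hypothetical.

```python
import math
from statistics import NormalDist


def approximate_power(p1, p2, n_per_variation, alpha):
    """Power of a two-sided two-proportion z-test (normal approximation)."""
    standard_error = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_effect = abs(p2 - p1) / standard_error
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf(z_effect - z_critical)


if __name__ == "__main__":
    # Hypothetical test: 5% vs. 5.5% conversion, 31,231 visitors per variation.
    for alpha in (0.10, 0.05, 0.01):
        power = approximate_power(0.05, 0.055, 31_231, alpha)
        print(f"significance {alpha:.0%}: power {power:.1%}, "
              f"false-negative risk {1 - power:.1%}")
```

With everything else unchanged, each stricter level of significance lowers the computed power and raises the false-negative risk.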

Identify your desired power level based on your acceptable risk of false negatives.

A 20% acceptable risk of false negatives is a common starting point in conversion optimization, which corresponds to a power level of 80%.

If 20% is too risky, you can lower this probability to something like 10% or 5%, which would increase your statistical power to 90% or 95%, respectively. However, each increase in power requires a corresponding increase in sample size and the amount of time the test needs to run.

Therefore, your power level ultimately depends on:

  • How much risk you’re willing to take when it comes to missing a real improvement.
  • The minimum sample size necessary for each variation to achieve your desired power.
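To get a feel for the cost of higher power, the sketch below compares the sample size needed at 80%, 90%, and 95% power. It assumes a two-sided two-proportion z-test with a normal approximation. The default baseline rate (5%), relative MEI (10%), and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_power(power, baseline_rate=0.05, relative_mei=0.10, alpha=0.05):
    """Approximate visitors per variation needed to reach the given power."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mei)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    reference = sample_size_for_power(0.80)
    for power in (0.80, 0.90, 0.95):
        n = sample_size_for_power(power)
        print(f"power {power:.0%}: {n} visitors per variation "
              f"({n / reference:.2f}x the 80% sample)")
```

Under these assumptions, moving from 80% to 90% power needs roughly a third more visitors, and 95% power needs roughly two thirds more, which translates directly into a longer test.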

Enter your values into a sample size calculator or G*Power to calculate statistical power for your test.

If you know three of the inputs, you can calculate the fourth to find out what’s required to run an adequately powered test. For example, if you know the sample size for each variation, have a level of significance of 5%, and a desired power level of 80%, you could plug these values into G*Power to find the MEI you need to achieve that power, 19% in this case:
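The same kind of "solve for the fourth input" calculation can be sketched in Python. This is not G*Power's method and does not reproduce the figure quoted above. It uses a normal approximation for a two-sided two-proportion z-test and a simple bisection search. The baseline rate, sample size, and function names are hypothetical, and the search assumes a baseline rate of 50% or less.

```python
import math
from statistics import NormalDist


def approximate_power(baseline_rate, relative_lift, n_per_variation, alpha):
    """Power to detect a relative lift over the baseline (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    standard_error = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((p2 - p1) / standard_error - z_critical)


def smallest_detectable_lift(baseline_rate, n_per_variation, alpha=0.05, power=0.80):
    """Bisect for the smallest relative lift that reaches the desired power."""
    low, high = 0.0, 1.0  # search relative lifts between 0% and 100%
    if approximate_power(baseline_rate, high, n_per_variation, alpha) < power:
        raise ValueError("Even a 100% lift is not detectable at this sample size.")
    for _ in range(60):
        mid = (low + high) / 2
        if approximate_power(baseline_rate, mid, n_per_variation, alpha) < power:
            low = mid   # lift too small to reach the target power
        else:
            high = mid  # lift is detectable, try a smaller one
    return high


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline and 5,000 visitors per variation.
    lift = smallest_detectable_lift(0.05, 5000)
    print(f"Smallest detectable relative lift: {lift:.1%}")
```

If the resulting lift is larger than any improvement you realistically expect, the test is underpowered for your purposes and needs more traffic or a longer run.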