Estimate the required sample size for an experiment on the difference in conversion rates

Business Benefits

Design A/B tests based on binomial metrics like conversion rate per user, subscribe rate per user, and others.


Choose a reference period that includes recent data and is long enough to capture the metric’s typical variability. This period will be used to estimate the baseline of the primary test metric.

A reference period that’s too short or too long will capture information with little predictive capacity, which might distort the accuracy of the estimation. 2-8 weeks of data is usually sufficient. Avoid choosing periods with major shifts in the key metric of interest.

Use historical data from the chosen reference period as a baseline to predict future performance.

The simplest method is to predict that the metric will stay at the level observed in the reference period. If the metric exhibits strong seasonality, use a more complex time series method like ARIMA forecasting to get a more accurate estimate. Minor seasonal differences won’t have much of an effect on the final sample size estimate, so this can be skipped if seasonality is minimal.
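The simplest method described above can be sketched in a few lines of Python. The weekly figures below are hypothetical placeholders, not real data, and the function name is an illustrative choice; the sketch pools the reference weeks into one baseline rate and reports the weekly spread so that major shifts are easy to spot.

```python
# Hypothetical reference period: (users, conversions) per week.
reference_weeks = [
    (19_800, 792),
    (20_400, 836),
    (20_100, 804),
    (19_700, 768),
]

def baseline_from_reference(weeks):
    """Return (pooled conversion rate, average users per week, weekly rates)."""
    total_users = sum(users for users, _ in weeks)
    total_conversions = sum(conversions for _, conversions in weeks)
    weekly_rates = [conversions / users for users, conversions in weeks]
    return total_conversions / total_users, total_users / len(weeks), weekly_rates

baseline_rate, avg_weekly_users, weekly_rates = baseline_from_reference(reference_weeks)

print(f"Baseline conversion rate: {baseline_rate:.2%}")
print(f"Average users per week:   {avg_weekly_users:,.0f}")
print(f"Weekly rate range:        {min(weekly_rates):.2%} to {max(weekly_rates):.2%}")
```

A wide weekly range relative to the effect you hope to detect is a hint that the period contains a shift in the metric, or that a forecasting method may be worth the effort.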

Use a tool like the Analytics Toolkit A/B Test Calculator to choose a significance threshold for the test.

The significance threshold is the p-value threshold that determines how you act after the test has completed. If the observed p-value falls below the threshold, you would act as if the variant is better than control and implement it. Otherwise, you would stick with the current experience. A few helpful guidelines:

  • The more expensive it would be to make the wrong choice, the lower the threshold should be.
  • The more difficult it would be to reverse the decision, the lower the threshold should be.
  • The larger the pushback against the proposed change, the lower the threshold should be.

To choose the minimum effect of interest, answer the question: “What is the minimum effect we’d be happy to detect as statistically significant?”

The minimum effect of interest is the smallest true difference that you wouldn’t want to miss. It is related to the false negative rate of the testing procedure and only plays into post-test analysis if the test result is not statistically significant. It also helps to think in terms of the risk and reward trade-off: the fixed and variable risks and rewards of running the test, and of making a decision within a given timeframe, feed back into the choice of minimum effect.

Use your expertise and experience to judge what minimum effect of interest can be realistically expected. A small change is unlikely to have a large impact, so setting the minimum effect of interest unrealistically high might doom the experiment to being a false negative if the actual effect is much lower than the minimum effect of interest.

Adjust the minimum effect of interest if the first few estimates turn out to be prohibitive, given the time it would take to run the test. For example, a 0.5% effect might sound exciting for a given test, but it might take a year of testing to detect reliably. So, you could compromise by choosing a minimum effect of 2% and running the test for 3-4 months instead.
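To see how strongly the minimum effect of interest drives the required sample size, here is a rough sketch. It assumes a one-sided two-proportion z-test under the normal approximation with unpooled variance, a hypothetical 5% baseline conversion rate, a 0.05 significance threshold, and 80% power; none of these values come from the guide, and a dedicated calculator may use a different model.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline, rel_effect, alpha=0.05, power=0.80):
    """Approximate users per arm for a one-sided two-proportion z-test.

    Normal approximation with unpooled variance; rel_effect is the
    relative lift over the baseline (0.10 means +10%).
    """
    p1 = baseline
    p2 = baseline * (1 + rel_effect)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

BASELINE = 0.05  # hypothetical baseline conversion rate

for rel_effect in (0.02, 0.05, 0.10, 0.20):
    n = users_per_arm(BASELINE, rel_effect)
    print(f"{rel_effect:.0%} relative lift: {n:,} users per arm")
```

Under this approximation the required sample size grows roughly with the inverse square of the minimum effect, so halving the effect of interest roughly quadruples the number of users needed.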

Use a sample size calculator that fits your statistical model, like the Power & Sample Size Calculator or the Analytics Toolkit Statistical Calculator, to calculate the required sample size and estimated duration.

Things to be aware of:

  • Proper support for more than one variant versus control.
  • Whether it computes sample size for relative difference or just absolute difference*.
  • The computation should be for a one-sided (one-tailed) p-value, unless the particular test actually calls for a two-sided alternative.
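The effect of the last two checks can be illustrated with a small sketch. It assumes a single variant versus control, the normal approximation with unpooled variance, and hypothetical inputs (4% baseline, 10% relative lift, 0.05 threshold, 80% power). The function works on the absolute difference only, so the 5% upward adjustment applied at the end stands in for a calculator that natively supports relative difference. It does not cover the correction needed for more than one variant.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_control, abs_diff, alpha=0.05, power=0.80, two_sided=False):
    """Approximate users per arm to detect an absolute difference in rates."""
    tail = alpha / 2 if two_sided else alpha  # two-sided splits alpha across both tails
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_power = NormalDist().inv_cdf(power)
    p_variant = p_control + abs_diff
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_power) ** 2 * variance / abs_diff ** 2)

p_control = 0.04                       # hypothetical baseline conversion rate
relative_lift = 0.10                   # hypothetical minimum effect of interest
abs_diff = p_control * relative_lift   # 0.4 percentage points

one_sided = sample_size_per_arm(p_control, abs_diff)
two_sided = sample_size_per_arm(p_control, abs_diff, two_sided=True)
adjusted = ceil(one_sided * 1.05)      # conservative 5% upward adjustment

print(f"One-sided, absolute difference: {one_sided:,} per arm")
print(f"Two-sided, absolute difference: {two_sided:,} per arm")
print(f"One-sided with 5% adjustment:   {adjusted:,} per arm")
```

With these inputs the two-sided computation asks for noticeably more users than the one-sided one, which is why it matters that the calculator matches the alternative hypothesis the test actually calls for.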

Estimate the expected test duration based on your sample size. For example, if the part of the website the test would affect sees 20,000 users per week, a sample size requirement of 120,000 users total would mean the test should last six weeks. Always round up the number of weeks or days, in case the expected number of users per week was overestimated.
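The duration arithmetic is simple enough to script. This minimal sketch uses the figures from the example above plus one hypothetical case that does not divide evenly; the function name is an illustrative choice.

```python
from math import ceil

def weeks_needed(required_users, users_per_week):
    """Whole weeks needed to reach the required total sample size, rounded up."""
    return ceil(required_users / users_per_week)

print(weeks_needed(120_000, 20_000))  # 6
print(weeks_needed(130_000, 20_000))  # 7 (6.5 weeks rounded up)
```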

* A calculator that computes sample size for absolute difference only could still be used by adding a conservative upward adjustment of 5% to the estimated sample size.