Detect and discard experiment data with serious deficiencies in implementation, tracking, or data retrieval.
The p-value threshold is the value below which you would treat the statistical test as sufficient evidence that the experiment has an issue. If you observe a p-value lower than the threshold, the experiment data is considered unfit for use. Values between 0.001 and 0.01 are common choices.
Extract the observed sample size of each test group as you usually would before computing statistical estimates such as p-values or confidence intervals. Make sure that the extraction involves no sampling and that the counts are exact rather than statistically estimated. Approximate cardinality estimates are an often-overlooked issue.
The target allocation is usually an equal split. For example, 0.50/0.50 for an A/B test, or 0.333/0.333/0.333 (one third each) for an A/B/C test.
In rare cases, you might have chosen an unequal allocation such as 0.80/0.20. If the allocation is expressed as percentages such as 50%/50%, convert them to proportions first.
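The conversion from percentages to proportions can be sketched in R as below. The variable names and the 80/20 and three-way splits are illustrative assumptions. Note that `chisq.test()` expects the probabilities to sum to 1 unless `rescale.p = TRUE` is set, so a rounded value such as 0.333 is better written as `1/3`.

```r
# Illustrative target allocation expressed as percentages (assumed values)
percentages <- c(80, 20)

# Convert to proportions that sum to 1
proportions <- percentages / sum(percentages)
print(proportions)

# For an equal three-way split, use exact thirds rather than 0.333
abc_proportions <- rep(1 / 3, 3)

# Sanity check: both allocations sum to 1 (within floating-point tolerance)
stopifnot(isTRUE(all.equal(sum(proportions), 1)))
stopifnot(isTRUE(all.equal(sum(abc_proportions), 1)))
```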
Use a tool like GIGA Calculator’s Sample Ratio Mismatch Calculator to perform a chi-square goodness-of-fit test with sample sizes in the first column and allocation ratios in the second.
The Sample Ratio Mismatch Calculator has a versatile interface and does not require allocation ratios to be entered if the target split is equal.
Alternatively, perform the test in R using this code, replacing the sample size and proportion values with your own:
sampleSizes <- c(10000, 10050)
proportions <- c(0.5, 0.5)
chisq.test(sampleSizes, p = proportions)
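As a rough check on what the test computes, the goodness-of-fit statistic for the example values of 10000 and 10050 with an equal target split can be worked out by hand. The symbols $O_i$ (observed count), $E_i$ (expected count), $N$ (total sample size), and $p_i$ (target proportion) are notation introduced here for illustration.

```latex
\begin{align*}
N &= 10000 + 10050 = 20050, \qquad E_i = N p_i = 20050 \times 0.5 = 10025 \\
\chi^2 &= \sum_i \frac{(O_i - E_i)^2}{E_i}
        = \frac{(10000 - 10025)^2}{10025} + \frac{(10050 - 10025)^2}{10025}
        = \frac{1250}{10025} \approx 0.125
\end{align*}
```

With two groups there is one degree of freedom, and a statistic of about 0.125 corresponds to a p-value of roughly 0.72, far above any of the commonly used thresholds.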
Compare the p-value from the chi-square test to the threshold you chose earlier and either discard, replace, or use the data.
- If the p-value is smaller than the threshold, treat the test as if something went wrong in its implementation, tracking, or data extraction. In some cases, you can recover valid test data and use that instead, but this is usually only possible if the sample ratio mismatch is due to improper data extraction. Otherwise, you usually need to identify the issue, eliminate it, and then restart the test from scratch.
- If the p-value is greater than or equal to the threshold, the check has found no evidence of a sample ratio mismatch, and the main statistical analysis may proceed.
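The decision rule above could be wrapped in a small helper, sketched below under stated assumptions. `check_srm` is a hypothetical function name, not part of any package. The default threshold of 0.01 and the sample sizes are illustrative values only.

```r
# Hypothetical helper (not from any package): runs the chi-square
# goodness-of-fit test and compares its p-value to the chosen threshold.
check_srm <- function(observed, expected_props, threshold = 0.01) {
  result <- chisq.test(observed, p = expected_props)
  list(
    p_value = result$p.value,
    # TRUE means the data should be treated as unfit for use
    srm_detected = result$p.value < threshold
  )
}

# Illustrative inputs: a small imbalance and a larger one (assumed values)
small_imbalance <- check_srm(c(10000, 10050), c(0.5, 0.5))
large_imbalance <- check_srm(c(10000, 10500), c(0.5, 0.5))

print(small_imbalance$srm_detected)  # FALSE: no evidence of a mismatch
print(large_imbalance$srm_detected)  # TRUE: investigate before analyzing
```

Returning both the p-value and the decision keeps the raw evidence available, which can help when the same data is later reviewed against a different threshold.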