Analyze A/B test results with GA

Business Benefits

Go deeper with your analysis, find more wins, and get better data for making decisions.


Integrate your A/B testing tool with Google Analytics, using tools like Optimizely and VWO.

You should send data for each test to Google Analytics. This way you create a second source of data, which is reassuring because testing tools sometimes record data incorrectly.

Use Universal Analytics over Classic Google Analytics if it’s available, as it offers newer GA features, such as up to 20 concurrent A/B tests sending data to Google Analytics, compared to Classic’s limit of 5.

In Optimizely Classic, set up the integration under Project Settings and switch your GA tracker to Universal Analytics instead of Classic Google Analytics. In Optimizely X, go to Settings, navigate to Integrations, find Google Universal Analytics and turn it on; then activate the integration at the project level, and activate it for each experiment by picking a Custom Dimension slot.

Add only one experiment per slot or custom dimension, as multiple experiments will overwrite each other’s data.

Pick a slot for each test after the implementation is done. Check that no two tests use the same Custom Dimension (or, for Classic, the same Custom Variable slot) in GA.
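Under the hood, the Universal Analytics integration writes the variation name into the chosen Custom Dimension slot before the pageview is sent. Here is a minimal sketch of that mechanism, runnable outside the browser (`window` is simulated, and the slot number and test name are made up for illustration):

```javascript
// Simulate the browser environment so the GA command queue can be inspected.
const window = {};

// GA's standard stub: commands issued before analytics.js loads are queued.
window.ga = window.ga || function () {
  (window.ga.q = window.ga.q || []).push(arguments);
};
window.ga.l = +new Date();

// What the integration does for a test assigned to slot 1
// (slot number and variation name are hypothetical):
window.ga('set', 'dimension1', 'Checkout Test: Variation #1');
window.ga('send', 'pageview');

// Both commands are queued for analytics.js to replay on load.
console.log(window.ga.q.length); // 2
```

This is why two tests sharing one slot clash: the second `set` on the same dimension simply overwrites the first before the hit goes out.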

For Optimizely:

  • Take a look at the guide on their website for more detailed information on how to get the integration running, including creating the Custom Dimensions in Google Analytics.

For VWO:

  • When integrating Google Analytics with VWO experiments, pick the right Custom Dimension in the Others tab of Experiment Settings for every experiment.
  • Use one active experiment per Custom Dimension, just as with Optimizely, to avoid overwriting test data stored in Google Analytics.
  • Check the articles in VWO’s knowledge base on their website for more information on the integration.

Wait until you have statistically significant numbers to work with.

Don’t start the analysis in GA before the data is cooked: make sure you have the needed sample size, significance level, and statistical power.

Go to Google Analytics to look at your test results; use a custom report to dig into the metrics you need.

Use whatever metrics are useful in your particular case. Swipe the custom report used in the example here. For example, look at average cart value or average quantity to find out why some variations bring in more revenue per user.

Make the report show any data you want, then pull that data into an Excel/Google spreadsheet to calculate p-values, power levels, error margins, and so on.
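The spreadsheet step above can be sketched in code as well. Here is a minimal two-tailed, two-proportion z-test on conversion counts exported from GA; the conversion numbers below are invented for illustration:

```javascript
// erf approximation (Abramowitz & Stegun 7.1.26), accurate to ~1.5e-7.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const y = 1 - ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return sign * y;
}

// Standard normal CDF via erf.
const normCdf = (x) => (1 + erf(x / Math.SQRT2)) / 2;

// Two-proportion z-test: conversions and sample sizes for A and B.
function twoProportionZTest(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  const z = (pB - pA) / se;
  const p = 2 * (1 - normCdf(Math.abs(z))); // two-tailed p-value
  return { z, p };
}

// Example: 200/10000 conversions for A vs. 250/10000 for B.
const result = twoProportionZTest(200, 10000, 250, 10000);
console.log(result.z.toFixed(2), result.p.toFixed(4));
```

For this example the p-value comes out under 0.05, but remember that significance alone isn’t enough; you still need to have reached your planned sample size and power.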


To use advanced segments, send an event to Google Analytics each time a variation is loaded.

Add one line to the test’s Global JavaScript, plus a line of event tracking code as the last line for each test variation, to use advanced segments (audiences).

Add this code to your testing tool’s Global Experiment JavaScript field: window.ga=window.ga||function(){(window.ga.q=window.ga.q||[]).push(arguments);};window.ga.l=+new Date();

Create segments in Google Analytics for each of the variations after adding the code in your testing tool, and apply them onto any report that you want.

In Optimizely:

  1. Open Settings while editing a test and choose Experiment JavaScript, then add the following in the Global Experiment JavaScript box:
    window.ga=window.ga||function(){(window.ga.q=window.ga.q||[]).push(arguments);};window.ga.l=+new Date();
  2. Add a line of event tracking code at the end of each variation, including Original, by changing the Experiment ID number and the name of the variation:
    window.ga('send', 'event', 'Optimizely', 'exp-2207684569', 'Variation1', {'nonInteraction': 1});
  3. The code sends an event to GA where the event category is Optimizely, the action is the Experiment ID (you can get that from the URL while editing a test), and the label is Variation1 (it can also be Original, Variation2, and so on). Setting nonInteraction means the event doesn’t count as engagement; without it, the bounce rate for experiment pages would be 0%.
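The stub and the per-variation event line above can be sandboxed outside the browser to see exactly what gets queued. In this sketch `window` is simulated so the snippet runs in Node; the Experiment ID is the one from the example above:

```javascript
// Simulate the browser so the GA command queue is inspectable.
const window = {};

// Step 1: the stub from the Global Experiment JavaScript.
window.ga = window.ga || function () {
  (window.ga.q = window.ga.q || []).push(arguments);
};
window.ga.l = +new Date();

// Step 2: the per-variation event line.
window.ga('send', 'event', 'Optimizely', 'exp-2207684569', 'Variation1',
  { nonInteraction: 1 });

// The queued command is what analytics.js replays once it loads:
const queued = Array.from(window.ga.q[0]);
console.log(queued[2], queued[4]); // Optimizely Variation1
```

The stub is safe to add even if analytics.js is already on the page: the `window.ga || …` guard means it never replaces a loaded tracker.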

Compare the results for each variation: look for potential discrepancies as well as checking basic performance metrics.

Check for data consistency. For example, compare thank you page visits and revenue numbers between your Optimizely result panel, and GA custom dimension or event based report.

The built-in Google Analytics integration is not foolproof. Sometimes the data is not passed on; there can be a 20% to 50% discrepancy where part of the data gets lost along the way. There are numerous possible reasons, from how and in what order the scripts are loaded to script timeouts and other issues; hence the workarounds above.
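A quick way to quantify that gap per variation is to compute the percent discrepancy between the two sources; the counts below are invented for illustration:

```javascript
// Percent discrepancy between the testing tool's count and GA's count
// for the same metric (e.g. thank-you-page visits for one variation).
function discrepancyPct(toolCount, gaCount) {
  return 100 * Math.abs(toolCount - gaCount) / toolCount;
}

// e.g. Optimizely logged 1000 thank-you-page visits, GA only 780:
console.log(discrepancyPct(1000, 780)); // 22
```

A small single-digit gap is normal; once the discrepancy creeps toward the 20–50% range described above, treat the GA-side numbers with suspicion and debug the integration before drawing conclusions.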

If results are inconclusive, repeat the test a few more times or dig down into specific segments, like mobile vs desktop environments.

Try a few more iterations if there’s no difference between test variations: your test hypothesis might have been right, but the implementation poor. For example, your qualitative research says that concern about security is an issue. There are countless ways to strengthen the perception of security, so you might be onto something; it’s just that the way you executed it was wrong.

Analyze your test across key segments, like desktop vs. tablet/mobile and new vs. returning visitors (making sure each segment has an adequate sample size), and consider a personalized approach for segments where the treatment performed well. For example, if you got a lift among returning visitors and mobile visitors but a drop for new visitors and desktop users, those segments might cancel each other out, making it look like there’s “no difference”. This is where checking key segments pays off.
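The cancellation effect is easy to see with numbers. This sketch uses invented conversion counts where variation B wins in one segment and loses in the other by the same amount, so the overall rates come out identical:

```javascript
// Invented numbers: B lifts returning visitors but hurts new visitors.
const segments = [
  { name: 'returning', nA: 5000, convA: 150, nB: 5000, convB: 200 },
  { name: 'new',       nA: 5000, convA: 200, nB: 5000, convB: 150 },
];

const rate = (conv, n) => conv / n;

// Per-segment relative lift of B over A.
for (const s of segments) {
  const lift = (rate(s.convB, s.nB) - rate(s.convA, s.nA)) / rate(s.convA, s.nA);
  console.log(s.name, (lift * 100).toFixed(1) + '%');
}

// Overall, the segments cancel out exactly: 350/10000 on both sides.
const overallA = segments.reduce((sum, s) => sum + s.convA, 0) /
                 segments.reduce((sum, s) => sum + s.nA, 0);
const overallB = segments.reduce((sum, s) => sum + s.convB, 0) /
                 segments.reduce((sum, s) => sum + s.nB, 0);
console.log(overallA === overallB); // true
```

An aggregate report would call this test flat, while the segment view shows two opposite, actionable results.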

If your test says there’s no significant difference between variations but you like B better, there’s really no reason not to go with B. If B is a usability improvement or represents your brand image better, go for it. But those are not good reasons to choose B if it performs worse in a test.
