Diagnosing Statistical Significance Issues

Does this sound familiar? Your team spends months coming up with the perfect test idea, weeks developing the test, and days getting multiple sign-offs from QA, upper management, and the brand team. When you finally launch, you get one of two scenarios:

Scenario 1: After three days, the test reaches statistical significance, everyone jumps for joy, and you stop the test. A few months later, you notice the change has produced zero visible lift in your bottom line.

Scenario 2: The test runs for ten weeks with a somewhat consistent trend but the results remain inconclusive. Frustrated, you stop the test and make changes based on an insignificant data set. Months on, you see no visible lift.

Cue the deep sigh. Here is how we handle these issues.

Diagnose the Problem

When we encounter one of these two scenarios, we evaluate the results to determine which of two culprits is to blame: a false result or an underpowered test.

For the most part, each is caused by something we can control, adjust, or prevent. First up, false results: these typically stem from a test that hasn't run long enough to collect sufficient data, account for outliers, or let the results normalize.

Second, we have underpowered tests, which are caused by a combination of low traffic and a low percentage change between the variation and the original.

Having low traffic, heavily segmenting your visitors for the test, testing minor changes (think button color), or testing low-traffic areas of your site can all cause an underpowered test.
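To put rough numbers on that relationship, here's a quick sketch in Python using statsmodels; the 2% baseline conversion rate and the candidate lifts are hypothetical, not pulled from any real test.

```python
# A rough sketch of how the required traffic grows as the expected lift shrinks.
# The 2% baseline conversion rate and the candidate lifts are hypothetical examples.
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.02                       # hypothetical 2% baseline conversion rate
power_analysis = NormalIndPower()

for relative_lift in (0.25, 0.10, 0.05):          # 25%, 10%, and 5% relative improvements
    variation = baseline * (1 + relative_lift)
    effect = proportion_effectsize(variation, baseline)
    visitors_needed = power_analysis.solve_power(effect_size=effect, alpha=0.05,
                                                 power=0.8, alternative='two-sided')
    print(f"{relative_lift:.0%} lift needs ~{math.ceil(visitors_needed):,} visitors per variation")
```

Halving the expected lift roughly quadruples the traffic each variation needs, which is why subtle changes and heavily segmented audiences so often end up underpowered.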

Need examples? We’ve got ‘em. Below we’ll show you a false result and an underpowered test.

Understand the Problem

Scenario 1: False Results

A test that quickly reaches statistical significance should be met with caution, not celebration. Because the test has had so little time to collect data, account for outliers, or let the results normalize, a false winner is highly probable. You wouldn't turn off a movie just because the bad guy is vanquished halfway through; movie-watching experience tells you to keep watching because that guy is coming back before the final scene. Similarly, train yourself to keep watching a test when a win shows up suspiciously early.
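If you need to convince yourself (or a skeptical manager) of this, here's a small simulation sketch of an A/A test, one with no real difference between the arms, checked for significance once a day; the daily traffic, conversion rate, and run length are all hypothetical.

```python
# A small simulation of "peeking": an A/A test with no real difference between the
# arms, checked for significance every day and stopped at the first apparent win.
# The daily traffic, conversion rate, run length, and trial count are hypothetical.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
daily_visitors, true_rate, days, trials = 1000, 0.02, 14, 2000
false_winners = 0

for _ in range(trials):
    conversions, visitors = np.zeros(2), np.zeros(2)
    for _day in range(days):
        conversions += rng.binomial(daily_visitors, true_rate, size=2)  # same true rate in both arms
        visitors += daily_visitors
        _, p_value = proportions_ztest(count=conversions, nobs=visitors)
        if p_value < 0.05:            # stop early and declare a "winner"
            false_winners += 1
            break

print(f"Stopped on a false winner in {false_winners / trials:.0%} of simulated A/A tests")
```

Even though both arms are identical, stopping at the first significant peek crowns a "winner" several times more often than the 5% error rate the significance threshold is supposed to guarantee.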

The graph below depicts a test that reached statistical significance within twenty-four hours of deployment. Immediately afterward, the result dropped back below significance, became significant again two days later, held there for two more days, and then fell to nearly insignificant levels.

The green arrows show each time the variation became statistically significant, while the orange arrow shows the variation descending toward the significance threshold.

This up-and-down behavior is directly related to an artificially high level of lift, which produced a false positive. By the time the test reached statistical significance the first time, fifteen thousand visitors had been exposed to it, yet only two hundred total conversions had been recorded. The difference in conversion rate between the original and the variation was large enough to trigger significance, but, due to the short run time, the data had not yet normalized.
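To make that fragility concrete, here's a quick sketch built on those rough totals; the 50/50 traffic split and the per-variation conversion counts are assumptions, since only the overall numbers are given.

```python
# How fragile significance is with roughly 15,000 visitors but only ~200 conversions.
# The even traffic split and the per-variation conversion counts are assumptions;
# only the overall totals are mentioned above.
from statsmodels.stats.proportion import proportions_ztest

visitors = [7500, 7500]                           # assumed 50/50 split of the 15,000 visitors

for variation_conversions in (115, 110, 105):     # shift a handful of conversions around
    conversions = [85, variation_conversions]     # original arm held at 85 conversions
    _, p_value = proportions_ztest(count=conversions, nobs=visitors)
    print(f"original 85/7500 vs variation {variation_conversions}/7500 -> p = {p_value:.3f}")
```

Shifting just five or ten conversions from one arm to the other moves the p-value from below 0.05 to well above it, exactly the kind of instability the graph shows.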

Our Solution

This is a great example of the normalization process and of why a test must be given enough time to collect a sufficient amount of data.

Even tests with plenty of traffic should run for a minimum of two weeks, and sites with lower traffic should extend that minimum so the data can fully normalize. Additionally, run your tests over complete purchase cycles to avoid skewed data caused by stopping the test on a particularly high- or low-converting day of the week.
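As a back-of-the-envelope way to set that runtime, the sketch below turns a required sample size (from a power calculation like the earlier one) into a whole number of weeks; the daily traffic and sample-size figures are hypothetical placeholders.

```python
# A rough sketch of turning a required sample size into a minimum run time,
# rounded up to whole weeks so the test always spans complete purchase cycles.
# The daily traffic and required sample size are hypothetical placeholders.
import math

daily_visitors = 2000              # hypothetical visitors entering the test per day
required_per_variation = 25000     # hypothetical, e.g. from a power calculation
variations = 2                     # the original plus one challenger

days_needed = math.ceil(required_per_variation * variations / daily_visitors)
weeks_needed = max(2, math.ceil(days_needed / 7))    # never below the two-week minimum
print(f"Run for at least {weeks_needed} weeks ({days_needed} days of traffic)")
```

Rounding up to full weeks keeps every day of the week equally represented, so one unusually strong or weak day can't decide the outcome.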

Scenario 2: Underpowered Test

When a test runs for multiple months with a consistent trend but fails to reach statistical significance, this indicates an underpowered test.

The graph below shows a test that ran for four months and failed to reach statistical significance. With this client, it was apparent that the amount of traffic to the site was too low to power subtle tests. Minor changes would nearly always produce insignificant results and require a lengthy run time.
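One way to spot this problem before launch is to flip the question: given the traffic the site can realistically collect, what's the smallest lift the test could ever detect? The sketch below uses the standard normal-approximation formula; the visitor count and baseline conversion rate are hypothetical stand-ins, not this client's actual numbers.

```python
# Flip the question: given the traffic a low-volume site can realistically collect,
# what is the smallest lift a test could reliably detect? The visitor count and
# baseline conversion rate are hypothetical stand-ins for the client's real numbers.
import math

from scipy.stats import norm

visitors_per_variation = 12000     # hypothetical: about four months of traffic, split two ways
baseline = 0.02                    # hypothetical 2% baseline conversion rate
alpha, power = 0.05, 0.80

# With two equal groups, reaching the target power requires Cohen's h * sqrt(n / 2)
# to be at least z_(1 - alpha/2) + z_(power).
h = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * math.sqrt(2 / visitors_per_variation)

# Invert the arcsine transform h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) to get a rate.
detectable_rate = math.sin(math.asin(math.sqrt(baseline)) + h / 2) ** 2
print(f"Smallest reliably detectable lift: about {detectable_rate / baseline - 1:.0%}")
```

Any change expected to move the needle by less than the printed lift is almost guaranteed to end in the kind of long, inconclusive run shown in the graph.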

Our Solution

Although the commonly proposed solution for this type of test is "Keep waiting," few people have managers that patient. Even those of us with the most unflappable superiors know that we can't run a test for a year and have nothing to show for it but a measly increase.

We know that an underpowered test is caused by a change too small to produce a distinguishable effect, by low traffic, or by a combination of the two. Solving for those causes is hardly a challenge for a smart tester, especially since we're about to provide some suggestions for limiting your chances of an inconclusive test. At the very least, you'll prevent future issues; at best, you could find a way to tweak your test into a winner.

If you have a low level of lift, try…

If you have low traffic, try…

Last Tips Before You Test

It’s a lot less costly to test the right thing the first time around. As our last bit of advice, here’s a checklist of questions to ask yourself before you launch a test:

Do you want more content like this delivered directly to your inbox on a monthly basis? Subscribe to our newsletter, Blue Acorn iCi Monthly Digital Digest!