Or it is, until you think about it (which I wouldn’t advise).
A/B testing can be a rabbit hole from which some never re-emerge (we call these people statisticians, or even, if things get really bad, data scientists).
This article provides some pointers to help you navigate the ins and outs of A/B testing, and perhaps, escape unscathed.
Plan
When A/B testing, the plan is everything. Once you set the plan in motion, you must stick to the plan. You might be tempted to deviate, analyse results early, or tweak the experiment and move the goalposts. That way lies madness.
As the plan is everything, it needs to be clear. There is an art to a well-defined hypothesis, and a single, solid metric on which that hypothesis hinges is essential. Generally speaking, it makes most sense for marketers to use revenue, and if you can’t use revenue, use the next closest thing, such as sign-ups.
Depending on how you are testing, you will also need to consider setting the sample size in advance (use this handy tool). You should always plan to guard against analysing results too early. Most importantly, do not make this more complicated than it already is. Make a plan, make it a good one, and stick to it.
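If you would rather sanity-check the numbers such a calculator gives you, the sketch below runs the same kind of power calculation in Python using the statsmodels library. The baseline conversion rate of 3% and the target uplift to 3.6% are made-up figures for illustration, paired with the conventional 5% significance level and 80% power.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative numbers: baseline conversion of 3%, and we only care about
# detecting an uplift to at least 3.6% (a 20% relative improvement).
effect = proportion_effectsize(0.036, 0.030)

# Visitors needed in EACH arm for a two-sided test at 5% significance, 80% power.
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Roughly {n_per_arm:,.0f} visitors per variant")
```

Halve the uplift you want to detect and you will need roughly four times the traffic, which is exactly why pinning down the minimum effect you care about belongs in the plan.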
Traffic isn’t uniform
The multitude of things that can trip you up when A/B testing is enough to induce a healthy degree of paranoia. For example, you will want to think about the web traffic you are testing on. Where is it coming from and how might this affect your test?
If your sample is dominated by traffic from a PPC ad, you will get different users, with different intentions and behaviour, than if the sample is dominated by traffic from users sharing a specific article, particularly if the ad and the article are about different parts of your business.
During your test, as long as you split traffic randomly between A and B, you might not need to worry too much about uneven traffic. However, remember why you are testing. You want to know which variant will make more money in the immediate future. If your traffic in the immediate future is of a wildly different character to the traffic you tested on, then your test data won’t be telling you anything useful.
The traffic source is not the only thing that might differ between a test period and the period of implementation. Think about seasonality. If your business is affected by changing seasons, you will need to consider that your users’ intentions will differ over time. The A/B test you ran in the run-up to Christmas might not tell you much about what to do in the summer holiday season, or vice versa.
Time presents other difficulties too. User behaviour in your variants can be affected by the simple novelty of a change. Regular users might be put off by an unfamiliar new feature. However, after a month or so they might become accustomed to it and prefer it to its predecessor. Conversely, users might be naturally curious about a new feature they notice, giving the variant an artificial bump in performance that will soon wear off.
There’s also another way that traffic isn’t uniform. Not all your users are humans. The internet is beset by bots, crawling around (usually benignly) on behalf of search engines and so on. While it is likely to be a low priority, there will be some scenarios where you might need to worry about bots skewing the data in your A/B test.
[Image: delegates discussing testing at an Amigo breakfast event]
Lies, damn lies
The simplicity of A/B testing can also quickly give way to some fairly advanced statistical concepts. You may find yourself torn between one- and two-tailed tests or, and buckle up for this one, Frequentist versus Bayesian inference.
You will at least have to get your head around the fundamentals of probability. For example, a p-value below 0.05 doesn’t mean you are 95% sure that B beats A. P-values can be slippery even for the professionals. The p-value is the chance that you would record results at least as extreme as the ones you have recorded if A and B were in fact the same.
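To make that definition concrete, here is a minimal simulation with made-up conversion counts. It builds the world in which A and B really are identical and counts how often chance alone produces a gap at least as large as the one observed; that fraction is the p-value.

```python
import numpy as np

# Made-up results: 120 conversions from 4,000 visitors on A, 151 from 4,000 on B.
conv_a, n_a = 120, 4000
conv_b, n_b = 151, 4000
observed_gap = conv_b / n_b - conv_a / n_a

# Pretend A and B are genuinely identical (one pooled conversion rate) and see
# how often random chance alone produces a gap at least as large as the real one.
pooled_rate = (conv_a + conv_b) / (n_a + n_b)
rng = np.random.default_rng(42)
simulated_gaps = (
    rng.binomial(n_b, pooled_rate, 100_000) / n_b
    - rng.binomial(n_a, pooled_rate, 100_000) / n_a
)
p_value = np.mean(np.abs(simulated_gaps) >= abs(observed_gap))  # two-tailed
print(f"p-value ≈ {p_value:.3f}")
```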
You might get results that are easier to understand and communicate by using Bayesian rather than Frequentist testing. Bayesian testing comes with the blessing of avoiding p-values. However, you will still have to deal with some uncomfortable concepts, such as expected loss.
Besides the fact that no-one wants to be talking to their boss or their clients about an ‘expected loss,’ this concept isn’t a simple measure of the likelihood that A is better than B either. Expected loss is the amount you would expect to lose if you rolled out the winning variant but were wrong to do so.
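For the curious, both the probability that B beats A and the expected loss fall out of a few lines of Monte Carlo over Beta posteriors. This is a rough sketch reusing the made-up counts from the p-value example, with flat priors; a proper Bayesian testing tool will do this more carefully, but the idea is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same made-up results as before, with flat Beta(1, 1) priors on each conversion rate.
post_a = rng.beta(1 + 120, 1 + 4000 - 120, 200_000)
post_b = rng.beta(1 + 151, 1 + 4000 - 151, 200_000)

prob_b_beats_a = np.mean(post_b > post_a)

# Expected loss of shipping B: the conversion rate you give up, averaged over
# the scenarios where A was actually the better variant.
expected_loss_b = np.mean(np.maximum(post_a - post_b, 0))

print(f"P(B beats A) ≈ {prob_b_beats_a:.1%}")
print(f"Expected loss if you ship B ≈ {expected_loss_b:.4f} (in conversion-rate points)")
```

In practice, many Bayesian approaches stop the test once this expected loss drops below a threshold you have decided you can live with, which is a more intuitive stopping rule than it first sounds.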
If all this is too technical, here are a couple of practical steps you can take to improve the statistical rigour of your A/B tests.
- Run an A/A test. An A/A test splits traffic between two identical versions, so neither side should win. It’s a simple way to check the robustness of your testing framework: if an A/A test declares a winner, you probably have a problem.
- Control for outliers. Even A/B tests with large sample sizes can be swung by outlier data. That one corporate client who makes a massive transaction doesn’t come along very often, so their orders probably won’t be split evenly between your control and your variant. This gives one of A or B an unfair edge, and you need to account for it in your analysis (the sketch after this list illustrates both checks).
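Both checks are easy to rehearse on made-up data before you rely on them. The sketch below simulates repeated A/A splits (the ‘win’ rate should hover around the significance level) and then caps, or winsorises, revenue at the 99th percentile so one whale order can’t decide a test on its own; the numbers and the threshold are illustrative rather than recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# --- A/A check -------------------------------------------------------------
# Split identical (made-up) revenue-per-visitor data in half over and over.
# At a 5% significance level, chance alone should hand a "win" to one side in
# roughly 5% of splits; a much higher rate suggests the testing setup is broken.
revenue = rng.exponential(scale=20.0, size=20_000)
false_positives = 0
for _ in range(1_000):
    shuffled = rng.permutation(revenue)
    half_a, half_b = shuffled[:10_000], shuffled[10_000:]
    if stats.ttest_ind(half_a, half_b).pvalue < 0.05:
        false_positives += 1
print(f"A/A 'wins': {false_positives / 1_000:.1%} (expect roughly 5%)")

# --- Outlier control --------------------------------------------------------
# One whale transaction lands in the data; cap (winsorise) revenue at the 99th
# percentile so a single order can't decide the test on its own.
revenue[0] = 50_000.0
cap = np.percentile(revenue, 99)
capped = np.minimum(revenue, cap)
print(f"Mean revenue before capping: {revenue.mean():.2f}, after: {capped.mean():.2f}")
```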
Limit your assumptions
Statistical analysis isn’t too different to any other logical process of evaluating evidence. It just supports complex ways of evaluating large bodies of evidence. You should still be able to summarise in plain English (or any other vernacular language) what you are trying to find out and how you are trying to do it.
This means you need to think logically and critically about what you are actually doing and the best way to start is by thinking about your assumptions. You will need to make some assumptions, but you want to be aware of what they are and why you are making them.
You will want to avoid making assumptions about user journeys. Marketers are accustomed to the metaphor of a funnel. If you increase the size at the top, you ought to see a proportionate increase at the bottom. However, this is not true of all user journeys.
Clicks through the stages of your user journey (sometimes called micro-conversions) may not necessarily carry through to the conversions you actually care about. You might boost your click-through rate 100X by advertising “free bitcoin, no questions asked.” However, at some point your conversion rate will tank as you are forced to admit to customers that their blockchain bounty was a lie.
Another dangerous assumption you might be tempted to make is causation. You may (especially if you’ve ever had the misfortune to get into an argument on social media) be familiar with the logical fallacy known as “confusing correlation with causation.” Just because two things seem related, it doesn’t mean they are (see these excellent spurious correlations).
This is an especially dangerous assumption when A/B testing. On the one hand, the A/B test (when done right) should guard against this fallacy by design. If the only difference between A and B really is the variant you are testing, then you should be fine to assume some causal relationship.
However, having more or less proven causation once, you shouldn’t get carried away and extend your certainty any further than the precise subject of the test. B may have seen better results than A because your variant is better than the control, but you still can’t be sure why. You can’t confidently know what it is about B that makes your customers more likely to convert.
Use your brain
A/B testing may not be as easy as it seems, but it is an exercise in logic. Therefore the most important advice for any marketer getting involved in A/B testing is simply to think very carefully about everything you choose to do and be sceptical about everything you think you have learned.
It is a minefield out there, so your best bet is always to keep it simple. Come up with a strong hypothesis that justifies your expectation of a big uplift. Approach the data with scepticism and back your own reasoning. Don’t believe the case studies and certainly don’t believe anyone who would tell you that your opinion doesn’t matter in the face of the cold, hard data.
Thinking about A/B testing is dangerous, but it is inevitable. And the only way out is, I’m afraid, more thinking.