We here at Win-Vector LLC been working through an ad-hoc series about A/B testing combining elements of both operations research and statistical points of view.
- A dynamic programming solution to A/B test design
- Why does designing a simple A/B test seem so complicated?
- A clear picture of power and significance in A/B tests
- Bandit Formulations for A/B Tests: Some Intuition
- Bayesian/loss-oriented: New video course: Campaign Response Testing
Our most recent article was a dynamic programming solution to the A/B test problem. Explicitly solving such dynamic programs gets long and tedious, so you are well served by finding and introducing clever invariants to track (something better than just raw win-rates). That clever idea is called “sequential analysis” and was introduced by Abraham Wald (somebody we have written about before). If you have ever heard of a test plan such as “first process to get more than 30 wins ahead of the other is the one we choose” you have seen methods derived from Wald’s sequential analysis technique.
Wald’s famous airplane armor problem
In this “statistics as it should be” article we will discuss Wald’s sequential analysis.
Wald’s graphical sequential inspection procedure
A particularly compelling (and unusual) type of test plan is the graphical sequential inspection procedure designed by Abraham Wald.
Wald treats A/B testing as a “stopping problem.” Wald asks the business partner for four parameters: u0,u1 (the desired bounds on relative error allowed in the estimate); and alpha,beta (the power and significance goals). With these parameters Wald designs an inspection plan that is a single chart (shown below).
Figure 14, Section 6.4.2, page 111, Abraham Wald, Sequential Analysis, Dover 2004 (reprinting a 1947 edition).
The chart is used as follows: we send traffic to processes 1 and 2 in matched pairs. We get back pair measurements (c1,c2) where c1=1 if process 1 paid off (0 otherwise) and c2=1 if process 2 paid off (zero otherwise). Only look at pairs (0,1) or (1,0) (we discard all (0,0) and (1,1) pairs, they are not allowed to update the chart). Start at the origin of the inspection chart. When you see a (0,1) or (1,0) pair move on the inspection chart one unit to the right; if the observation was a (0,1) also move one unit up (a process 1 “win”). When you cross one of the decision lines you are done (“Reject process 1” means go with 2, and “Accept process 1” means go with 1). This is completely procedural, but brilliant. It is a chart you can actually print, laminate, and post on the machine shop floor.
The idea is so brilliant, it is easier to demonstrate than to describe (so we suggest skipping ahead and checking out our worked example here).
Wald’s example was running two manufacturing processes and then shutting down the one that seems to have a lower success rate. Our internet age example would be running two advertisement campaigns and shutting down the one with lower conversion rate. In all cases the procedure is designed in terms of continuously looking at results (something people are likely to do), instead of prescribing a batch interval and hoping nobody spoils the statistical significance by peaking early (the classic approach).
Wald’s sequential inspection procedure is itself a purely prescriptive procedure: move along the chart as he prescribes and you have fairly good decision procedure. So any fretting or worrying about business trade-offs need to have been worked through when specifying the u0,u1,alpha,beta needed to design the chart. This means these values need to be tried and the business person needs to see if the expected length of the inspection plans is acceptable (and revise their selection of u0,u1,alpha,beta if they are not). However, due to the extremely clever frame work, the chart is simple (two parallel lines) and all calculation is done when producing the chart (no calculation remains while making decisions).
The decision boundaries are two parallel lines with common slope determined by four user supplied parameters. Notice that there is no “experiment length” in Wald’s sequential analysis procedure. How long the experiment runs is a random variable determined by the user parameters and the results seen. You don’t spoil a sequential analysis by “peeking” or “stopping early” (which would spoil a standard A/B test design).
One fascinating property of Wald’s procedure is it isn’t running off a traditional sufficient statistic. If the data were all exchangeable then the sufficient statistic would be how many times process 1 was tried, how many process 1 successes, how many times process 2 was tried, and how many process 2 successes. Wald’s position chart isn’t a function of these sufficient statistics; it also depends on the exact pairing of process 1 and process 2 trials. Wald derives his decision surface from how many times process 1 performed different from process 2 for a single given pairing of the process 1 an process 2 trials. Oddly enough for measuring different rates of variations rare events (such as advertisement conversion) this “loss of statistical efficiency” is tiny.
Part of the brilliance of the plan is by throwing out (0,0) and (1,1) measurements (not allowing them to move the point on the sequential analysis chart) makes the chart independent of the rate we see (1,0) and (0,1) events, or largely independent of the magnitude of the unknown rates we are trying to compare. That is Wald’s chart is roughly a pivotal summary of the experiment to date as the estimate is roughly independent of the unknown rates (to be estimated) once conditioned on the observations (a property shared by sampling distributions such as Student’s t-distribution). By retaining only the (0,1) and (1,0) measurements Wald converted an arbitrary comparison of rates experiment to a specific “is the rate 50/50 or not” experiment.
These advantages are likely why Optimizely switched to sequential hypothesis testing (from standard power/significance planners).
We are sharing an R implementation of Wald’s sequential analysis test and also plot the simulated application of the test over many repetitions. This yields the following figure showing a simulation process 2 being reliably selected (when it is in fact the higher return process).
Repeated simulation of sequential selection procedure.