Frequentist inference only seems easy

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over well-behaved distributions of values.

As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.

In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as confidence intervals. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).

Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: given a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).

In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones.

Consider estimating from observation the odds that a coin-flip comes out heads (as shown below).

IMG 6260
Heads outcome.

The coin can also show a tails (as shown below).

IMG 6261
Tails outcome.

This might be a fair coin, that when tossed properly can be argued to have heads/tails probabilities very close to 50/50. Or the heads/tails outcome could in fact be implemented by some other process with some other probability p of coming up heads. Suppose we flip the coin 100 times and record heads 54 times.

In this case the frequentist procedure is to generate a point-estimate of the unknown p as pest = 54/100 = 0.54. That is we estimate p to be the relative frequency we actually empirically observed. Stop and consider: how do we know this is the right frequentist estimate? Beyond being told to use it, what principles lead us to this estimate? It may seem obvious in this case, but in probability mere obviousness often leads to contradictions and paradox. What criteria can be used to derive this estimate in a principled manner?

Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition p. 92 states that frequentist estimates are designed to be consistent (as the sample size increases they converge to the unknown value), efficient (they tend to minimize loss or expected square-error), or even have asymptotic unbiasedness (the difference in the estimate from the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate). Because some of the estimators we will work with are biased we are going to use expected square-error as our measure of error. This is the expected value of the square of the distance of our estimate from the unknown true value, and not the variance (which is the expected value of the square of the distance of the estimator from its own mean).

Frequentists also commonly insist on fully unbiased procedures (which is what we will discuss) here. In this case an unbiased
procedure is a function f(nHeads,nFlips) that given the sufficient statistics of the experiment (the number of heads and the total number of flips) returns an estimate for the unknown probability. The frequentist philosophy assumes the unknown probability p is fixed and the observed number of heads might vary as we repeat the coin-flip experiment again and again. To confirm a frequentist procedure to estimate p from 100 flips is unbiased, we must check that the entire family of possible estimates f(0,100), f(1,100), ... f(100,100) together represent an panel of estimates that are simultaneously unbiased no matter what the unknown true value of p is.

That is: the following bias check equation must hold for any p in the range [0,1]:

Equation family 1: Bias checks (one f(h,n) variable for every possible outcome h, one equation for every possible p).

Some combinatorics or probability theory tells us P(h|n,p) = (n choose h) p^h (1-p)^(n-h). We can choose to treat the sequence f(0,nFlips),f(1,nFlips), ... f(nFlips,nFlips) either as a set of pre-made estimates (to be checked) or as a set of variables (to be solved for). It turns out there is a solution that satisfies all of the equations simultaneously: f(h,n) = h/n. This fact is just a matter of checking that the expected value of the number of heads is p times the number of flips. And this is the only unbiased solution. The set of check equations we can generate for various p has rank nFlips+1 (when we include check equations from at least nFlips+1 different values of p, this follows as the check equations behave a lot like the moment curve). We will work a concrete example of the family 1 bias checks a bit later (which should make seeing the content of the chekcs a bit easier).

The pre-packaged frequentist estimation procedure is easy: write down the empirically observed frequency as your estimate. But the derivation should now seem a bit scary (submit a panel of nFlips+1 simultaneous estimates and confirm they simultaneously obey an uncountable family of bias check equations). And this is one of the merits of the frequentist methods- the hard derivational steps don’t have to be reapplied each time you encounter new data, so the end user may not need to know about them.

Let’s look at the same data using Bayesian methods. First we are required to supply prior beliefs on the possible values for p. Most typically we would operationally assume unknown p is beta distributed with shape parameters (1/2,1/2) (the Jeffreys prior) or shape parameters (1,1) (implementing classic Laplace smoothing). I’ll choose to use the Jeffreys prior, and in that case the posterior distribution (what we want to calculate) turns out to be a beta distribution with shape parameters (54.5,46.5). Our complete posterior estimate of probable values of p is given by the R plot below:

d <- data.frame(p=seq(0,1,0.01))
d$density <- dbeta(d$p,shape1=54.5,shape2=46.5)
ggplot(data=d) + geom_line(aes(x=p,y=density))
## [1] 0.539604

The posterior distribution of p.

And the common Bayesian method if obtaining an estimate of a summary statistics is to just compute the appropriate summary statistic from the estimated posterior distribution. So if we only want a point-estimate for p we can use the expected value 54.5/(54.5+46.5) = 0.539604 or the mode (maximum likelihood value) (54.5-1)/(54.5+46.5-2) = 0.540404 of the posterior beta distribution. But having a complete graph of an estimate of the complete posterior distribution also allows a lot more. For example: from such a graph we can work out a Bayesian credible interval (which has a given chance of containing the unknown true value p assuming our generative modeling assumptions and priors were both correct). And this is one of the reasons Bayesians emphasize working with distributions (instead of point-estimates): even though they can require more work to derive and use, they retain more information.

Notice the complications of having to completely specify a prior distribution have not been hidden from us. The actual application of Bayes’ law (an explicit convolution or integral relating the prior distribution to the posterior through a data likelihood function) has (thankfully) been hidden by appealing to the theory of conjugate distributions. So the Bayes theory is hiding some pain from us, but significant pain is still leaking through.

And this is common: what is commonly called a frequentist analysis is often so quick you almost can’t describe the motivation, and the Bayesian analysis seems like more work. What we want to say is this is not always the case. If there is any significant hidden state, or constraints on the possible values, then the Bayesian calculation becomes in fact easier than a fully derived frequentist calculation. And that is what we will show in our next example. But first let’s cut down confusion by fixing detailed names for a few common inference methods:

  • Empirical frequency estimate. This is just the procedure of using the empirically observed frequencies as your estimate. This is commonly thought of as “the frequentist estimate.” However, we are going to reserve the term “proper frequentist estimate” for an estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We will also call the empirical frequency estimate the “prescriptive frequentist estimate” as it is a simple “do what you are told” style procedure.
  • Proper frequentist estimate. As we said, we are going to use this term for the estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We use the traditional frequentist framework: the unknown parameters to be estimated are assumed to be fixed, and probabilities are over variations in possible observations if our measurement procedures were to be repeated. We define this estimate as an unbiased estimate that minimizes expected loss/square-error for arbitrary possible values of the unknown parameters to be estimated. Often the bias check conditions are so restrictive that they completely determine the proper frequentist estimate and cause the proper frequentist estimate to agree with the empirical frequency estimate.
  • Full generative Bayesian estimate. This is a complete estimate of the entire posterior distribution of values for the unknown parameters to be estimated. This is under the traditional Bayesian framework that the observations are fixed and the unknown parameters to be estimated take on values from a non-trivial prior distribution (that is a distribution that takes on more than one possible value). Under the (very strong) assumptions that we have the correct generative model and the correct prior distribution the estimated posterior is identical to how the unknown parameters are distributed conditioned on the known observations. Thus reasonable summaries built from the full generative Bayesian estimate should be good (without explicitly satisfying conditions such as unbiasedness or minimal loss/square-error). We are avoiding the usual distinction of objective versus subjective interpretation (Bayesian usually being considered subjective if we consider the required priors subjective beliefs).
  • Bayes point-estimate. This is a less common procedure. A full generative Bayesian estimate is wrapped in a procedure that hides details of the generative model, priors and Bayes inference step. What is returned is single summary of the detailed posterior distribution such as a mean (useful for producing low square-error estimates) or mode (useful for producing maximum likelihood estimates). For our examples the Bayes point-estimate will be a procedure that returns an estimate mean (or probability/rate) using the correct generative model and uniform priors (when there is a preferred problem parameterization, otherwise we suggest looking into invariant ideas like the Jeffreys prior).

Our points are going to be: the empirical frequency estimate is very easy, but is not always the proper frequentist estimate. The proper frequentist estimate can be itself cumbersome to derive, and therefore hard to think of as “always being easier than the Bayesian estimate.” And finally one should consider something like the Bayes point-estimate when one does not want to make a complete Bayesian analysis the central emphasis of a given project. We will illustrate these points with a simple (and natural) example.

Returning to our coin-flip problem. Suppose we introduce a five sided control die that is set once (and held fixed) before we start our experiments. Then suppose each experiment is a roll of a fair six-sided die and we observe “heads” if the number of pips on the six-sided die is greater than the number (1 through 5) shown on the control die (and otherwise “tails”). The process is strongly stationary in that the probability p is a single fixed value of the entire series of experiments. Our imagined apparatus is depicted below.

IMG 6264
Our apparatus (the 5-sided die is simulated with a 10-sided die labeled 1 through 5 twice).

We assume that while we understand the generative mechanics of the generation process, but that we don’t see the details of the actual die rolls. We observe only the reported heads/tails outcomes (as shown below).

IMG 6265
What is observed.

This may seem like a silly estimation game, but it succinctly models a number of important inference situations such as: estimating advertisement conversion rates, estimating health treatment success rates, and so on. We pick a simple formulation so that when we run into difficulties or complications it will be clear that they are essential difficulties (and not avoidable domain issues). Or: if your estimation procedures are not correct on this example, how can you expect them to be correct in more complicated real-world situations? Another good example of this kind of analysis is: Sean R. Eddy “What is Bayesian statistics” Nature Biotechnology, Vol. 22, No. 9, Sept. 2004, pp. 1177-1178. Eddy presented a clever inference problem comparing where pool balls hit a rail relative to a uniform random chalk mark on the rail. Eddy’s problem illustrates the issues of inference when there are important unobserved (or omitted) state variables. Our example is designed to allow further investigation of the both Bayesian and Frequentist inference in the presence of constraints (not quite the same as complete priors).

We will consider two important ways the control die could be set: by a single roll before we start observations (essentially embodying the Bayesian generative assumptions), or by a manual selection by an assumed hostile agent (justifying the usual distribution-free frequentist minimax treatment of loss/square-error).

IMG 6267
An adversary holding the control die at a chosen value.

Let’s start with the case where the control die is set before we start measurements by a fair (uniform) roll of the five sided die. Because the control die only has 5 possible states the unknown probability p has exactly 5 possible values. In this case we can write down all of the bias check equations for every possible outcome of a one coin-flip simulation. For only one flip observed there are only two possible outcomes: either we see one heads or one tails. So we have two possible outcomes (giving us two variables, as we get one estimate variable per sufficient outcome) and 5 check equations (one for each possible value of p). The complete bias check equations are represented by the matrix a and vector b shown below:

> print(freqSystem(6,1)) $a prob. 0 heads prob. 1 heads check for p=0.166666666666667 0.8333333 0.1666667 check for p=0.333333333333333 0.6666667 0.3333333 check for p=0.5 0.5000000 0.5000000 check for p=0.666666666666667 0.3333333 0.6666667 check for p=0.833333333333333 0.1666667 0.8333333 $b p check for p=0.166666666666667 0.1666667 check for p=0.333333333333333 0.3333333 check for p=0.5 0.5000000 check for p=0.666666666666667 0.6666667 check for p=0.833333333333333 0.8333333

The above is just the family 1 bias check equations for our particular problem. A vector of estimates f is unbiased if and only if a f - b = 0 (i.e. it obeys the equation family 1 checks). When a is full rank (in this case when the number of variables is no more than the number of checks) the bias check equations completely determine the unique unbiased solution (more on this later). So even in this “discrete p” situation: for any number of flips less than 5, the bias conditions alone completely determine the unique unbiased estimate.

What we are trying to show is that when we move away from the procedure “copy the observed frequency as your estimate” to the more foundational “pick an unbiased family of estimates with minimal expected square-error”, then frequentist reasoning appears a bit more complicated. Let’s continue with a frequentist analysis of this problem (this time in python instead of R, see here for the complete code).

The common “everything wrapped in a bow” prescriptive empirical frequency procedure is by far the easiest estimate:

# Build the traditional frequentist empirical estimates of # the expected value of the unknown quantity pWin # for each possible observed outcome of number of wins # seen in kFlips trials def empiricalMeansEstimates(nSides,kFlips): return numpy.array([ j/float(kFlips) for j in range(kFlips+1) ])

And if we load this code (and all of its pre-conditions) we get the following estimates of p if we observe one coin experiment:

>>> printEsts(empiricalMeansEstimates(6,1)) pest for 0 heads 0.0 pest for 1 heads 1.0

Using our bias check equations we can confirm this solution is indeed unbiased:

>>> sNK = freqSystem(6,1) >>> printBiasChecks(matMulFlatten(sNK['a'], \ empiricalMeansEstimates(6,1)) - flatten(sNK['b'])) bias for p=0.166666666667 0.0 bias for p=0.333333333333 0.0 bias for p=0.5 0.0 bias for p=0.666666666667 0.0 bias for p=0.833333333333 0.0

And has moderate loss/square-errors:

>>> printLosses(losses(6,empiricalMeansEstimates(6,1))) exp. sq error for p= 0.166666666667 0.138888888889 exp. sq error for p= 0.333333333333 0.222222222222 exp. sq error for p= 0.5 0.25 exp. sq error for p= 0.666666666667 0.222222222222 exp. sq error for p= 0.833333333333 0.138888888889

But the solution is kind of icky. Remember, this result was completely determined by the unbiased check conditions. It says if we observe one coin experiment and see tails then the estimate for p is zero, if we see heads the estimate for p is one. Both of these estimates are well outside the range of possible values for p! Recall our heads/tails coin events are assigned “heads” if the number of pips on the 6-sided die exceeds the mark on the control die (which are the numbers 1 through 5). Thus p only takes on values in the range 1/6 (when the control is 5) through 5/6 (when the control is 1). In fact p is always going to be one of the values: 1/6, 2/6, 3/6, 4/6, or 5/6. The frequentist analysis is failing to respect these known constraints (which are weaker than assuming actual priors).

We can try fixing this with a simple procedure such as Winsorising or knocking everything back into range. For example the estimate [1/6,5/6] is biased but has improved loss/square-error:

>>> w = [1/6.0,5/6.0] >>> printBiasChecks(matMulFlatten(sNK['a'], w) - flatten(sNK['b'])) bias for p=0.166666666667 0.111111111111 bias for p=0.333333333333 0.0555555555556 bias for p=0.5 0.0 bias for p=0.666666666667 -0.0555555555556 bias for p=0.833333333333 -0.111111111111 >>> printLosses(losses(6,w)) exp. sq error for p=0.166666666667 0.0740740740741 exp. sq error for p=0.333333333333 0.101851851852 exp. sq error for p=0.5 0.111111111111 exp. sq error for p=0.666666666667 0.101851851852 exp. sq error for p=0.833333333333 0.0740740740741

There are other ideas for fixing estimates (such as shrinkage to reduce expected square-error, or quantization to improve likelihood). But the point is these are not baked into the traditional simple empirical frequency estimate. Once you start adding all of these features you may have a frequentist estimator that is as complicated as a Bayesian estimator is thought to be, and a frequentist estimator that is no longer considered pure with respect to traditional frequentist criticisms.

Let’s switch to the Bayes analysis for the game where the 5-sided control dice is set uniformly at random. A good Bayes point-estimate is easy to derive, as the appropriate priors for p are obvious (uniform on 1/6, 2/6, 3/6, 4/6, 5/6). Our Bayes point-estimates for the expected value of p turn out to be:

>>> printEsts(bayesMeansEstimates(6,1)) pest for 0 heads 0.388888888889 pest for 1 heads 0.611111111111

Which means: for 1 tails we estimate p=.38888889 and for 1 heads we estimate p=0.61111111. Notice these estimates are strictly inside the range [1/6,5/6] (pulled in by 2/9 in both cases). Also notice because we have wrapped the Bayes estimate in code it appears no more complicated to the user than the empirical estimate (sure the code is larger than the empirical estimate, but that is exactly what an end user does not need to see). We have intentionally hidden from the user some important design choices (priors, the Bayes step convolution, use of a mean estimate instead of a mode). The estimator (see here or here) has wrapped up proposing a prior distribution, deriving the posterior distribution from the data likelihood equations (applying Bayes law), and then returning the expected value of the posterior as a single point-estimate. In addition to hiding the implementation details, we have refrained (or at least delayed) educating the user out of their desire for a simple point-estimate. We have not insisted the user/consumer of the result learn to use the (superior) complete posterior distribution in favor of mere point-estimates. For a Bayes estimate to be replacement compatible for a frequentist one we need (at least initially) put it into the same format as the frequentist estimate it is competing with. This squanders a number of the advantages the Bayes posterior, but as we will see the Bayes estimate is still lesser expected square-error (more efficient) than the frequentist one. So initially offering a Bayes estimate as a ready to go replacement for the frequentist estimate is of some value, and we don’t want to lose that value by initially requiring additional user training.

Unfortunately this Bayes point-estimate solution is biased, as we confirm here:

>>> printBiasChecks(matMulFlatten(sNK['a'], \ bayesMeansEstimates(6,1)) - flatten(sNK['b'])) bias for p=0.166666666667 0.259259259259 bias for p=0.333333333333 0.12962962963 bias for p=0.5 0.0 bias for p=0.666666666667 -0.12962962963 bias for p=0.833333333333 -0.259259259259

But, as we mentioned, our Bayes point-estimate has some advantages. Let’s also look at the expected loss each estimate would give for every possible value of the unknown probability p:

>>> printLosses(losses(6,bayesMeansEstimates(6,1))) exp. sq error for p= 0.166666666667 0.0740740740741 exp. sq error for p= 0.333333333333 0.0277777777778 exp. sq error for p= 0.5 0.0123456790123 exp. sq error for p= 0.666666666667 0.0277777777778 exp. sq error for p= 0.833333333333 0.0740740740741

Notice that the Bayes estimate has smaller expected square-error (or in statistical parlance is a more efficient estimator) no matter what value p takes. The unbiased check conditions forced the frequentist estimate to a high expected square-error estimator. This means demanding the estimator be strictly unbiased may not be a good trade-off (and the frequentist habit of deriding other estimators for “not being unbiased” may not always be justified). To be fair bias can be a critical flaw if you intend to aggregate it with other estimators later (as enough independent unbiased estimates can be averaged to reduce noise, which is not always true for biased estimators).

Let’s give the frequentist estimate another chance. For our discrete set of possible values p (1/6, 2/6, 3/6, 4/6, 5/6) once the number of coin-flips is large enough the equation family 1 bias checks no longer completely determine the estimate. So it is no longer immediately obvious that the observed empirical frequency is minimal loss. In fact it is not, so we can no longer consider the canned empirical solution to be the unique optimal estimate. Note this differs from the case where p takes on many different values from a continuous interval, which is enough to ensure the bias check conditions completely determine a unique solution. Continuing with an example: if we observed 7 flips an improved frequentist estimate (under the idea it is an unbiased point-estimate with minimal expected square-error) is as follows:

>>> printEsts(newSoln) pest for 0 heads 0.0319031034157 pest for 1 heads 0.111845090806 pest for 2 heads 0.296666330987 pest for 3 heads 0.439170280769 pest for 4 heads 0.560830250198 pest for 5 heads 0.703332297349 pest for 6 heads 0.888156558984 pest for 7 heads 0.968095569167

To say we decrease loss we have to decide on a scalar definition of loss: be it maximum loss, total loss or some other criteria. This solution was chosen to decrease maximum loss (an idea compatible with frequentist philosophy) and was found through constrained optimization. Notice this solution is not the direct empirical relative frequency estimate. For example: in this estimate if you see seven tails in a row you estimate p=0.0319031 not p=0 (though we still have 0.0319031 < 1/6 which is an out of bounds estimate). This estimate is a pain to work out (the technique I used involved optimizing a move in directions orthogonal to the under-rank bias check conditions; perhaps some clever math would allow us to consider this solution obvious, but that is not the point). It is not important if this new solution is actually optimal, what is important is it is unbiased and has a smaller maximum loss (meaning the empirical estimate itself can not be considered optimal in that sense). The fact that the unknown probability p can only be one of the values 1/6, 2/6, 3/6, 4/6, 5/6 has changed which unbiased estimate is in fact the minimal loss one (added a new lower loss solution that would not considered unbiased if p could choose from more possible values).

Depending on your application it can be the case that either of the frequentist or Bayesian estimate has better utility. But is is unusual for the frequentist estimate to be the harder one to calculate (as is the case here).

The Bayes solution in this case is:

>>> printEsts(bayesSoln) pest for 0 heads 0.203065668302 pest for 1 heads 0.251405546037 pest for 2 heads 0.33603150662 pest for 3 heads 0.443861984801 pest for 4 heads 0.556138015199 pest for 5 heads 0.66396849338 pest for 6 heads 0.748594453963 pest for 7 heads 0.796934331698

This is still biased, but all values are in range and the losses are smaller than the frequentist losses for all possible values of p (again limited to: 1/6, 2/6, 3/6, 4/6, 5/6).

To be fair the differences in loss/square-error are small (and shrinking rapidly as the number of observed flips goes up, so it is a small data problem). The point we want to make isn’t which estimate is better (that depends on how you are going to use the estimate, your domain, and your application), but the idea that: Bayesian methods are not necessarily more painful that frequentist procedures. The Bayesian estimation procedure requires more from the user (the priors) and has an expensive and complicated convolution step to use the data to relate the priors to the posteriors (unless you are lucky enough to have something like the theory of conjugate distributions to hide this step). The frequentist estimation procedure seems to be as simple as “copy over your empirical observation as your estimate.” That is unless you have significant hidden state, constraints or discreteness (not the same as having priors). When you actually have to justify the frequentist inference steps (versus just benefiting from them) you find you have to at least imaging submitting every possible inference you could make as a set of variables and picking a minimax solution optimizing expected square-error over the unknown quantities while staying in the linear flat of unbiased solutions (itself a complicated check).

Note that each style analysis is correct on its own terms and is not always compatible with the assumptions of the other. This doesn’t give one camp a free-card to criticize the other.

My advice is: Bayesians need to do a better job of wrapping standard simple analyses (you shouldn’t have to learn and fire up Stan for this sort of thing), and we all need to be aware that proper frequentist inference is not always just the common simple procedure of copying over the empirical observations.

For full implementations/experiments (and results) click here for R and here for python.

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)