Bayesian and Frequentist Approaches: Ask the Right Question

This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers.

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a bayesian one, it could be both — even while solving the same problem.

Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?

We can argue about whether or not the question we are answering is the correct question — but given that it is the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.

So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to this patient, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”

If your doctor has a master’s degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a bayesian question.

Let’s run through a small toy example. We will run a 500-patient clinical trial on drugX. All the patients have “moderately high” blood pressure, and are of similar age, health and family history, and so on. We will measure whether or not drugX reduces their blood pressure to “normal”, somewhere in the region of 120/80. The control group will be on the sort of diet recommended for hypertensive patients (say a low-sodium, low-cholesterol, high-fiber diet) and will take a placebo. The treatment group will be on the same diet, plus drugX.

Now suppose that the diet alone will normalize blood pressure in about 10% of the population. And also suppose that (unknown to the researchers) there is a hidden factor HF (a genetic factor, perhaps) that moderates whether or not drugX actually works. There are two types of people. 90% of the population are HFA, and drugX has no effect on them. 10% of the population are HFB, and drugX completely normalizes blood pressure for 95% of the HFB population.
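The improvement rates implied by these assumptions can be worked out directly. The following is a minimal sketch, assuming that diet and drug act independently (which is also how the simulation code in this article combines them); the variable names are illustrative only.

```r
# True improvement rates implied by the toy model.
# Assumption: diet and drug effects are independent events.
p_diet   <- 0.10   # diet alone normalizes blood pressure
p_hfb    <- 0.10   # share of the population that is HFB
p_drug_b <- 0.95   # drug works for this share of HFB patients

# Share of the general population on whom the drug works
p_drug_any <- p_hfb * p_drug_b

# P(improve) = 1 - P(diet fails) * P(drug fails)
rate_control  <- p_diet
rate_drug_hfa <- p_diet                               # drug does nothing for HFA
rate_drug_hfb <- 1 - (1 - p_diet) * (1 - p_drug_b)
rate_drug_all <- 1 - (1 - p_diet) * (1 - p_drug_any)

round(c(control = rate_control, drug_HFA = rate_drug_hfa,
        drug_HFB = rate_drug_hfb, drug_all = rate_drug_all), 4)
```

Under these assumptions a very large trial would show roughly 18.5% improvement in the drug group against 10% in the control group.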

So you, the omniscient readers of this article, now know that drugX is only effective on about 9.5% of the general population, although an overlapping 10% of the general population will show improvement from diet alone. This gives you the luxury of comparing the “right answer” with what could happen in an actual experiment. Now let’s see what might be observed.

Here’s some R code to simulate the trial.

# 2 populations: HFA, HFB. A not affected by the drug

n = 500;
spontaneous = 0.1 # effectiveness of diet alone
effectiveness = c(0, 0.95)
names(effectiveness) = c("HFA", "HFB")

# set the HF for the population
hfcoin = runif(n)
hf = ifelse(hfcoin < 0.9, "HFA", "HFB")

# assign control and treatment groups
group = runif(n)
group = ifelse(group<0.5, "control", "drug")

# assign outcomes
spontcoin = runif(n)
drugcoin = runif(n)
outcome = ( (spontcoin < spontaneous) | 
            (drugcoin < effectiveness[hf]*(group=="drug")) )

# stringsAsFactors=TRUE gives factor columns on R >= 4.0 as well
expframe=data.frame(group=group,hf=hf, improved=outcome,
                    stringsAsFactors=TRUE)

Here are the summaries I got when I ran the code:

> summary(expframe)
     group       hf       improved      
 control:255   HFA:449   Mode :logical  
 drug   :245   HFB: 51   FALSE:437      
                         TRUE :63       
                         NA's :0    

> with(expframe[group=="control",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   209   16
  HFB    26    4

> with(expframe[group=="drug",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   201   23
  HFB     1   20

> tab = with(expframe, table(group, improved))
> tab
         improved
group     FALSE TRUE
  control   235   20
  drug      202   43

The last contingency table, tab, is the only one of the above summaries known to the researchers. From it, you can see that the drug group had a 100*43/(202+43) = 17.6% improvement rate, and the control group had a 100*20/(235+20) = 7.8% improvement rate. So, empirically, drugX more than doubled the probability of improvement: (43/245)/(20/255) = 2.24, which is called the risk ratio. If you think in odds like a gambler does (odds of improvement are 20 to 235 for the control group), then we have also more than doubled the odds of improvement: (43/202)/(20/235) = 2.5, which is called the odds ratio. Now we want to test if these results are real (and not a fluke).
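The arithmetic above can be reproduced from the table itself. This is a small sketch that rebuilds the table from the printed counts, so it runs on its own; the names rate, odds, risk_ratio and odds_ratio are illustrative.

```r
# Counts copied from the contingency table above
tab <- matrix(c(235, 20,
                202, 43),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("control", "drug"),
                              improved = c("FALSE", "TRUE")))

# Improvement rate per group: improved / group size
rate <- tab[, "TRUE"] / rowSums(tab)

# Odds of improvement per group: improved / not improved
odds <- tab[, "TRUE"] / tab[, "FALSE"]

risk_ratio <- unname(rate["drug"] / rate["control"])
odds_ratio <- unname(odds["drug"] / odds["control"])

round(c(rate, risk_ratio = risk_ratio, odds_ratio = odds_ratio), 3)
```

The risk ratio compares probabilities and the odds ratio compares odds, so the two diverge more as the underlying rates get larger.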

Frequentist Approach

One way to check the significance of the results (from a frequentist viewpoint) is to check whether the rows and columns of the contingency table tab are independent. Under the null hypothesis that improvement is independent of whether or not the patient took the drug, the odds of improvement should be the same for both the control and the drug groups. We can test this using Fisher’s Exact Test for Count Data (or we can use the chi-squared test, which is a large-sample approximation). In Fisher’s test, the null hypothesis is that the odds ratio is 1.

> fisher.test(tab)

	Fisher's Exact Test for Count Data

data:  tab 
p-value = 0.001158
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 1.384802 4.635614 
sample estimates:
odds ratio 

So now we know that our results are significant at the 0.05 level (in fact, at the 0.01 level: with p = 0.0012, if the drug had no effect we would see a result at least this extreme only about 0.1% of the time). We also have a 95% confidence interval for the odds ratio of about 1.38 to 4.64. That interval comes from a procedure that, over repeated experiments, captures the true odds ratio 95% of the time, and here it lies entirely above one. So we can reject the null hypothesis and conclude that drugX increases the improvement rate in the population, relative to diet alone.
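Because this is a toy example with a known generative model, the replication question (how often would another researcher running a trial of the same size also get a significant result?) can be estimated by simulation. The sketch below is not part of the original analysis: simulate_pvalue is a hypothetical helper, and it assumes the same population model as the simulation earlier in the article.

```r
# Simulate one trial under the toy model and return the Fisher p-value.
# Assumptions: the generative model described in this article, with
# patients assigned to each arm by a fair coin flip.
simulate_pvalue <- function(n = 500, spontaneous = 0.1,
                            p_hfb = 0.1, drug_effect = 0.95) {
  hfb      <- runif(n) < p_hfb           # TRUE for HFB patients
  drug     <- runif(n) < 0.5             # TRUE for the treatment arm
  improved <- (runif(n) < spontaneous) |
              (drug & hfb & (runif(n) < drug_effect))
  # fixed factor levels keep the table 2x2 even if a cell is empty
  tab <- table(factor(drug, levels = c(FALSE, TRUE)),
               factor(improved, levels = c(FALSE, TRUE)))
  fisher.test(tab)$p.value
}

set.seed(2012)                            # arbitrary seed, for repeatability
pvals <- replicate(1000, simulate_pvalue())

# Fraction of simulated trials that reach significance at the 0.05 level
replication_rate <- mean(pvals < 0.05)
replication_rate
```

In a real trial the true rates are unknown, so a check like this has to rely on assumed effect sizes rather than the truth.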

Bayesian Approach

But what about the poor patient sitting in the doctor’s examination room? What are the chances that his blood pressure will improve if he takes drugX? Roughly 17%, which is better than the 10% chance from dietary changes alone, but still isn’t very high. Let’s verify this statement using the bayesian approach.

The bayesian approach assumes that the quantity that you are interested in, in this case the rate of improvement p, is distributed according to some distribution Prior(p). Once you have a set of observations, x, you update your estimate of the distribution to

Posterior(p | x) = C * Prior(p) * f(x | p),

where f is the probability of the data conditioned on the parameter, and C is a normalizing constant: one over the total probability of the data, summed over all possible settings of the parameter. Usually, calculating C is hard. Fortunately, for some common scenarios, like coin-flipping, calculating the posterior is quite easy.
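To make the role of C concrete, here is a brute-force sketch that evaluates the update on a grid of candidate values for p. It assumes a flat prior and, purely for illustration, data of 43 improvements out of 245 patients; C is chosen so that the posterior sums to one over the grid.

```r
# Grid approximation of Posterior(p | x) = C * Prior(p) * f(x | p).
# Assumptions (illustrative only): a flat prior on p, and data of
# 43 improvements out of 245 patients.
p_grid     <- seq(0, 1, by = 0.001)
prior      <- rep(1, length(p_grid))                 # flat prior
likelihood <- dbinom(43, size = 245, prob = p_grid)  # f(x | p)

unnormalized <- prior * likelihood
C            <- 1 / sum(unnormalized)   # normalizing constant on the grid
posterior    <- C * unnormalized        # now sums to 1

p_grid[which.max(posterior)]            # grid point with highest posterior
sum(posterior)
```

A grid works for a single parameter, but the number of grid points grows quickly with more parameters, which is one reason C is hard to compute in general.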

Estimating the improvement rate p of drugX is a coin-flipping problem, where p is the (unknown) probability of the coin coming up heads. If you model the coin as a binomial distribution, and the distribution of p as a Beta distribution:

$$f(x \mid p) = \binom{n_{heads} + n_{tails}}{n_{heads}} \, p^{n_{heads}} (1 - p)^{n_{tails}}, \qquad \mathrm{Prior}(p) = \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)}$$

then the posterior is also a Beta distribution, with α’ = α + n_heads and β’ = β + n_tails.

The mode of the distribution (which is what is usually used as a point estimate for p) is

$$\mathrm{mode} = \frac{\alpha - 1}{\alpha + \beta - 2} \quad (\alpha, \beta > 1)$$

The mean of the distribution (which is close to the mode when α and β are large) is

$$\mathrm{mean} = \frac{\alpha}{\alpha + \beta}$$

Now back to our problem. Suppose that we already knew (never mind how) that the hypertension improvement rate from diet alone was about 10%. We can set the prior to have a mean value of 0.1 by setting α = 0.1 and β = 0.9. That looks like this:

[Figure: density of the Beta(0.1, 0.9) prior]

It’s a nasty prior (notice it goes to infinity at both 0 and 1), but it spreads the probability mass all along the unit interval, which is what we want, since we don’t want to start with a very strong bias about the improvement rate. Another common prior is the Jeffreys prior: α = β = 0.5. The Jeffreys prior is maximally uninformative (or minimally biased) and has a mean of 0.5.
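The two priors can be inspected numerically. This is a small sketch using base R's dbeta(); the helper prior_mean and the evaluation points are chosen for illustration only.

```r
# Mean of a Beta(alpha, beta) distribution
prior_mean <- function(alpha, beta) alpha / (alpha + beta)

prior_mean(0.1, 0.9)   # prior centered on the 10% diet-alone rate
prior_mean(0.5, 0.5)   # Jeffreys prior

# Density near the edges and in the interior of the unit interval
x <- c(0.001, 0.1, 0.5, 0.9, 0.999)
prior_table <- data.frame(x        = x,
                          diet     = dbeta(x, 0.1, 0.9),
                          jeffreys = dbeta(x, 0.5, 0.5))
prior_table
```

Both priors have α + β = 1, so each carries about as much weight as a single observation and is quickly dominated by a few hundred patients' worth of data.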

Now let’s calculate the posterior, its mean and its mode, in R:

# The mean of the Beta distribution
beta_mean = function(alpha, beta) {
  alpha / (alpha + beta)
}

# The mode of the Beta distribution (defined only for alpha, beta > 1)
beta_mode = function(alpha, beta) {
  (alpha - 1) / (alpha + beta - 2)
}

#  prior, mean 0.1, mode not defined
alpha = 0.1
beta = 0.9

# The values from the contingency table for the experiment
improved.control = tab[1,2]     # 20
notimproved.control = tab[1,1]  # 235
improved.drug = tab[2,2]        # 43
notimproved.drug = tab[2,1]     # 202

# update the distribution for the treatment group
alpha.drug = alpha + improved.drug
beta.drug = beta + notimproved.drug

# calculate the mean and the mode for the treatment group
beta_mean(alpha.drug, beta.drug) # 0.1752033
beta_mode(alpha.drug, beta.drug) # 0.1725410

# update the distribution for the control group
alpha.control = alpha + improved.control
beta.control = beta + notimproved.control

# calculate the mean and the mode for the control group
beta_mean(alpha.control, beta.control) # 0.07851563
beta_mode(alpha.control, beta.control) # 0.07519685

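# plot both distributions to compare
# the function dbeta() returns the value of the distribution
# at point x, for a given alpha and beta
library(ggplot2); library(reshape2)  # assumed: melt() comes from reshape2
x=seq(from=0.0, to=0.3,by=0.005)
frame = data.frame(x=x, control=dbeta(x, alpha.control, beta.control),
                   drug=dbeta(x, alpha.drug, beta.drug))
# reshape to long format: one row per (x, treatment) pair
frame = melt(frame, id.vars="x",
           measure.vars=c("control", "drug"),
           variable.name="treatment", value.name="y")
ggplot(frame, aes(x=x,y=y,color=treatment)) + geom_line()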

[Figure: posterior densities of the improvement rate for the control and drug groups]

The means and modes of both distributions are about where we estimated them from the naive calculations directly on the contingency table; if we use the mode as our point estimate of the improvement rates for both groups, then the spreads of the distributions give us the uncertainty around that estimate, based on the size of our data sample. The distributions don’t overlap much (the result we expected, based on our frequentist analysis); the two populations do in fact have different improvement rates. The difference (mostly a philosophical one, perhaps) is that this analysis gives us directly what our family doctor wants: an estimate of the posterior probability of a patient’s blood pressure improving when prescribed drugX. We can calculate what is called the 95% credible interval for each distribution: the interval that with 95% probability contains the true improvement rate:

credible_interval= function(conf, alpha, beta){
  p = (1-conf)
  lower = p/2
  upper = 1-lower
  c(qbeta(lower, alpha, beta), qbeta(upper, alpha, beta))
}

credible_interval(0.95, alpha.drug, beta.drug) 
# 0.1303853 0.2250201

credible_interval(0.95, alpha.control, beta.control)
# 0.04887709 0.11437549

Based on this data, if you take drugX for your hypertension, the probability of normalizing your blood pressure is likely somewhere in the range of 13 to 22.5 percent, compared to 4.9 to 11.4 percent from diet alone. So you will improve your chance of normalizing your blood pressure, but it’s more likely that your blood pressure will remain high.
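The two posteriors also allow a direct statement about the comparison: the posterior probability that the drug group's improvement rate exceeds the control group's. The sketch below estimates it by Monte Carlo sampling from the two Beta posteriors. The seed and the number of draws are arbitrary choices, and the posterior parameters are rebuilt from the table counts so the block runs on its own.

```r
# Posterior parameters: prior Beta(0.1, 0.9) plus the observed counts
alpha.drug    <- 0.1 + 43;  beta.drug    <- 0.9 + 202
alpha.control <- 0.1 + 20;  beta.control <- 0.9 + 235

set.seed(42)                      # arbitrary seed, for repeatability
n_draws   <- 100000
p_drug    <- rbeta(n_draws, alpha.drug, beta.drug)
p_control <- rbeta(n_draws, alpha.control, beta.control)

# Posterior probability that the drug arm has the higher improvement rate
prob_drug_better <- mean(p_drug > p_control)
prob_drug_better

# 95% credible interval for the difference in improvement rates
diff_interval <- quantile(p_drug - p_control, c(0.025, 0.975))
diff_interval
```

This treats the two groups as independent, which matches how the two posteriors were computed above.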

The credible interval, by the way, is what most people think the confidence interval is. With 95% probability (based on the available evidence), the true improvement rate is in the 95% credible interval. The 95% confidence interval is the interval that is produced by a construction procedure such that, if you repeated the experiment again and again, the constructed confidence interval contains the true improvement rate 95% of the time. This still makes it likely that you’ve bracketed the true improvement rate, and in practice, the confidence interval is probably a good stand-in for the credible interval. It’s just not really answering the question you actually asked, philosophically speaking.
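For the drug group, the two kinds of interval can be computed side by side. This is a sketch: binom.test() from base R gives an exact (Clopper-Pearson) confidence interval, which is one of several possible frequentist constructions, and the credible interval uses the Beta(0.1, 0.9) prior from above.

```r
improved <- 43; n_drug <- 245      # drug group counts from the table

# Frequentist: exact (Clopper-Pearson) 95% confidence interval
conf_int <- binom.test(improved, n_drug, conf.level = 0.95)$conf.int

# Bayesian: 95% credible interval under the Beta(0.1, 0.9) prior
cred_int <- qbeta(c(0.025, 0.975),
                  0.1 + improved, 0.9 + (n_drug - improved))

rbind(confidence = as.numeric(conf_int), credible = cred_int)
```

With a weak prior and a few hundred observations the two intervals nearly coincide; they differ in interpretation rather than in their numbers.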

The Hidden Factor

Suppose the researchers had suspected that the hidden factor HF might be implicated in the drug’s performance, and had been able to measure it in the experiment.

tabfull = aggregate(numeric(dim(expframe)[1])+1,
        by=list(expframe$group, expframe$hf, expframe$improved), FUN=sum)

> tabfull
  Group.1 Group.2 Group.3   x
1 control     HFA   FALSE 209
2    drug     HFA   FALSE 201
3 control     HFB   FALSE  26
4    drug     HFB   FALSE   1
5 control     HFA    TRUE  16
6    drug     HFA    TRUE  23
7 control     HFB    TRUE   4
8    drug     HFB    TRUE  20

In this case, we can also estimate the posterior probabilities of improvement for each group, using the bayesian approach. I’ll just give you the graph.

[Figure: posterior densities of the improvement rate for each combination of group and hidden factor]

From this evidence, HFB people taking drugX have better than 75% probability of improving their blood pressure; everyone else has probability less than 25%. Just looking at the modes of the distributions, you might naively think that HFA people also have a higher improvement rate when they are taking drugX, or that HFB people have a higher improvement rate than HFA people even in the control group. But the distributions overlap substantially; there is no real evidence that the three groups on the left of the graph have different improvement rates. In other words, if your family doctor knows that you are type HFB, it would make sense to prescribe drugX for your high blood pressure; if you are type HFA, then it doesn’t.
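The numbers behind that graph can be sketched by applying the same conjugate update to each of the four subgroups. The counts are copied from tabfull, and the Beta(0.1, 0.9) prior is assumed to be the same one used for the two-group analysis; the column names are illustrative.

```r
# Counts copied from tabfull above
counts <- data.frame(
  group       = c("control", "drug", "control", "drug"),
  hf          = c("HFA", "HFA", "HFB", "HFB"),
  improved    = c(16, 23, 4, 20),
  notimproved = c(209, 201, 26, 1)
)

# Conjugate update of the Beta(0.1, 0.9) prior for each subgroup
a <- 0.1 + counts$improved
b <- 0.9 + counts$notimproved

counts$post_mean <- a / (a + b)           # posterior mean
counts$lower95   <- qbeta(0.025, a, b)    # 95% credible interval, lower
counts$upper95   <- qbeta(0.975, a, b)    # 95% credible interval, upper

print(counts, digits = 3)
```

The HFB subgroups are small (30 control patients and 21 drug patients), so their intervals are much wider than those for the HFA subgroups.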

This is the kind of reasoning promoted by the personalized medicine movement. In fact it is what your family doctor already tries to do, by taking into account your family and previous health history, and so on. So far, your doctor can only do this in a negative way — if you have a family history of colon cancer, then start your annual colonoscopies sooner, otherwise, don’t bother — and as far as I know (though I’m not a doctor or a medical researcher) most published medical research isn’t designed to help doctors make “bayesian type” assessments in a more positive way.

But Don’t Throw Out Frequentism

So we’ve established that determining individual patient outcomes is a bayesian question. You might then wonder why anyone would use the frequentist approach at all. But some problems really are frequentist. A medical practitioner who is in public health rather than in a direct patient care practice is interested in the effects of treatments over entire populations, rather than on individuals. Similarly, an insurance company that is deciding whether or not to approve coverage for drugX is interested in whether the drug helps anyone, at all, or if the drug is no better than diet alone. In those situations, a frequentist analysis of drugX does in fact answer the question that is being asked.
