# Comparing Frequentist, Bayesian and Simulation methods and conclusions


So, a programmer, a frequentist, and a bayesian walk into a bar. No, this post isn’t really on the path to some politically incorrect stereotypical humor. Just trying to make it fun and catch your attention. As the title implies, this post is really about applying the differing viewpoints and methodologies inherent in those approaches to statistics. To be honest, I’m not even going to spend a lot of time on the details of methods; I want to focus on the conclusions each of these people would draw after analyzing the same data using their favored methods. This post isn’t really so much about how they would proceed but more about what they would conclude at the end of the analysis. I make no claim about which of these fine people has the best way of doing things. Although I was raised a frequentist, I am more and more enamored of bayesian methods, and while my tagline “R Lover !a programmer” is accurate, I will admit a love for making computers do the hard work for me, so if that involves a **little** programming, I’m all for it.

### Background

Last week I saw a nice post from Anindya Mozumdar on the R Bloggers feed. The topic material was fun for me (analyzing the performance of male 100m sprinters and the fastest man on earth), as well as exploring the concepts in Allen B. Downey’s book *Think Stats: Probability and Statistics for Programmers*, which is at my very favorite price point, free download, and Creative Commons licensed. The post got me interested in following up a bit and thinking things through “out loud” as a blog post.

**This post is predicated on your having read
https://www.radmuzom.com/2019/05/18/fun-with-statistics-is-usain-bolt-really-the-fastest-man-on-earth/.
Please do. I’ll wait until you come back.**

Welcome back. I’m going to try and repeat as little as possible from that blog post and just make comparisons. So, to continue our little teaser: a programmer, a frequentist, and a bayesian walk into a bar and start arguing about whether Usain Bolt really is the fastest man on earth. The programmer has told us how they would go about answering the question. The answer was:

There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the (median) timings of Usain Bolt and Asafa Powell.

and was derived by counting observations across 10,000 simulations of the data using the `infer` package and looking at differences between median timings. Our null hypothesis was there is no “real” difference between Bolt and Powell even though our data has a median for Bolt of 9.90 and a median for Powell of 9.95. That is, after all, a very small difference. But our simulation allows us to reject that null hypothesis and favor the alternative that the difference is real.
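The counting logic behind that simulation can be sketched in a few lines of base R, without the `infer` package. The timings below are simulated stand-ins (not the real dataset), so the exact p value will differ from the 1.85% reported above:

```r
# Base-R sketch of the permutation test: shuffle group labels, recompute
# the difference in medians, and count how often the shuffled difference
# is at least as extreme as the observed one.
set.seed(42)
bolt   <- round(rnorm(50, mean = 9.90, sd = 0.08), 2)  # toy stand-in data
powell <- round(rnorm(50, mean = 9.95, sd = 0.08), 2)
obs_diff <- median(bolt) - median(powell)

pooled <- c(bolt, powell)
perm_diffs <- replicate(10000, {
  shuffled <- sample(pooled)                     # break any real association
  median(shuffled[1:50]) - median(shuffled[51:100])
})

# One-sided p value: proportion of shuffles as extreme as what we observed
p_value <- mean(perm_diffs <= obs_diff)
```

The same idea scales to any statistic: replace `median` with `mean` (or anything else) and the machinery is unchanged.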

Should we be confident that we are 100% – 1.85% = 98.15% likely to be correct? NO! As Downey notes:

For most problems, we only care about the order of magnitude: if the p-value is smaller than 1/100, the effect is likely to be real; if it is greater than 1/10, probably not. If you think there is a difference between a 4.8% (significant!) and 5.2% (not significant!), you are taking it too seriously.

Can we say that Bolt will win a race with Powell 98% of the time? Again a resounding NO! Are we 98% certain that the *“true”* difference in their medians is .05 seconds? NOPE.

### Back to the future

Okay, we’ve heard the programmer’s story at our little local bar. It’s time to let our frequentist have their moment in the limelight. Technically the best term would be Neyman-Pearson Frequentist, but we’re not going to stand on formality. The approach is nearly a century old and stands as an “improvement” on Fisher’s significance testing. There’s a nice little summary on Wikipedia. I’m not here to belabor the nuances, but frequentist methods are among the oldest and arguably the most prevalent methods in many fields. They are often the first method people learn in college and sometimes the only method. They still drive most of the published research in many fields, although other methods are taking root.

Before the frequentist can tell their tale, though, let’s make sure they have the same data as in the earlier post. Let’s load all the libraries we’re going to use and very quickly reproduce the process Anindya Mozumdar went through to scrape and load the data. We’ll have a `tibble` named `male_100` that contains the requisite data, and we’ll confirm that the mean and median summaries for the top 6 runners are identical. Note that I am suppressing messages as the libraries load since R 3.6.0 has gotten quite chatty in this regard.

```r
library(rvest)
library(readr)
library(dplyr)
library(ggplot2)
library(ggstatsplot)
library(BayesFactor)

male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")
male_100_pres <- male_100_html %>% html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>% html_text()
male_100_htext <- male_100_htext[[1]]

male_100 <- readr::read_fwf(male_100_htext,
                            skip = 1,
                            n_max = 3178,
                            col_types = cols(.default = col_character()),
                            col_positions = fwf_positions(
                              c(1, 16, 27, 35, 66, 74, 86, 93, 123),
                              c(15, 26, 34, 65, 73, 85, 92, 122, 132)
                            ))

male_100 <- male_100 %>%
  select(X2, X4) %>%
  transmute(timing = X2, runner = X4) %>%
  mutate(timing = gsub("A", "", timing),
         timing = as.numeric(timing)) %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
                       "Justin Gatlin", "Maurice Greene", "Tyson Gay")) %>%
  mutate_if(is.character, as.factor) %>%
  droplevels
male_100
```

```
## # A tibble: 520 x 2
##    timing runner
##     <dbl> <fct>
##  1   9.58 Usain Bolt
##  2   9.63 Usain Bolt
##  3   9.69 Usain Bolt
##  4   9.69 Tyson Gay
##  5   9.69 Yohan Blake
##  6   9.71 Tyson Gay
##  7   9.72 Usain Bolt
##  8   9.72 Asafa Powell
##  9   9.74 Asafa Powell
## 10   9.74 Justin Gatlin
## # … with 510 more rows
```

```r
male_100$runner <- forcats::fct_reorder(male_100$runner, male_100$timing)
male_100 %>%
  group_by(runner) %>%
  summarise(mean_timing = mean(timing)) %>%
  arrange(mean_timing)
```

```
## # A tibble: 6 x 2
##   runner         mean_timing
##   <fct>                <dbl>
## 1 Usain Bolt            9.90
## 2 Asafa Powell          9.94
## 3 Tyson Gay             9.95
## 4 Justin Gatlin         9.96
## 5 Yohan Blake           9.96
## 6 Maurice Greene        9.97
```

```r
male_100 %>%
  group_by(runner) %>%
  summarise(median_timing = median(timing)) %>%
  arrange(median_timing)
```

```
## # A tibble: 6 x 2
##   runner         median_timing
##   <fct>                  <dbl>
## 1 Usain Bolt              9.9
## 2 Asafa Powell            9.95
## 3 Yohan Blake             9.96
## 4 Justin Gatlin           9.97
## 5 Maurice Greene          9.97
## 6 Tyson Gay               9.97
```

Most of the code above is simply shortened from the original post. The only thing that is completely new is `forcats::fct_reorder(male_100$runner, male_100$timing)`, which takes the `runner` factor and reorders it according to the median timing by runner. This doesn’t matter for the calculations we’ll do, but it will make the plots look nicer.
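If you’d rather not pull in `forcats`, base R’s `reorder()` does the same job; here is a minimal sketch on made-up data (the runner names and timings below are toy values, not the real dataset):

```r
# Base-R analogue of forcats::fct_reorder(): reorder factor levels
# by a summary (here the median) of another variable.
df <- data.frame(
  runner = factor(c("A", "A", "B", "B")),  # toy runners
  timing = c(9.9, 9.9, 9.7, 9.7)           # toy timings
)
df$runner <- reorder(df$runner, df$timing, FUN = median)
levels(df$runner)  # "B" comes first because its median timing is lower
```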

### Testing, testing!

One of the issues with a frequentist approach compared to a programmer’s approach is that there are a lot of different tests you could choose. But in this case, wearing my frequentist hat, there really are only two choices: a one-way ANOVA or the Kruskal-Wallis, which uses ranks and eliminates some assumptions.

This also gives me a chance to talk about a great package that supports both frequentist and bayesian methods and completely integrates visualizing your data with analyzing your data, which IMHO is the only way to go. The package is `ggstatsplot`. Full disclosure: I’m a minor contributor to the package, but please know that the true hero of the package is Indrajeet Patil. It’s stable, mature, well tested and well maintained, so try it out.

So let’s assume we want to run a classic one-way ANOVA first (Welch’s method so we don’t have to assume equal variances across groups). Assuming that the omnibus F test is significant, let’s say we’d like to look at the pairwise comparisons and adjust the p values for multiple comparisons using Holm. We’re a big fan of visualizing the data by runner, and of course we’d like to plot things like the mean and median and the number of races per runner. We’d of course like to know our effect size (we’ll stick with eta squared), and we’d like it all as elegant as possible.

Doing this analysis using frequentist methods in `R` is not difficult. Heck, I’ve even blogged about it myself, it’s so “easy”. The benefit of `ggbetweenstats` from `ggstatsplot` is that it pretty much allows you to do just about everything in one command, seamlessly mixing the plot and the results into one output. We’re only going to scratch the surface of all the customization possibilities.

```r
ggbetweenstats(
  data = male_100,
  x = runner,
  y = timing,
  type = "p",
  var.equal = FALSE,
  pairwise.comparisons = TRUE,
  partial = FALSE,
  effsize.type = "biased",
  point.jitter.height = 0,
  title = "Parametric (Mean) testing assuming unequal variances",
  ggplot.component = ggplot2::scale_y_continuous(breaks = seq(9.6, 10.4, .2),
                                                 limits = c(9.6, 10.4)),
  messages = FALSE
)
```

Our conclusion is similar to that drawn by simulation. We can clearly reject the null that all these runners have the same mean time. Using Games-Howell and controlling for multiple comparisons with Holm, however, we can only show support for the difference between Usain Bolt and Maurice Greene. There is insufficient evidence to reject the null for all the other possible pairings. (You can actually tell `ggbetweenstats` to show the p value for all the pairings, but that gets cluttered quickly.)

From a frequentist perspective there is a whole set of non-parametric tests available for use. They typically make fewer assumptions about the data we have and often operate by exchanging the ranks of the outcome variable (`timing`) for the raw numbers. The only thing we need to change in our input to the function is `type = "np"` and the `title`.

```r
ggbetweenstats(
  data = male_100,
  x = runner,
  y = timing,
  type = "np",
  var.equal = FALSE,
  pairwise.comparisons = TRUE,
  partial = FALSE,
  effsize.type = "biased",
  point.jitter.height = 0,
  title = "Non-Parametric (Rank) testing",
  ggplot.component = ggplot2::scale_y_continuous(breaks = seq(9.6, 10.4, .2),
                                                 limits = c(9.6, 10.4)),
  messages = FALSE
)
```

Without getting overly consumed by the exact numbers, note the very similar results for the overall test, but also that we are now more confident about the difference between Usain Bolt and Justin Gatlin. I highlight that because there is a common misconception that non-parametric tests are always less powerful (sensitive) than their parametric cousins.
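For what it’s worth, the rank-based omnibus test behind `type = "np"` is also available in base R as `kruskal.test`. Here is a toy sketch with made-up timings (not the real sprint data) showing the rank test comfortably detecting a real gap:

```r
# Toy data: two "runners", one clearly faster on average
set.seed(7)
toy <- data.frame(
  timing = c(rnorm(30, mean = 9.85, sd = 0.05),
             rnorm(30, mean = 9.95, sd = 0.05)),
  runner = rep(c("fast", "slow"), each = 30)
)

# Base-R Kruskal-Wallis rank test: an omnibus test on ranks, not raw values
kt <- kruskal.test(timing ~ runner, data = toy)
kt$p.value  # very small: the rank test easily detects a gap this size
```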

### Asking the question differently (see Learning Statistics with R )

Much of the credit for this section goes to Danielle Navarro (bookdown translation: Emily Kothe) in *Learning Statistics with R*.

It usually takes several lessons or even an entire semester to teach the frequentist method, because null hypothesis testing is a very elaborate contraption that people (well, in my experience, very smart undergraduate students) find very hard to master. In contrast, the Bayesian approach to hypothesis testing *“feels”* far more intuitive. Let’s apply it to our current scenario.

We’re at the bar, the three of us, wondering whether Usain Bolt is really the fastest or whether all these individual data points are just a random mosaic of data noise. Both the programmer and the frequentist set the testing up conceptually the same way: can we use the data to reject the null that all the runners are the same? Convinced they’re not all the same, they applied the same general procedure to reject (or not) the hypothesis that any pair was the same, for example Bolt versus Powell (for the record, I’m not related to either). They differ in computational methods and assumptions but not in overarching method. At the end of their machinations they have no ability to talk about how likely (probable) it is that runner 1 will beat runner 2. Oftentimes that’s exactly what you really want to know. There are two hypotheses that we want to compare: a null hypothesis h_{0} that all the runners run equally fast, and an alternative hypothesis h_{1} that they don’t. Prior to looking at the data, while we’re sitting at the bar, we have no real strong belief about which hypothesis is true (odds are 1:1 in our naive state). We have our data and we want it to inform our thinking. Unlike frequentist statistics, Bayesian statistics allow us to talk about the probability that the null hypothesis is true (which is a complete **no no** in a frequentist context). Better yet, it allows us to calculate the posterior probability of the null hypothesis, using Bayes’ rule and our data.

In practice, most Bayesian data analysts tend not to talk in terms of the raw posterior probabilities. Instead, we/they tend to talk in terms of the posterior odds ratio. Think of it like betting. Suppose, for instance, the posterior probability of the null hypothesis is 25%, and the posterior probability of the alternative is 75%. The alternative hypothesis **h_{1}** is three times as probable as the null hypothesis **h_{0}**, so we say that the odds are 3:1 in favor of the alternative.

At the end of the Bayesian’s efforts they can make what feel like very natural statements of interest, for example: “The evidence provided by our data corresponds to odds of 42:1 that these runners are not all equally fast.”
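The arithmetic behind those statements is genuinely this simple; a quick sketch (the 42:1 figure is just the illustrative number from the sentence above):

```r
# Posterior odds are just a ratio of posterior probabilities
p_alt  <- 0.75
p_null <- 0.25
odds <- p_alt / p_null        # 3, i.e. "3:1 in favor of the alternative"

# With 1:1 prior odds, the Bayes factor IS the posterior odds, so converting
# back to a posterior probability is one division
bf <- 42
posterior_p <- bf / (1 + bf)  # ~0.977

# A Bayes factor reported on the log scale is undone with exp()
exp(2.9)                      # ~18.2, so odds of roughly 18:1
```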

Let’s try it using ggbetweenstats…

```r
ggbetweenstats(
  data = male_100,
  x = runner,
  y = timing,
  type = "bf",
  var.equal = FALSE,
  pairwise.comparisons = TRUE,
  partial = FALSE,
  effsize.type = "biased",
  point.jitter.height = 0,
  title = "Bayesian testing",
  messages = FALSE
)
```

Yikes! Not what I wanted to see in the bar. The pairwise comparisons have gone away (we’ll get them back) and, worse yet, what the heck does log_{e}(BF_{10}) = 2.9 mean? I hate log conversions; I was promised a real number like 42:1! Who’s Cauchy and why is he there at 0.707?

Let’s break this down. log_{e}(BF_{10}) = 2.9 is also `exp(2.9)`, or about 18, so the good news is the odds are better than 18:1 that the runners are not equally fast. Since rounding no doubt loses some accuracy, let’s use the `BayesFactor` package directly and get a more accurate answer before we round. `anovaBF(timing ~ runner, data = as.data.frame(male_100), rscaleFixed = .707)` is what we want, where `rscaleFixed = .707` ensures we have the right Cauchy value.

```r
anovaBF(timing ~ runner, data = as.data.frame(male_100), rscaleFixed = .707)
```

```
## Bayes factor analysis
## --------------
## [1] runner : 19.04071 ±0.01%
##
## Against denominator:
##   Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
```

Okay, that’s better. So to Bayesian thinking, the odds are 19:1 against the hypothesis that they all run about the same speed; in other words, 19:1 in favor of their running at different speeds.

Hmmm. One of the strengths/weaknesses of the Bayesian approach is that people can have their own sense of how strong 19:1 is. I like those odds. One of the really nice things about the Bayes factor is that the numbers are inherently meaningful. If you run the data and compute a Bayes factor of 4, it means that the evidence provided by your data corresponds to betting odds of 4:1 in favor of the alternative. However, there have been some attempts to quantify the standards of evidence that would be considered meaningful in a scientific context. One that is widely used is from Kass and Raftery (1995). (**N.B. there are others and I have deliberately selected one of the most conservative standards. See for example https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12888-018-1761-4/MediaObjects/12888_2018_1761_Fig1_HTML.png**)

| Bayes factor value | Interpretation |
|---|---|
| 1 – 3 | Negligible evidence |
| 3 – 20 | Positive evidence |
| 20 – 150 | Strong evidence |
| > 150 | Very strong evidence |

Okay, we have “positive evidence” and we can quantify it; that’s good. But what about all the pairwise comparisons? Can we take this down to all the individual pairings? I’m on the edge of my bar stool here. What are the odds Bolt really is faster than Powell? Can we quantify that without somehow breaking the multiple comparisons rule?

The short answer is yes, we can safely extend this methodology to incorporate pairwise comparisons. We shouldn’t abuse the method and we should fit our model with the best possible prior information, but in general, as simulated here:

“With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you don’t have to correct Bayesian inferences for multiple comparisons.”

With that in mind, let’s build a quick little function that will allow us to pass a data source and two names and run a Bayesian t-test via `BayesFactor::ttestBF` to compare two runners. `ttestBF` returns a lot of info in a custom object, so we’ll use the `extractBF` function to grab it in a format where we can pluck out the actual BF_{10}.

```r
compare_runners_bf <- function(df, runner1, runner2) {
  ds <- df %>%
    filter(runner %in% c(runner1, runner2)) %>%
    droplevels %>%
    as.data.frame
  zzz <- ttestBF(formula = timing ~ runner, data = ds)
  yyy <- extractBF(zzz)
  xxx <- paste0("The evidence provided by the data corresponds to odds of ",
                round(yyy$bf, 0),
                ":1 that ",
                runner1,
                " is faster than ",
                runner2)
  return(xxx)
}
```

Now that we have a function, we can see what the odds are that Bolt is faster than the other 5 and print them one by one.

```r
compare_runners_bf(male_100, "Usain Bolt", "Asafa Powell")
```

```
## [1] "The evidence provided by the data corresponds to odds of 5:1 that Usain Bolt is faster than Asafa Powell"
```

```r
compare_runners_bf(male_100, "Usain Bolt", "Tyson Gay")
```

```
## [1] "The evidence provided by the data corresponds to odds of 5:1 that Usain Bolt is faster than Tyson Gay"
```

```r
compare_runners_bf(male_100, "Usain Bolt", "Justin Gatlin")
```

```
## [1] "The evidence provided by the data corresponds to odds of 21:1 that Usain Bolt is faster than Justin Gatlin"
```

```r
compare_runners_bf(male_100, "Usain Bolt", "Yohan Blake")
```

```
## [1] "The evidence provided by the data corresponds to odds of 8:1 that Usain Bolt is faster than Yohan Blake"
```

```r
compare_runners_bf(male_100, "Usain Bolt", "Maurice Greene")
```

```
## [1] "The evidence provided by the data corresponds to odds of 1355:1 that Usain Bolt is faster than Maurice Greene"
```

Okay, now I feel like we’re getting somewhere with our bar discussions. Should I feel inclined to make a little wager on, say, who buys the next round of drinks, as a Bayesian I have some nice useful information. I’m not rejecting a null hypothesis; I’m casting the information I have as a statement of the odds I think I have of “winning”.

But of course this isn’t the whole story so please read on…

### Who’s Cauchy and why does he matter?

Earlier I made light of the fact that the output from `ggbetweenstats` had r_{Cauchy} = 0.707 and `anovaBF` uses `rscaleFixed = .707`. Now we need to spend a little time actually understanding what that’s all about. Cauchy is Augustin-Louis Cauchy, and the reason that’s relevant is that BayesFactor makes use of his distribution as a default. I’m not even going to try and take you into the details of the math, but it is important we have a decent understanding of what we’re doing to our data.

The BayesFactor package has a few built-in “default” named settings. They all have the same shape; they only differ by their scale, denoted by r. The three named defaults are medium = 0.707, wide = 1, and ultrawide = 1.414. “Medium” is the default. The scale controls how large, on average, the expected true effect sizes are. For a particular scale, 50% of the true effect sizes are within the interval (−r, r). For the default scale of “medium”, 50% of the prior effect sizes are within the range (−0.7071, 0.7071). Increasing r increases the sizes of expected effects; decreasing r decreases the size of the expected effects.
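That 50% claim follows directly from the Cauchy distribution itself, and you can verify it in base R with `pcauchy`; the "medium" scale below is the same 0.707 we keep seeing:

```r
# For a Cauchy prior centered at 0 with scale r, exactly half the prior
# mass on the effect size lies inside (-r, r), whatever r is.
r_medium <- sqrt(2) / 2  # 0.7071..., the BayesFactor "medium" default
mass <- pcauchy(r_medium, scale = r_medium) -
        pcauchy(-r_medium, scale = r_medium)
mass  # 0.5, by the symmetry of the Cauchy distribution
```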

Let’s compare it to a frequentist test we’re all likely to know, the t-test (we’ll use the Welch variant). Our initial hypothesis is that Bolt’s mean times are different from Powell’s mean times (two-sided); then we’ll test the one-sided hypothesis that Bolt is faster. Then let’s go a little crazy and run it one-sided but specify the mean difference of 0.038403 seconds that we “see” in our data, `mu = -0.038403`.

```r
justtwo <- male_100 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
  droplevels %>%
  as.data.frame
t.test(formula = timing ~ runner, data = justtwo)
```

```
##
##  Welch Two Sample t-test
##
## data:  timing by runner
## t = -2.5133, df = 111.58, p-value = 0.01339
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06868030 -0.00812721
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell
##                   9.904930                   9.943333
```

```r
t.test(formula = timing ~ runner, data = justtwo, alternative = "less")
```

```
##
##  Welch Two Sample t-test
##
## data:  timing by runner
## t = -2.5133, df = 111.58, p-value = 0.006694
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         -Inf -0.01306002
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell
##                   9.904930                   9.943333
```

```r
t.test(formula = timing ~ runner, data = justtwo, alternative = "less", mu = -0.038403)
```

```
##
##  Welch Two Sample t-test
##
## data:  timing by runner
## t = -4.9468e-05, df = 111.58, p-value = 0.5
## alternative hypothesis: true difference in means is less than -0.038403
## 95 percent confidence interval:
##         -Inf -0.01306002
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell
##                   9.904930                   9.943333
```

Hopefully that last one didn’t trip you up: by definition, if the mean difference in our sample data is -0.038403 and we test against exactly that value, the p value should be 0.5, a 50/50 proposition.
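You can convince yourself of that 50/50 result with a toy example (made-up numbers, not the real sprint data): testing against a `mu` equal to the observed difference itself forces the t statistic to 0, and the one-sided p value to exactly 0.5.

```r
# Toy illustration: when mu equals the observed mean difference,
# t = (diff - mu) / se = 0, so the one-sided p value is exactly 0.5.
set.seed(1)
a <- rnorm(30, mean = 9.90, sd = 0.08)  # toy "Bolt" timings
b <- rnorm(30, mean = 9.95, sd = 0.08)  # toy "Powell" timings
observed <- mean(a) - mean(b)
res <- t.test(a, b, alternative = "less", mu = observed)
res$p.value  # 0.5
```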

Let’s first just try different r_{Cauchy} values with `ttestBF`.

```r
justtwo <- male_100 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
  droplevels %>%
  as.data.frame
ttestBF(formula = timing ~ runner, data = justtwo, rscale = "medium")
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 5.164791 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
```

```r
ttestBF(formula = timing ~ runner, data = justtwo, rscale = "wide")
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=1 : 4.133431 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
```

```r
ttestBF(formula = timing ~ runner, data = justtwo, rscale = .2)
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=0.2 : 6.104113 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
```

Okay, the default medium returns just what we reported earlier, 5:1 odds. Going wider gets us 4:1, and going narrower (believing the difference is smaller) takes us to 6:1. Not huge differences, but noticeable and driven by our data.

Let’s investigate directional hypotheses with `ttestBF`. First, let’s ask what’s the evidence that Bolt is faster than Powell. **NB: the order is driven by the factor level in the dataframe, not the order in the filter command below. Also note that faster is a lower number.**

```r
justtwo <- male_100 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
  droplevels %>%
  as.data.frame
# notice these two just return the same answer in a different order
ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(0, Inf))
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 0<d<Inf    : 0.04312062 ±0%
## [2] Alt., r=0.707 !(0<d<Inf) : 10.28646   ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
```

```r
ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(-Inf, 0))
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf<d<0    : 10.28646   ±0%
## [2] Alt., r=0.707 !(-Inf<d<0) : 0.04312062 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
```

So the odds that Bolt has a bigger number, i.e. is slower than Powell, are 0.04:1, and conversely the odds that Bolt has a smaller timing (is faster) are 10:1. You can feel free to put these in the order that makes the most sense for your workflow. They’re always going to be mirror images.

And yes, in some circumstances it is perfectly rational to combine the information by dividing those odds. See this blog post for a research scenario; accomplishing it is trivial. Running this code snippet essentially combines what we know in both directions of the hypothesis.

```r
justtwo <- male_100 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
  droplevels %>%
  as.data.frame
powellvbolt <- ttestBF(formula = timing ~ runner, data = justtwo,
                       nullInterval = c(-Inf, 0))
powellvbolt[1] / powellvbolt[2]
```

```
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf<d<0 : 238.5509 ±0%
##
## Against denominator:
##   Alternative, r = 0.707106781186548, mu =/= 0 !(-Inf<d<0)
## ---
## Bayes factor type: BFindepSample, JZS
```

### What have I learned

- All three approaches yielded similar answers to our question at the bar.
- Frequentist methods have stood the test of time and are pervasive in the literature.
- Computational methods like resampling allow us to free ourselves from some of the restrictions and assumptions of classical hypothesis testing, in an age when compute power is cheap.
- Bayesian methods allow us to speak in the very human vernacular language of probabilities about our evidence.

### Done!

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)
