Fun with M&M’s – April 3, 2018

[This article was first published on Chuck Powell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post we’re going to explore the Chi Squared Goodness of Fit test using M&M’s as our subject material. From there we’ll take a look at simultaneous confidence intervals a.k.a. multiple comparisons. On the R side of things we’ll make use of some old friends like ggplot2 and dplyr but we’ll also make use of two packages that were new to me scimp and ggimage. We’ll also make heavy use of the kable package to make our output tables look nicer.

Background and credits

See this blog post by Rick Wicklin for the full background story. This posting is simply an attempt to do the same sort of analysis in R. It is an expansion of work that Bob Rudis did on RPubs.

Let’s load the required R packages. See Bob Rudis’ Github pages for more on scimple. Let’s take care of housekeeping and set up the right libraries.

# devtools::install_github("hrbrmstr/scimple")
library(hrbrthemes) # for scales
# I had to install Image Magick first on my Mac as well as EBImage 
# install.packages("ggimage")

options(knitr.table.format = "html") 
cap_src <- "Source: <>"

SAS M&M’s Measurements

The breakroom containers at SAS are filled from two-pound bags. So as to not steal all the M&M’s in the breakroom, [Rick] conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M’s each week.

Create a dataframe called mms that contains information about:

Column Contains Type
color_name What color M&M factor
official_color color as hex code according to Mars standards char
count observed frequency counts in SAS breakrooms dbl
prop_2008 expected freq as a % (Mars 2008) dbl
imgs filenames for the M&M lentils char
prop convert observed counts to proportions dbl
mms <- data_frame(
  color_name = c("Red", "Orange", "Yellow", "Green", "Blue", "Brown"),
  official_color = c("#cb1634", "#eb6624", "#fff10a", "#37b252", "#009edd", "#562f14"), 
  count = c(108, 133, 103, 139, 133, 96),
  prop_2008 = c(0.13, 0.20, 0.14, 0.16, 0.24, 0.13),
  imgs=c("im-red-lentil.png", "im-orange-lentil.png", "im-yellow-lentil.png",
         "im-green-lentil.png", "im-blue-lentil.png", "im-brown-lentil.png")
) %>% 
  mutate(prop = count / sum(count),
         color_name = factor(color_name, levels=color_name))

The data set contains the cumulative counts for each of the six colors in a sample of size N = 712. Let’s graph the observed percentages as bars (ordered by frequency) and the expected percentages that Mars published in 2008 as black diamonds.

ggplot(mms, aes(reorder(color_name,-prop), prop, fill=official_color)) +
  geom_col(width=0.85) +
  geom_point(aes(color_name,prop_2008),shape=18,size = 3) +
  scale_y_percent(limits=c(0, 0.25)) +
  scale_fill_identity(guide = FALSE) +
  labs(x=NULL, y=NULL, 
       title=sprintf("Observed distribution of M&M colors for a sample of N=%d", sum(mms$count)),
       subtitle="Black diamonds represent 2008 Mars company published percentages (expected)",
       caption=cap_src) +

M&M Bargraph M&M Bargraph

The same data as a table:

mms %>% 
  arrange(desc(count)) %>%
         ) %>% 
  select(Color=color_name, Observed=count, `Observed %`=prop, `Expected %`=prop_2008, Difference=difference) %>% 
  kable(align="lrrrr") %>% 
  kable_styling(full_width = FALSE)
Color Observed Observed % Expected % Difference
Green 139 19.52% 16% 3.52%
Orange 133 18.68% 20% -1.32%
Blue 133 18.68% 24% -5.32%
Red 108 15.17% 13% 2.17%
Yellow 103 14.47% 14% 0.47%
Brown 96 13.48% 13% 0.48%

Chi-Squared Goodness of Fit Test Results

Whether we look at the results in a graph or a table there are clearly differences between expected and observed for most of the colors. We would expect to find some differences but the overall question is do our data fit the “model” that is inherent in the expected 2008 data we have from Mars? The statistical test for this is the Chi-Square Goodness of Fit (GoF). Let’s run it on our data. We give the test our observed counts mms$count as well as p=mms$prop_2008 which indicates what our expected probabilities (proportions) are. If we didn’t specify then the test would be run against the hypothesis that they M&M’s were equally likely. The broom::tidy() takes the output from the Chi Square test converts it to a data frame and allows us to present it neatly using kable.

chisq.test(mms$count, p=mms$prop_2008) %>% 
  broom::tidy() %>% 
  select(`Chi Squared`=statistic, `P Value`=p.value, `Degrees of freedom`=parameter, 
                `R method`=method) %>%
  kable(align = "rrcl",digits=3) %>% 
  kable_styling(full_width = FALSE)
Chi Squared P Value Degrees of freedom R method
17.353 0.004 5 Chi-squared test for given probabilities

We can reject the null hypothesis at the alpha = 0.05 significance level (95% confidence). In other words, the distribution of colors for M&M’s in this 2016/2017 sample does NOT appear to be the same as the color distribution we would expect given the data from Mars published in 2008!

The data provide support for the hypothesis that the overall distribution doesn’t match what Mars said it should be. That’s exciting news, but leaves us with some other unanswered questions. One relatively common question is, how “big” is the difference or the effect? Is this a really big discrepancy between the published data and our sample? Is there a way of knowing how big this difference is?

Let’s start answering the second question first. Effect size is a measure we use in statistics to express how big the differences are. For this test the appropriate measure of effect size is Cohen’s w which can be calculated from the Chi squared statistic and N.

chisquaredresults<-broom::tidy(chisq.test(mms$count, p=mms$prop_2008))

## [1] 0.1561162

So our value for Cohen’s w is 0.156 . The rule of thumb for interpreting this number indicates that this is a small effect size Obviously you should exercise professional judgment in interpreting effect size but it does not appear that the differences are worthy of a world wide expose at this time…

On to our other question…

Is there a way of telling by color which quantities of M&M’s are significantly different? After all a cursory inpsection of the graph or the table says that green and blue seem to be “off” quite a bit while yellow and brown are really close to what we would expect! Is there a way, now that we have conducted an overall omnibuds test of the goodness of fit (GOF), we can refine our understanding of the differences color by color?

Simultaneous confidence intervals for the M&M proportions (multiple comparisons)

Any sample is bound have some random variability compared to the true population count or percentage. How can we use confidence intervals to help us understand whether the data are indicating simple random variation or whether the underlying population is different. By now you no doubt have thought of confidence intervals. We just need to compute the confidence interval for each color and then see whether the percentages provided by Mars lie inside or outside the confidence interval our sample generates. We would expect that if we ran our experiment 100 times with our sample size numbers for each color the Mars number would lie inside the upper and lower limit of our confidence interval 95 times out of those 100 times. If our data shows it outside the confidence interval that is evidence of a statistically significant difference.

Ah, but there’s a problem! We have 6 colors and we would like to test each color to see if it varies significantly. Assuming we want to have 95% confidence again, across all six colors, we are “cheating” if we compute a simple confidence interval and then run the test six times. It’s analogous to rolling the die six times instead of once. The more tests we run the more likely we are to find a difference even though none exists. We need to adjust our confidence to account for the fact that we are making multiple comparisons (a.k.a. simultaneous comparisons). Our confidence interval must be made wider (more conservative) to account for the fact we are making multiple simultaneous comparisons. Thank goodness the tools exist to do this for us. As a matter of fact there is no one single way to make the adjustment… there are many. We’re going to focus on Goodman.

In his original posting Rick used SAS scripts he had written for a previous blog post to overcome this challenge. As R users we have a few different packages for computing simultaneous confidence intervals (as well as the option of simply doing the calculations in base R). Bob Rudis took a look at several different choices in R packages but one of the “better” ones CoinMinD does the computations nicely and then prints out the results (literally with print()) as opposed to returning data we can act upon. So he made a new package that does the same computations and returns tidy data frames for the confidence intervals. The package is much cleaner and it includes a function that can compute multiple SCIs and return them in a single data frame, similar to what binom::binom.confint() does.

Here are a couple examples of scimple in action. We’ll feed it the counts mms$count we have, and ask it to use the Goodman method for computing the confidence interval for each of the six colors assuming we want 95% confidence alpha = .05. For comparison we’ll also run the Wald method with continuity correction.

The command is scimp_goodman(mms$count, alpha=0.05). I’ve added a select statement to remove some columns for clarity. The scimp_waldcc(mms$count, alpha=0.05) shows you the more verbose output for Wald.

scimp_goodman(mms$count, alpha=0.05) %>% 
  select( `95% Lower`=lower_limit, `95% Upper`=upper_limit) %>%
  kable(align = "lrrrrr",caption = "Goodman Method") %>% 
  kable_styling(full_width = FALSE)
Goodman Method
95% Lower 95% Upper
0.1196016 0.1905134
0.1513616 0.2282982
0.1133216 0.1828844
0.1590634 0.2372872
0.1513616 0.2282982
0.1045758 0.1721577
scimp_waldcc(mms$count, alpha=0.05) %>% 
  kable(align = "lrrrrr",caption = "Wald Continuity Correction") %>% 
  kable_styling(full_width = FALSE)
Wald Continuity Correction
method lower_limit upper_limit adj_ll adj_ul volume
waldcc 0.1246345 0.1787363 0.1246345 0.1787363 0
waldcc 0.1574674 0.2161281 0.1574674 0.2161281 0
waldcc 0.1181229 0.1712030 0.1181229 0.1712030 0
waldcc 0.1654077 0.2250417 0.1654077 0.2250417 0
waldcc 0.1574674 0.2161281 0.1574674 0.2161281 0
waldcc 0.1090419 0.1606210 0.1090419 0.1606210 0

For each of the commands, back comes a tibble with six rows (one for each color) with the upper and lower bounds as well as other key data from the process for each method. Notice that the confidence interval width varies by color (row in the tibble) based on observed sample size and that the Goodman intervals are wider (more conservative) when you compare rows across tables with the Wald Continuity Correction method.

The documentation on GitHub that Bob Rudis provided has a nice graph that shows you the 6 different methods and how they would place the confidence intervals for the exact same observed data. Clearly YMMV depending on which method you choose.

Armed with this great package that Bob provided let’s bind these corrected confidence intervals to the data we have and see if we can determine whether our intuitions about which colors are significantly different from the expected values are accurate…

mms <- bind_cols(mms, scimp_goodman(mms$count, alpha=0.05))
mms %>% 
        `95% Lower`=lower_limit, 
        `95% Upper`=upper_limit, 
        Expected=prop_2008) %>% 
  kable(align=c("lrrrrr"), digits=3, caption="Simultaneous confidence Intervals (Goodman method)") %>% 
  kable_styling(full_width = FALSE, position = "center")
Simultaneous confidence Intervals (Goodman method)
Color Observed Percent 95% Lower 95% Upper Expected
Red 108 0.152 0.120 0.191 0.13
Orange 133 0.187 0.151 0.228 0.20
Yellow 103 0.145 0.113 0.183 0.14
Green 139 0.195 0.159 0.237 0.16
Blue 133 0.187 0.151 0.228 0.24
Brown 96 0.135 0.105 0.172 0.13

Hmmm, the table shows that only blue (0.24) is outside the 95% confidence interval, with green (0.16) just barely inside its interval. The rest are all somewhere inside the confidence interval range. We could of course choose a less stringent or conservative method than Goodman. Or we could choose and even stricter method! That exercise is left to you. For now though I find the table of numbers hard to read and to parse so let’s build a plot that hopefully makes our life a little easier. Later we’ll make use of ggimage and some work that Bob did to make an even better plot.

mms %>% 
  ggplot() +
  geom_segment(aes(x=lower_limit, xend=upper_limit, y=color_name, 
                   yend=color_name, color=official_color), size=3) +
  geom_point(aes(prop, color_name, fill=official_color), 
              size=8, shape=21, color="white") +
  geom_point(aes(prop_2008, color_name, color=official_color),
             shape="|", size=8) +
  scale_x_percent(limits=c(0.095, 0.25)) +
  scale_color_identity(guide = FALSE) +
  scale_fill_identity(guide = FALSE) +
  labs(x="Proportion", y=NULL, 
       title="Observed vs Expected 2008 for M&M Candies",
       subtitle=sprintf("95%% Simultaneous Confidence Intervals, [N=%d]",
                        sum(mms$count)), caption=cap_src) +

Ah, that’s better sometimes a picture really is worth a thousand numbers… We can now clearly see the observed percent as a circle. The Goodman adjusted confidence interval as a horizontal line and the expected value from the 2008 Mars information as a nice vertical line.

Plot twist – The Cleveland Comparison

So as it turns out, Rick the original author at SAS was able to make contact with the Mars Company and determine that there really was an explanation for the differences. Turns out some changes were made and there are actually two places where these M&M’s might have originated each with slightly different proportions! Who knew? Right?

Let’s take the opportunity to take our new data and the ggimage package and plot the plot twist (pun intended). All credit to Bob for carefully constructing the right commands to ggplot to make this compelling graphic. All we have to do is add the Cleveland plant expected proportions as cleveland_prop to our data since our observed hasn’t changed which means our CI’s remain the same.

url_base <- "" 

mms %>% 
  mutate(imgs=sprintf("%s%s", url_base, imgs)) %>% 
  mutate(cleveland_prop=c(0.131, 0.205, 0.135, 0.198, 0.207, 0.124)) %>% 
  ggplot() +
  geom_segment(aes(x=lower_limit, xend=upper_limit, y=color_name, 
                   yend=color_name, color=official_color), size=2) +
  geom_image(aes(prop, color_name, image=imgs),size=.10) +
  geom_point(aes(cleveland_prop, color_name, color=official_color),
             shape="|", size=6) +
  scale_x_percent(limits=c(0.095, 0.25)) +
  scale_color_identity(guide = FALSE) +
  scale_fill_identity(guide = FALSE) +
  labs(x="Proportion", y=NULL, 
       title="Observed vs 2017 Proportions for M&M Candies",
       subtitle=sprintf("95%% Simultaneous Confidence Intervals, [N=%d]",
                        sum(mms$count)), caption=cap_src) +

Certainly a more intriguing graphic now that we let ggimage put the lentils in there for us…

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)