Multiple comparisons of group-level means is a tricky problem in statistical inference. A standard practice is to adjust the threshold for statistical significance according to the number of pairwise tests performed. For example, according to the widely-known Bonferonni method, if we have 3 different groups for which we want to compare the means of a given variable, we would divide the standard significance level (.05 by convention) by the number of tests performed (3 in this case), and only conclude that a given comparison is statistically significant if its p-value falls below .017 (e.g. .05/3).
In this post, we’ll employ an approach often recommended by Andrew Gelman, and use a multi-level model to examine pairwise comparisons of group means. In this method, we use the results of a model built with all the available data to draw simulations for each group from the posterior distribution estimated from our model. We use these simulations to compare group means, without resorting to the expensive corrections typical of standard multiple comparison analyses (e.g. the Bonferroni method referenced above).
We will return to a data source that we have examined previously on this blog: walking data from the Accupedo app on my phone. Accupedo is a free step-counting app that I’ve been using for more than 3 years to track how many steps I walk each day. It’s relatively straightforward to import these data into R and to extract the cumulative number of steps taken per hour.
In the data used for this post, I exclude all measurements taken before 6 AM (because the step counts are cumulative we don’t lose any steps by excluding these observations). This results in 18,117 observations of cumulative hourly step counts for 1113 days. The first observation is on 3 March 2015 and the last on 22 March 2018, resulting in just over 3 years of data!
A sample of the dataset, called aggregate_day_hour, is shown below:
Oneday represents the calendar year/month/day the observation was recorded, dow contains the name of the day of the week, hour is the hour of the day (in 24-hour or “military” style), steps gives the cumulative step count and week_weekend indicates whether the day in question is a weekday or a weekend.
Our ultimate purpose here is to compare step counts across a large number of days, without having to adjust our threshold for statistical significance. In this example, we will compare the step counts for 50 different days. The standard Bonferonni correction would require us to divide the significance level by the number of possible pairwise comparisons, 1225 in this case. The resulting significance level is .05/1225, or .00004, which imposes a very strict threshold, making it more difficult to conclude that a given comparison is statistically significant.
As mentioned above, the method adopted in this post is one that Andrew Gelman often advocates on his blog. This advice often comes in the form of: “fit a multilevel model!,” with a link to this paper by Gelman, Hill and Yajima (2012). I have to admit that I never fully understood the details of the method from the blog posts. However, by looking at the paper and the explanations and code presented in Chapter 12 of Gelman and Hill’s (2007) book, I think I’ve figured out what he means and how to do it. This post represents my best understanding of this method.1
In brief (see the paper for more details), we calculate a multilevel model using flat priors (e.g. the standard out-of-the-box model in the lme4 package). We extract the coefficients from the model and the standard deviation of the residual error of estimated step counts per day (also known as sigma hat), and use these values to simulate step count totals for each day (based on the estimated total step count and sigma hat). We then compare the simulations for each day against each other in order to make pairwise comparisons across days.
Sound confusing? You’re not alone! I’m still grappling with how this works, but I’ll gently walk through the steps below, explaining the basic ideas and the code along the way.
Step 1: Calculate the Model
In order to perform our pairwise comparisons of 50 days of step counts, we will construct a multi-level model using all of the available data. We’ll employ a model similar to one we used in a previous post on this blog. The model predicts step count from A) the hour of the day and B) whether the day is a weekday or a weekend. We specify fixed effects (meaning regression coefficients that are the same across all days) for hour of the day and time period (weekday vs. weekend). We specify a random effect for the variable that indicates the day each measurement was taken (e.g. a random intercept per day, allowing the average number of estimated steps to be different on each day). We center the hour variable at 6 PM (more on this below).
We can calculate this model using the lme4 package. We will also load the arm package (written to accompany Gelman and Hill’s (2007) book) to extract model parameters we’ll need to compute our simulations below.
The estimates of the fixed effects are as follows:
|Estimate||Std. Error||t value|
Because we have centered our hour variable at 6 PM, the intercept value gives the estimated step count at 6 PM during a weekday (e.g. when the week_weekend variable is 0, as is the case for weekdays in our data). The model thus estimates that at 6 PM on a weekday I have walked 13,709.05 steps. The hour_centered coefficient gives the estimate of the number of steps per hour: 1,154.39. Finally, the week_weekendWeekend variable gives the estimated difference in the total number of steps per day I walk on a weekend, compared to a weekday. In other words, I walk 1,860.15 fewer steps on a weekend day compared to a weekday.
Step 2: Extract Model Output
In order to compare the step counts across days, we will make use of the coefficients from our multilevel model. We can examine the coefficients (random intercepts and fixed effects for hour and day of week) with the following code:
Which the gives the coefficient estimates for each day (first 10 rows shown):
We will extract these coefficients from the model output, and merge in the day of the week (week vs. weekend) for every date in our dataset. We will then create a binary variable for week_weekend, which takes on a value of 0 for weekdays and 1 for weekends (akin to that automatically created when running the multilevel model with the lme4 package):
Our resulting dataframe looks like this:
These coefficients will allow us to create an estimate of the daily step counts for each day.
Standard Deviation of Residual Errors: Sigma Hat
In order to simulate draws from the posterior distribution implied for each day, we need the estimated step counts per day (also known as y hat, which we’ll calculate using the coefficient dataframe above), and the sigma hat value (e.g. the standard deviation of the residual error from our estimated step counts) from the model. As Gelman and Hill explain in their book, we can simply extract the standard deviation using the sigma.hat command (this function is part of their arm package, which we loaded above). We’ll store the sigma hat value in a variable called sigma_y_hat. In this analysis, the standard deviation of the residual error of our estimated step counts is 2193.09.
Step 3: Sample 50 days and plot
For this exercise, we’ll randomly select 50 days for our pairwise comparisons. We’ll then plot the model estimated step count for the selected days. Here, I calculate the estimated step counts at 6 PM (using the formula and coefficients from the model) directly in the ggplot2 code.
Our sampled dates dataframe looks like this (first 10 rows shown):
And our plot looks like this:
We can see that ‘2015-12-25’ (Christmas 2015) was day in which I walked very few steps. The highest step count is on ‘2016-08-08’, while the second-highest date is ‘2016-04-23’.
Note that the samples dataframe shown above is no longer in chronological order. However, ggplot2 is smart enough to understand that the x-axis is a date, and orders the values in the plot accordingly.
Step 4: Simulate Posterior Distributions of Step Counts Per Day
We now need to simulate posterior distributions of step counts per day. The method below is, as far as I can understand it, that applied in the Gelman et al. (2012) paper. I have adapted the logic from this paper and the Gelman and Hill book (particularly Chapter 12).
What are we doing?
Our model estimates the overall step counts as a function of our predictor variables. However, the model is not perfect, and the predicted step counts do not perfectly match the observed step counts. This difference between the observed and the predicted step counts for each day is called “residual error” – it’s what your model misses.
We will make 1000 simulations for each day with step count values that we did not actually see, but according to the model estimate and the uncertainty of that estimate (e.g. sigma hat), we could have seen. To do this, we simply calculate the model-predicted step count for each day, and use this value along with the sigma hat value (the standard deviation of our residual error) to sample 1000 values from each day’s implied posterior distribution.
How do we do it?
I wrote a simple function to do this. It takes our sample data frame and, for each row (date) in the dataset, it calculates the estimated step count using the regression formula and parameter estimates and passes this value, along with the sigma hat value, to the rnrom function, drawing 1000 samples from the implied posterior distribution for that day.
Note that I’m not including the hour_centered variable in this function. This is because I’m choosing to estimate the step count when hour_centered is zero (e.g. 6 PM because we centered on this value above).
Our resulting posterior_samples matrix has 1 column for each day, with 1000 samples for each day contained in the rows. Below I show a small subset of the entire matrix – just the first 10 rows for the first 5 columns. It is interesting to note that the first column, representing the first day (‘2015-12-25’) in our dataframe of samples above, has smaller values than the second column (representing the second day, ‘2016-04-21’). If we look at the intercepts in the samples dataset, we can see that the first day is much lower than the second. This difference in our model-estimated intercepts is propagated to the simulations of daily step counts.
Step 5: Compare the Simulations For Each Day
We can see in the above matrix of simulations that the first day (‘2015-12-25’, contained in the first column) has lower values than the second day (‘2016-04-21’, contained in the second column). In order to formally compare the the two columns, we will compare the simulations and see how many times the values in the first column are larger than those in the second column. If values in one column are larger more than 95% of the time, we will say that there is a meaningful (“significant”) difference in step counts between the two dates.
We will apply this logic to all of the possible pairwise comparisons in our sampled 50 dates. The following code accomplishes this (adopted from this tremendous Stackoverflow question + answer):
The first 10 rows of the first 5 columns of our comparison matrix look like this:
The counts in each cell indicate the number of times (out of 1000) that the column value is greater than the row value (we set the diagonals – comparisons of a day with itself – to zero in the above code). The comparison we identified above (‘2015-12-25’ and ‘2016-04-21’) is shown in the first row, second column (and the second row, first column – this matrix contains the same information but in reverse above and below the diagonals). The samples from ‘2016-04-21’ were greater than those from ‘2015-12-25’ 998 times out of 1000!
Step 6: Make the Pairwise Comparison Plot
We can make a heat map using ggplot2 to show which pairwise comparisons are “significantly” different from one another, akin to that used in Gelman et al. (2007).
Because ggplot2 requires data to be in a tidy format, we’ll have to melt the comparison matrix so that it has 1 row per pairwise comparison. The code to do this was based on these very clear explanations of how to make a heatmap with ggplot2:
Our final formatted matrix for plotting, called melted_cormat, looks like this (first 10 rows shown):
The count variable here gives the number of times that the y date simulations were greater than the x date simulations. So the second row above indicates that the simulations on ‘2015-03-29’ were greater than those for ‘2015-03-18’ a total of 222 times.
We can make the heatmap with the following code:
Which gives us the following plot:
How can we read this plot? All of our 50 dates are displayed on the x and y axes. When the row step count is meaningfully larger than the column step count (according to our method, e.g. if more than 950 of the simulations for the row date are greater than those for the column date), the cell is colored in maroon. Otherwise, cells are colored in grey.
Rows that have a lot of maroon values are days that have higher step counts. To take a concrete example, the row representing ‘2016-08-08’ has many red values. If we look at the estimated step count for that day on the first graph above, we can see that it has the highest predicted step count in the sampled dates – over 25,000! It therefore makes sense that the estimated step count on this day is “significantly” larger than those from most of the other dates.
Columns that have many red values are days that have especially low step counts. For example, the column representing ‘2015-12-25’ is mostly red. As we saw above, this date has the lowest predicted step count in our sample – under 5,000! It is no wonder then that this date has a “significantly” lower step count than most of the other days.
How Baysian Am I?
I’m a newbie to Baysian thinking, but I get the sense that Baysian statistics comes in many different flavors. The approach above strikes me as being a little bit, but not completely, Baysian.
That’s So Baysian
This approach is Baysian in that it uses posterior distributions of daily step counts to make the comparisons between days. The approach recognizes that observed daily step counts are just observations from a larger posterior distribution of possible step counts, and we make explicit use of this posterior distribution in the data analysis. In contrast, frequentist methods basically only make use of point estimates of coefficients and the standard errors of those estimates to draw conclusions from data.
Not So Fast!
There are some things about this perspective that aren’t so Baysian. First, we use flat priors in the model; a “more Baysian” approach would assign prior distributions to the intercept and the coefficients, which would then be updated during the model computation. Second, we use the point estimates of our coefficients to compute the estimated daily step counts. A “more Baysian” approach would recognize that the coefficient estimates also have posterior distributions, and would incorporate this uncertainty in the simulations of daily step counts.
What Do I Know?
This is my current best understanding of the differences in Baysian and frequentist perspectives as they apply to what we’ve done here. I’m reading Statistical Rethinking (and watching the accompanying course videos) by Richard McElreath (love it), and these differences are what I’ve understood from the book and videos thus far.
The method used in this blog post is a nice compromise for someone comfortable with frequentist use of multi-level models (like myself). It requires a little bit more work than just interpreting standard multilevel model output, but it’s not a tremendous stretch. Indeed, the more “exotic” procedures (for a Baysian newbie) like assigning priors for the model parameters are not necessary here.
Summary and Conclusion
In this post, we used multilevel modeling to construct pairwise comparisons of estimated step counts for 50 different days. We computed the multilevel model, extracted the coefficients and sigma hat value, and computed 1000 simulations from each day’s posterior distribution. We then conducted pairwise comparisons for each day’s simulations, concluding that a day’s step count was “significantly” larger if 950 of the 1000 simulations were greater than the comparison day. We used a heat map to visualize the pairwise comparisons; this heat map highlighted days with particularly high and low step counts as being “significantly” different from most of the other days. This allowed us to conduct these comparisons without the expensive adjustments that come with lowering our significance threshold for multiple comparisons!
[Edit – if you’re interested, you can find the data and code for this blog post at this link. Comments, suggestions, or improvements are very welcome!]
Coming Up Next
In the next post, we will move away from classical statistics and talk about machine learning techniques. Specifically, we will use a type of deep learning model to automatically generate band names.
Please correct me if I’m wrong! ↩