How to Estimate Models with Measurement Error for our COVID-19 Indices
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This tutorial gives an overview of the COVID-19 policy indexes just released by the CoronaNet project of which I am a part and the Oxford Government Response Tracker. See the associated paper here. This post shows you how to download the estimated indices and also how to incorporate measurement error into analyses using the eivtools
and brms
R packages.
As shown in our working paper, these six indices (social distancing policies, business restrictions, school restrictions, health resources, health monitoring and mask policies) provide statistical estimates that take into account coding and data limitations. Each index has an associated value between 0 and 100 along with an uncertainty interval that provides the most likely values for the true level of policy activity for that index type. We also provide for each estimate the posterior standard deviation of the estimate, which is an analogue to the standard error in frequentist statistics. These standard deviations are helpful for incorporating measurement error in analyses using existing packages without much further work. It is the aim of this blog post to show how to do that.
If you want to see the full code underlying this post as an Rmarkdown file, just hop over to the site’s Github repo. The Rmarkdown file is in the content/post
folder and the Google mobility data is in the content/post/data
folder.
Load and Merge Data
First, the outcome variable we will look at is the Google cell phone mobility data reported in retail establishments and restaurants. This is a key metric in COVID-19 spread because these kinds of areas can be a hotspot for indoors transmission of the virus.
mobility <- readRDS("data/gmob.rds") %>% filter(is.na(sub_region_1), is.na(sub_region_2), is.na(metro_area)) %>% select(country,date_policy="date", retail="retail_and_recreation_percent_change_from_baseline")
I next load the CoronaNet-OxCGRT indices from this direct download link as a CSV:
indices <- read_csv("https://drive.google.com/uc?export=download&id=1dMCTVPrf-tJyhv_uxr0yAQO-Elx0QOCG")
I also want to point out that we have the full list of indicators (i.e, individual policies as variables) available via a direct download link as well:
indicators <- read_csv("https://drive.google.com/uc?id=1lorcowHNnF0Vl6pxBjMdjTC4yPhHBLJI&export=download")
I’ll display the data to get a sense of what it looks like. First, these are the raw indicators. There are a lot of them–more than 160 variables:
indicators %>% filter(date_policy>ymd("2020-04-01"), date_policy<ymd("2020-04-07"), country=="Afghanistan") %>% select(country,date_policy,social_distance,mask_everywhere,primary_school) %>% knitr::kable() %>% kable_styling()
country | date_policy | social_distance | mask_everywhere | primary_school |
---|---|---|---|---|
Afghanistan | 2020-04-02 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-03 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-04 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-05 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-06 | 0 | 0.1624358 | 4.565445 |
The data is organized in a country-day-policy record format with the different policy indicators in columns. The fractional values indicate a proportion of the population that have the following policy in place (i.e., at the provincial or municipal level). The high value for primary_school
has to do with the fact that this variable naturally has 3 levels, from totally open to totally closed, and when provincial and municipal policies are added it can exceed 3. For more information about the underlying indicators, you can see our codebook.
We can now take a glance at the estimated indices that collapse these indicators into single constructs:
indices %>% filter(date_policy==ymd("2020-04-01"), country=="Afghanistan") %>% knitr::kable() %>% kable_styling()
country | modtype | date_policy | med_est | high_est | low_est | sd_est |
---|---|---|---|---|---|---|
Afghanistan | Business Restrictions | 2020-04-01 | 62.53219 | 67.09647 | 57.94249 | 2.8344479 |
Afghanistan | Health Monitoring | 2020-04-01 | 41.40538 | 48.99068 | 34.40337 | 4.4375788 |
Afghanistan | Health Resources | 2020-04-01 | 46.97723 | 52.57886 | 41.45534 | 3.4747765 |
Afghanistan | Mask Policies | 2020-04-01 | 40.92886 | 48.72028 | 32.89501 | 4.8434060 |
Afghanistan | School Restrictions | 2020-04-01 | 77.33891 | 78.97166 | 75.70822 | 0.9911008 |
Afghanistan | Social Distancing | 2020-04-01 | 61.88964 | 68.51834 | 55.19300 | 3.9687823 |
This dataset is organized somewhat differently. We have one row per index value per index type. The modtype
column records which index is in each row. The columns include estimates of the given index for that day. med_est
is the median posterior estimate; i.e., the most likely estimate. The upper and lower bounds of the 5% - 95% posterior distribution (the uncertainty interval) are the low_est
- high_est
columns. Finally, the posterior standard deviation (standard error) is the sd_est
column.
If you just want “the” value of the index, then med_est
is the best bet. However, it’s best not to ignore the other columns. A simple way to take into account measurement error is to fit three separate models, one with med_est
, another with low_est
, and another with high_est
. If there are important differences across these models than you would know that measurement error in the indices could be affecting your results.
However, there are better ways to incorporate your uncertainty into models, which I will show next. We will use the retail and recreation outcome from the Google mobility as the response we want to understand, and each of our indices as predictors. We will use the sd_est
column to incorporate Normally-distributed measurement error in our models; i.e., we will assume that measurement error is centered around med_est
with a standard deviation of sd_est
. It’s not an exact incorporation of uncertainty as the posterior isn’t a perfect bell curve shape, but it’s pretty good as the posterior is usually at least unimodal if not symmetric.
First I use the eivtools
package to do a what is called an errors-in-variables model. These have been around for some time in economics and other fields, and are relatively easy to estimate. To do so, first I convert our index into two datasets with the posterior median estimates and standard deviations and one column for each index. I can then join these together to make one dataset with both estimates per index:
index_med <- select(indices,country,date_policy,med_est,modtype) %>% spread(key="modtype",value="med_est") index_sds <- select(indices,country,date_policy,sd_est,modtype) %>% spread(key="modtype",value="sd_est") # need to adjust names so that we can bind by column names(index_med)[3:8] <- paste0(names(index_med)[3:8],"_med") %>% str_replace(" ","_") names(index_sds)[3:8] <- paste0(names(index_sds)[3:8],"_sd") %>% str_replace(" ","_") # merge back together index_combined <- left_join(index_med,index_sds,by=c("country","date_policy"))
Finally we will merge our predictors into the Google mobility data:
# we only have index values through Jan 16th, 2021 mobility <- left_join(mobility,index_combined,by=c("country","date_policy")) %>% filter(!is.na(Social_Distancing_sd))
To make the modeling a bit easier, we will only use a subset of the countries in the full data.
mobility <- filter(mobility, country %in% c("United States", "Brazil","China", "United Kingdom", "France", "South Africa","Nigeria", "Indonesia", "Australia","Thailand"))
Conventional Errors-in-Variables
Now we can use the eivreg
as a function for estimating a linear regression model for our outcome, retail
. We need to simplify our uncertainty measures because conventional measurement/errors-in-variables models can only handle one standard deviation/standard error per variable. As such, we will pass in the average posterior standard deviation for each index as our estimate of measurement error Sigma
:
# first create Sigma, the covariance matrix # only has a diagonal value as we don't have covariance estimates across indices # we have to square them first as Sigma is a covariance matrix and we have standard deviations (variance = SD^2) Sigma <- diag(c(mean(mobility$`Business_Restrictions_sd`^2,na.rm=T), mean(mobility$`Health_Monitoring_sd`^2,na.rm=T), mean(mobility$`Health_Resources_sd`^2,na.rm=T), mean(mobility$`Mask_Policies_sd`^2,na.rm=T), mean(mobility$`School_Restrictions_sd`^2,na.rm=T), mean(mobility$`Social_Distancing_sd`^2,na.rm=T))) # add dimnames to identify each variables dimnames(Sigma) <- list(c("Business_Restrictions_med", "Health_Monitoring_med", "Health_Resources_med", "Mask_Policies_med", "School_Restrictions_med", "Social_Distancing_med"), c("Business_Restrictions_med", "Health_Monitoring_med", "Health_Resources_med", "Mask_Policies_med", "School_Restrictions_med", "Social_Distancing_med")) retail_eiv <- eivreg(retail~Business_Restrictions_med + Health_Monitoring_med + Health_Resources_med + Mask_Policies_med + School_Restrictions_med + Social_Distancing_med, data=mobility,Sigma_error=Sigma)
We can then see what our estimated associations are given that we incorporated average standard errors for each index:
summary(retail_eiv) ## ## Call: ## eivreg(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med + ## Health_Resources_med + Mask_Policies_med + School_Restrictions_med + ## Social_Distancing_med, data = mobility, Sigma_error = Sigma) ## ## Error Covariance Matrix ## Business_Restrictions_med Health_Monitoring_med ## Business_Restrictions_med 3.658 0.000 ## Health_Monitoring_med 0.000 9.721 ## Health_Resources_med 0.000 0.000 ## Mask_Policies_med 0.000 0.000 ## School_Restrictions_med 0.000 0.000 ## Social_Distancing_med 0.000 0.000 ## Health_Resources_med Mask_Policies_med ## Business_Restrictions_med 0.000 0.00 ## Health_Monitoring_med 0.000 0.00 ## Health_Resources_med 6.375 0.00 ## Mask_Policies_med 0.000 15.79 ## School_Restrictions_med 0.000 0.00 ## Social_Distancing_med 0.000 0.00 ## School_Restrictions_med Social_Distancing_med ## Business_Restrictions_med 0.000 0.000 ## Health_Monitoring_med 0.000 0.000 ## Health_Resources_med 0.000 0.000 ## Mask_Policies_med 0.000 0.000 ## School_Restrictions_med 1.755 0.000 ## Social_Distancing_med 0.000 8.896 ## ## Residuals: ## Min 1Q Median 3Q Max ## -53.157 -11.276 2.005 12.105 57.204 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 12.45477 1.74730 7.128 1.30e-12 *** ## Business_Restrictions_med 0.27001 0.04336 6.228 5.48e-10 *** ## Health_Monitoring_med 0.02532 0.02871 0.882 0.3779 ## Health_Resources_med 0.32352 0.06208 5.212 2.01e-07 *** ## Mask_Policies_med -0.04313 0.02231 -1.933 0.0533 . ## School_Restrictions_med 0.17267 0.01967 8.779 < 2e-16 *** ## Social_Distancing_med -1.42736 0.04879 -29.254 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## Number of observations used: 2680 ## Latent residual standard deviation: 16.14 ## Latent R-squared: 0.4038, (df-adjusted: 0.4022) ## ## EIV-Adjusted vs Unadjusted Coefficients: ## Adjusted Unadjusted ## (Intercept) 12.45477 10.67932 ## Business_Restrictions_med 0.27001 0.26442 ## Health_Monitoring_med 0.02532 0.05415 ## Health_Resources_med 0.32352 0.21373 ## Mask_Policies_med -0.04313 -0.05209 ## School_Restrictions_med 0.17267 0.14216 ## Social_Distancing_med -1.42736 -1.25473
We see that the social distancing index has the strongest association with reduced retail mobility, followed by the mask policy index. The other indices are associated with increased retail mobility. It is important to note that even with the inclusion of measurement error, the effects are still very precisely estimated due to the amount of data we have.
Bayesian Measurement Models
Next we will do the same exercise, except we will use Bayesian estimation with the R package brms
. The big advantage here is that we do not need to simplify to average posterior standard deviations for our estimate of uncertainty; rather, we can include the standard deviation/standard error for each variable separately. It’s also frankly easier to use this package as we don’t need to take the extra step of making a Sigma matrix.
Instead, we can use the handy me()
function as part of the brms
formula syntax to identify each variable with measurement error:
# read from disk as this can take a while to run if(run_model) { # turn off correlations among latent variables to match what we gave # eivreg brms_formula <- bf(retail ~ me(Business_Restrictions_med,Business_Restrictions_sd) + me(Health_Monitoring_med,Health_Monitoring_sd) + me(Health_Resources_med,Health_Resources_sd) + me(Mask_Policies_med,Mask_Policies_sd) + me(School_Restrictions_med,School_Restrictions_sd) + me(Social_Distancing_med,Social_Distancing_sd), center=TRUE) + set_mecor(FALSE) brms_est <- brm(brms_formula, data=mobility, silent=0, chains=1,save_pars = save_pars(), iter=500, warmup=250, backend='rstan') saveRDS(brms_est,"data/brms_est.rds") } else { brms_est <- readRDS("data/brms_est.rds") }
We can then look at the estimated associations:
summary(brms_est) ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: retail ~ me(Business_Restrictions_med, Business_Restrictions_sd) + me(Health_Monitoring_med, Health_Monitoring_sd) + me(Health_Resources_med, Health_Resources_sd) + me(Mask_Policies_med, Mask_Policies_sd) + me(School_Restrictions_med, School_Restrictions_sd) + me(Social_Distancing_med, Social_Distancing_sd) ## Data: mobility (Number of observations: 2680) ## Samples: 1 chains, each with iter = 500; warmup = 250; thin = 1; ## total post-warmup samples = 250 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI ## Intercept 13.89 1.95 10.20 ## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.28 0.04 0.21 ## meHealth_Monitoring_medHealth_Monitoring_sd 0.06 0.02 0.02 ## meHealth_Resources_medHealth_Resources_sd 0.20 0.05 0.11 ## meMask_Policies_medMask_Policies_sd -0.06 0.02 -0.09 ## meSchool_Restrictions_medSchool_Restrictions_sd 0.14 0.02 0.10 ## meSocial_Distancing_medSocial_Distancing_sd -1.31 0.04 -1.38 ## u-95% CI Rhat Bulk_ESS ## Intercept 17.55 1.00 218 ## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.35 1.00 136 ## meHealth_Monitoring_medHealth_Monitoring_sd 0.11 1.00 119 ## meHealth_Resources_medHealth_Resources_sd 0.28 1.00 119 ## meMask_Policies_medMask_Policies_sd -0.03 1.01 249 ## meSchool_Restrictions_medSchool_Restrictions_sd 0.18 1.06 288 ## meSocial_Distancing_medSocial_Distancing_sd -1.23 1.00 213 ## Tail_ESS ## Intercept 187 ## meBusiness_Restrictions_medBusiness_Restrictions_sd 128 ## meHealth_Monitoring_medHealth_Monitoring_sd 193 ## meHealth_Resources_medHealth_Resources_sd 196 ## meMask_Policies_medMask_Policies_sd 206 ## meSchool_Restrictions_medSchool_Restrictions_sd 147 ## meSocial_Distancing_medSocial_Distancing_sd 162 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 16.22 0.24 15.79 16.68 1.00 573 197 ## ## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1).
The associations are quite similar in magnitude and sign to those we estimated with eivreg
. We estimated that an increase of one unit in the social distancing index is associated with a -1.427 decrease in retail mobility, while with brms
we estimated a -1.31 decrease in retail mobility.
As we can see, for this simple model the two are equivalent, and eivreg
is much faster–it took almost no time to estimate the model, while brms
took 20 minutes. However, brms
is incorporating much more measurement uncertainty, and brms
also allows for a much wider of array of modeling options, such as GLMs (probit/logit/ordered logit), multilevel/hierarchical models, and multiple imputation. The eivreg
is limited to continuous (Gaussian) outcomes and does not allow for interactions with variables with measurement uncertainty. In any case, both of these packages will incorporate measurement uncertainty in their estimates.
Comparison with No Measurement Error
Finally, we can examine what measurement error does to our coefficients’ uncertainty by comparing the results to the standard OLS lm
regression that ignores uncertainty:
lm_est <- lm(retail~Business_Restrictions_med + Health_Monitoring_med + Health_Resources_med + Mask_Policies_med + School_Restrictions_med + Social_Distancing_med, data=mobility) summary(lm_est) ## ## Call: ## lm(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med + ## Health_Resources_med + Mask_Policies_med + School_Restrictions_med + ## Social_Distancing_med, data = mobility) ## ## Residuals: ## Min 1Q Median 3Q Max ## -53.785 -10.103 2.617 11.961 55.443 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.67932 1.97234 5.415 6.69e-08 *** ## Business_Restrictions_med 0.26442 0.03575 7.397 1.85e-13 *** ## Health_Monitoring_med 0.05415 0.02351 2.304 0.02132 * ## Health_Resources_med 0.21373 0.04215 5.071 4.23e-07 *** ## Mask_Policies_med -0.05209 0.01797 -2.899 0.00377 ** ## School_Restrictions_med 0.14216 0.01881 7.557 5.64e-14 *** ## Social_Distancing_med -1.25473 0.03588 -34.971 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 16.66 on 2673 degrees of freedom ## Multiple R-squared: 0.3654, Adjusted R-squared: 0.364 ## F-statistic: 256.6 on 6 and 2673 DF, p-value: < 2.2e-16
We can see that the coefficients are similar though not identical to brms
and eivreg
, and the standard errors are noticeably smaller. The eivreg
estimated that there was 19.44% more uncertainty than the naive lm
estimator, and brms
8.33% more uncertainty. As such, this kind of uncertainty is important to take into account, and it can get much more critical when there is less data available or when are examining non-linear effects such as interactions.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.