
Linear models with weighted observations

[This article was first published on Brokering Closure » R, and kindly contributed to R-bloggers.]

In data analysis it sometimes happens that it is necessary to use weights. Contexts
that come to mind include:

- data aggregated to unique covariate patterns, where each row carries a frequency of identical observations;
- samples drawn from a population with unequal probabilities, analysed with sampling weights;
- outcomes measured or estimated with varying precision, handled with precision (inverse-variance) weights.

If you use, or have been using, SPSS you probably know about the possibility of defining one of the variables as weights. This information is used when producing cross-tabulations (cells contain sums of weights), regression models, and so on. SPSS weights are frequency weights in the sense that $w_i$ is the number of observations that case $i$ represents.
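As a toy illustration of that convention (this small data frame is made up for the example, not part of the original post), the same effect can be had in R with xtabs(), which sums a weight variable within each cell:

# Hypothetical toy data: each row stands for `w` identical cases
d <- data.frame(group = c("a", "a", "b"), w = c(2, 3, 5))

# Cell values are sums of the weights, i.e. the number of cases represented
xtabs(w ~ group, data = d)

## group
## a b 
## 5 5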

On the other hand, in R the lm and glm functions have a weights argument that serves a related purpose.

suppressMessages(local({
  library(dplyr)
  library(ggplot2)
  library(survey)
  library(knitr)
  library(tidyr)
  library(broom)
}))

Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:

set.seed(666)

N <- 30 # number of observations

# Aggregated data
aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, 
                replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
          )
aggregated

##   x  y freq
## 1 1  5    4
## 2 2  8    5
## 3 3  8    8
## 4 4 12    8
## 5 5 10    5

# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]

Visually:

ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()

Let’s fit some models:

models <- list( 
               ind_lm = lm(y ~ x, data=individuals),
               raw_agg = lm( y ~ x, data=aggregated),
               ind_svy_glm = svyglm(y~x, design=svydesign(id=~1, data=individuals),
                                 family=gaussian() ),
               ind_glm = glm(y ~ x, family=gaussian(), data=individuals),
               wei_lm = lm(y ~ x, data=aggregated, weights=freq),
               wei_glm = glm(y ~ x, data=aggregated, family=gaussian(), weights=freq),
               svy_glm = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
                                family=gaussian())
               )

## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability

In short, we have the following linear models:

- ind_lm: lm() fitted to the individual-level data
- raw_agg: lm() fitted to the aggregated data, ignoring the frequencies
- ind_svy_glm: svyglm() fitted to the individual-level data, with a design assuming equal probabilities
- ind_glm: glm() with a Gaussian family fitted to the individual-level data
- wei_lm: lm() fitted to the aggregated data with freq supplied as weights
- wei_glm: glm() with a Gaussian family fitted to the aggregated data with freq supplied as weights
- svy_glm: svyglm() fitted to the aggregated data, with a design using freq as sampling weights

We would expect models ind_lm, ind_glm, and ind_svy_glm to be identical.

Summarise and gather in long format

results <- do.call("rbind", lapply(names(models), function(n) cbind(model = n, tidy(models[[n]])))) %>%
  gather(stat, value, -model, -term)
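As an aside, on a more recent tidyverse the same reshaping can be written with purrr and pivot_longer(); this is just an equivalent sketch, assuming those packages are available, not something the rest of the post depends on:

library(purrr)

results_alt <- map_dfr(models, tidy, .id = "model") %>%
  pivot_longer(-c(model, term), names_to = "stat", values_to = "value")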

Check if point estimates of model coefficients are identical:

results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)

##         model (Intercept)        x
## 1      ind_lm     4.33218 1.474048
## 2     raw_agg     4.40000 1.400000
## 3 ind_svy_glm     4.33218 1.474048
## 4     ind_glm     4.33218 1.474048
## 5      wei_lm     4.33218 1.474048
## 6     wei_glm     4.33218 1.474048
## 7     svy_glm     4.33218 1.474048

Apart from the “wrong” raw_agg model, which ignores the frequencies altogether, the point estimates are identical across all models.

Let’s check the inference:

# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)

##         model (Intercept)         x
## 1      ind_lm    0.652395 0.1912751
## 2     raw_agg    1.669331 0.5033223
## 3 ind_svy_glm    0.500719 0.1912161
## 4     ind_glm    0.652395 0.1912751
## 5      wei_lm    1.993100 0.5843552
## 6     wei_glm    1.993100 0.5843552
## 7     svy_glm    1.221133 0.4926638

# p-values
results %>% filter(stat=="p.value") %>%
  mutate(p=format.pval(value)) %>%
  select(model, term, p) %>%
  spread(term, p)

##         model (Intercept)          x
## 1      ind_lm  3.3265e-07 2.1458e-08
## 2     raw_agg    0.077937   0.068904
## 3 ind_svy_glm  2.1244e-09 2.1330e-08
## 4     ind_glm  3.3265e-07 2.1458e-08
## 5      wei_lm    0.118057   0.085986
## 6     wei_glm    0.118057   0.085986
## 7     svy_glm    0.038154   0.058038

Recall that the correct model is ind_lm. Observations:

- The weights argument of lm and glm implements precision weights: inverse-variance weights that can be used to model the differential precision with which the outcome variable was estimated.

- Functions in the “survey” package implement sampling weights: the inverse of the probability that a particular observation is selected from the population into the sample.

- Frequency weights, in which $w_i$ is simply the number of observations a row represents, are a different animal; none of the above treats them as such (see the quick check below).
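A quick way to see that lm treats weights as precision weights rather than as frequencies (an illustrative check, not in the original post): rescaling the weights by a constant changes neither the coefficients nor the standard errors, because only the relative weights enter the fit, while the residual degrees of freedom stay at the number of rows minus the number of coefficients.

# Same relative weights, but they now sum to 1 instead of 30
wei_lm_scaled <- lm(y ~ x, data = aggregated, weights = freq / sum(freq))

# Coefficients and standard errors are identical to those of wei_lm
summary(models$wei_lm)$coefficients
summary(wei_lm_scaled)$coefficients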

However, it is possible to get correct inference statistics for the model fitted to aggregated data using lm with frequency weights supplied as weights. What needs correcting is the residual degrees of freedom: it should be the total number of represented observations (the sum of the weights) minus the number of coefficients, i.e. 30 - 2 = 28, rather than the number of aggregated rows minus the number of coefficients (see also http://stackoverflow.com/questions/10268689/weighted-regression-in-r).

models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <- with(models$wei_lm_fixed, sum(weights) - length(coefficients))
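As a sanity check (not in the original post), after the correction the weighted fit should reproduce the standard errors of the individual-level fit up to floating-point tolerance:

# Should return TRUE; suppressWarnings() silences the "not an lm fit" warning
all.equal(
  suppressWarnings(tidy(models$wei_lm_fixed))$std.error,
  tidy(models$ind_lm)$std.error
)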

results <- do.call("rbind", lapply(names(models), function(n) cbind(model = n, tidy(models[[n]])))) %>%
  gather(stat, value, -model, -term)

## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit

# Coefficients
results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)

##          model (Intercept)        x
## 1       ind_lm     4.33218 1.474048
## 2      raw_agg     4.40000 1.400000
## 3  ind_svy_glm     4.33218 1.474048
## 4      ind_glm     4.33218 1.474048
## 5       wei_lm     4.33218 1.474048
## 6      wei_glm     4.33218 1.474048
## 7      svy_glm     4.33218 1.474048
## 8 wei_lm_fixed     4.33218 1.474048

# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)

##          model (Intercept)         x
## 1       ind_lm    0.652395 0.1912751
## 2      raw_agg    1.669331 0.5033223
## 3  ind_svy_glm    0.500719 0.1912161
## 4      ind_glm    0.652395 0.1912751
## 5       wei_lm    1.993100 0.5843552
## 6      wei_glm    1.993100 0.5843552
## 7      svy_glm    1.221133 0.4926638
## 8 wei_lm_fixed    0.652395 0.1912751

See the wei_lm_fixed model: correcting the degrees of freedom manually gives correct inference statistics on top of the already-correct coefficient estimates.

Performance

Aggregating data and using frequency weights can save quite some time. To illustrate this, let's generate a large data set in both disaggregated and aggregated form.

N <- 10^4 # number of observations

# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
          )

# Disaggregated data
big_individuals <- big_aggregated[ rep(1:5, big_aggregated$freq) , c("x", "y") ]

… and fit lm models on the disaggregated data and on the aggregated data with frequency weights. Benchmarking:

library(microbenchmark)

speed <- microbenchmark(
  big_individual = lm(y ~ x, data=big_individuals),
  big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)
)

# Median timings (microbenchmark records nanoseconds, so time / 1000 is in microseconds)
speed %>% group_by(expr) %>% summarise(median = median(time / 1000)) %>%
  mutate(ratio = median / median[1])

## Source: local data frame [2 x 3]
## 
##             expr   median     ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319

So quite an improvement.

The improvement will probably be greater the more the data can be aggregated.
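If the data does not come pre-aggregated, the aggregation step itself is cheap. Here is a hedged sketch with dplyr (not from the original post): collapse individual rows into unique (x, y) patterns and reapply the weighted fit with the degrees-of-freedom correction from above.

# Collapse individual rows into unique covariate patterns with their frequencies
collapsed <- big_individuals %>%
  count(x, y) %>%      # adds a column `n` with the number of identical rows
  rename(freq = n)

fit <- lm(y ~ x, data = collapsed, weights = freq)
fit$df.residual <- sum(collapsed$freq) - length(coef(fit))  # df correction as above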
