**Brokering Closure » R**, and kindly contributed to R-bloggers)

In data analysis it happens sometimes that it is neccesary to use *weights*. Contexts

that come to mind include:

- Analysis of data from complex surveys, e.g. stratified samples. Sample inclusion probabilities might have been unequal and thus observations from different strata should have different weights.
- Application of propensity score weighting e.g. to correct data being Missing At Random (MAR).
- Inverse-variance weighting (https://en.wikipedia.org/wiki/Inverse-variance_weighting) when different observations have been measured with different precision which is known apriori.
- We are analyzing data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.
- We are given survey data with post-stratification weights.

If you use, or have been using, SPSS you probably know about the possibility to define one of the variables as weights. This information is used when producing cross-tabulations (cells include sums of weights), regression models and so on. SPSS weights are *frequency weights* in the sense that $w_i$ is the number of observations particular case $i$ represents.

On the other hand, in R `lm`

and `glm`

functions have `weights`

argument that serves a related purpose.

```
suppressMessages(local({
library(dplyr)
library(ggplot2)
library(survey)
library(knitr)
library(tidyr)
library(broom)
}))
```

Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:

```
set.seed(666)
N <- 30 # number of observations
# Aggregated data
aggregated <- data.frame(x=1:5) %>%
mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
freq = as.numeric(table(sample(1:5, N,
replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
)
aggregated
```

```
## x y freq
## 1 1 5 4
## 2 2 8 5
## 3 3 8 8
## 4 4 12 8
## 5 5 10 5
```

```
# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]
```

Visually:

```
ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()
```

Let’s fit some models:

```
models <- list(
ind_lm = lm(y ~ x, data=individuals),
raw_agg = lm( y ~ x, data=aggregated),
ind_svy_glm = svyglm(y~x, design=svydesign(id=~1, data=individuals),
family=gaussian() ),
ind_glm = glm(y ~ x, family=gaussian(), data=individuals),
wei_lm = lm(y ~ x, data=aggregated, weight=freq),
wei_glm = glm(y ~ x, data=aggregated, family=gaussian(), weight=freq),
svy_glm = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
family=gaussian())
)
```

```
## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability
```

In short, we have the following linear models:

`ind_lm`

is a OLS fit to individual data (the*true*model).`ind_agg`

is a OLS fit to aggregated data (definitely wrong).`ind_glm`

is a ML fit to individual data`ind_svy_glm`

is a ML fit to individual data using simple random sampling with replacement design.`wei_lm`

is OLS fit to aggregated data with frequencies as weights`wei_glm`

is a ML fit to aggregated data with frequencies as weights`svy_glm`

is a ML fit to aggregated using “survey” package and using frequencies as weights in the sampling design.

We would expect that models `ind_lm`

, `ind_glm`

, and `ind_svy_glm`

will be identical.

Summarise and gather in long format

```
results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
gather(stat, value, -model, -term)
```

Check if point estimates of model coefficients are identical:

```
results %>% filter(stat=="estimate") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
```

Apart from the “wrong” `raw_agg`

model, the coefficients are identical across models.

Let’s check the inference:

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
```

```
# p-values
results %>% filter(stat=="p.value") %>%
mutate(p=format.pval(value)) %>%
select(model, term, p) %>%
spread(term, p)
```

```
## model (Intercept) x
## 1 ind_lm 3.3265e-07 2.1458e-08
## 2 raw_agg 0.077937 0.068904
## 3 ind_svy_glm 2.1244e-09 2.1330e-08
## 4 ind_glm 3.3265e-07 2.1458e-08
## 5 wei_lm 0.118057 0.085986
## 6 wei_glm 0.118057 0.085986
## 7 svy_glm 0.038154 0.058038
```

Recall, that the correct model is `ind_lm`

. Observations:

`raw_agg`

is clearly wrong, as expected.- Should the
`weight`

argument to`lm`

and`glm`

implement frequency weights, the results for`wei_lm`

and`wei_glm`

will be identical to that from`ind_lm`

. Only the point estimates are correct, all the inference stats are not correct. - The model using design with sampling weights
`svy_glm`

gives correct point estimates, but incorrect inference. - Suprisingly, the model fit with “survey” package to the individual data using simple random sampling design (
`ind_svy_glm`

) does not give identical inference stats to those from`ind_lm`

. They are close though.

Functions weights `lm`

and `glm`

implement *precision weights*: inverse-variance weights that can be used to model differential precision with which the outcome variable was estimated.

Functions in the “survey” package implement *sampling weights*: inverse of the probability of particular observation to be selected from the population to the sample.

Frequency weights are a different animal.

However, it is possible get correct inference statistics for the model fitted to aggregated data using `lm`

with frequency weights supplied as `weights`

. What needs correcting is the degrees of freedom (see also http://stackoverflow.com/questions/10268689/weighted-regression-in-r).

```
models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <- with(models$wei_lm_fixed, sum(weights) - length(coefficients))
results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
gather(stat, value, -model, -term)
```

```
## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit
```

```
# Coefficients
results %>% filter(stat=="estimate") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
## 8 wei_lm_fixed 4.33218 1.474048
```

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
## 8 wei_lm_fixed 0.652395 0.1912751
```

See model `wei_lm_fixed`

. Thus, correcting the degrees of freedom manually gives correct coefficient estimates as well as inference statistics.

# Performance

Aggregating data and using frequency weights can save you quite some time. To illustrate it, let’s generate large data set in a disaggregated and aggregated form.

```
N <- 10^4 # number of observations
# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
)
# Disaggregated data
big_individuals <- aggregated[ rep(1:5, big_aggregated$freq) , c("x", "y") ]
```

… and fit `lm`

models weighting the model on aggregated data. Benchmarking:

```
library(microbenchmark)
speed <- microbenchmark(
big_individual = lm(y ~ x, data=big_individuals),
big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)
)
speed %>% group_by(expr) %>% summarise(median=median(time / 1000)) %>%
mutate( ratio = median / median[1])
```

```
## Source: local data frame [2 x 3]
##
## expr median ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319
```

So quite an improvement.

The improvement is probably the bigger, the more we are able to aggregate the data.

**leave a comment**for the author, please follow the link and comment on their blog:

**Brokering Closure » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...