Example 2014.3: Allow different variances by group

February 27, 2014
By

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

One common violation of the assumptions needed for linear regression is heterscedasticity by group membership. Both SAS and R can easily accommodate this setting.

Our data today comes from a real example of vitamin D supplementation of milk. Four suppliers claimed that their milk provided 100 IU of vitamin D. The null hypothesis is that they all deliver this accurately, but there was some question about whether the variance was the same between the suppliers. Unfortunately, there are only four observations for each supplier.

SAS
In SAS, we'll read the data in with a simple input statement.
data milk;
input value milk_cat;
cards;
77 1
85 1
91 1
88 1
93 2
101 2
126 2
103 2
103 3
88 3
109 3
85 3
95 4
83 4
91 4
86 4
;;
run;
To fit the model, we'll use the group option to the repeated statement in proc mixed. This is specifically designed to allow differing values for groups sharing the same covariance structure. In this case, it's a simple structure: no correlation, constant value on the diagonal. The key pieces of output are selected out using ODS. To assess variance terms, it's best to use maximum likelihood, rather than REML fitting.
ods select covparms lrt;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution;
repeated/group=milk_cat type = simple;
run;

  Covariance Parameter Estimates

Cov Parm Group Estimate

Residual milk_cat 1 27.1875
Residual milk_cat 2 150.69
Residual milk_cat 3 100.69
Residual milk_cat 4 21.1875


Null Model Likelihood Ratio Test

DF Chi-Square Pr > ChiSq

3 5.13 0.1623

There's not too much evidence to support different variances, but the power is likely quite small, so we'll retain this model when assessing the null hypothesis of equal means. A REML fit ought to be preferable here, but to agree with the R results, we fit again with ML. Note that proc mixed is not smart enough to accurately determine the degrees of freedom remaining (16 observations - 4 variance parameters - 4 mean parameters) so we need to manually specify the denominator degrees of freedom using the ddf option to the model statement.
ods select solutionf tests3;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution ddf=8;
repeated/group=milk_cat type = simple;
run;
                   Solution for Fixed Effects

Standard
Effect milk_cat Estimate Error DF t Value Pr > |t|

Intercept 88.7500 2.3015 12 38.56 <.0001
milk_cat 1 -3.5000 3.4776 8 -1.01 0.3437
milk_cat 2 17.0000 6.5551 8 2.59 0.0319
milk_cat 3 7.5000 5.5199 8 1.36 0.2113
milk_cat 4 0 . . . .


Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

milk_cat 3 8 3.86 0.0563
So there's some reason to suspect that the suppliers may be different.

R
In R, we'll re-type this simple data set and assign the group labels manually.
value = c(77,85,91,88,93,101,126,103,103,88,109,85,95,83,91,86)
mc = as.factor(rep(1:4, each=4))
milk= data.frame(value, mc)
To fit the model with unequal variances, we'll use the gls() function in the nlme library. (Credit department: Ben Bolker discusses this and more complex models in an Rpubs document.) (Note that we use maximum likelihood, as we did in SAS.)
library(nlme)
mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc),
method="ML")
To assess the hypothesis of equal variances, we'll compare to the homoscedasticity model using the anova() function.
mod2 = gls(value~mc, data=milk, method="ML")
anova(mod,mod2)

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod 1 8 125.3396 131.5203 -54.66981
mod2 2 5 124.4725 128.3355 -57.23625 1 vs 2 5.132877 0.1623
This result is identical to SAS, although this is a likelihood ratio test and SAS shows a Wald test.

To assess whether the suppliers are different, we need to compare to the model with just an intercept. When using REML for both models, there was a warning message and a different answer than SAS (using REML in SAS as well). So we'll stick with maximum likelihood here.
mod3  = gls(value~1,data=milk, weights = varIdent(form = ~1|mc), method="ML")
anova(mod3,mod)
With ML in both programs we get the same results.
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod3 1 5 126.8858 130.7488 -58.44292
mod 2 8 125.3396 131.5203 -54.66981 1 vs 2 7.546204 0.0564


An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on his blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.