# R: Nutrient intake data, mixed methods analysis for males

[This article was first published on

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

To recap:**R in the Antipodes**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

- we’ve cleaned the data (see earlier posts)
- we’re up to constructing the output data that will be used to estimate the population distribution values for the particular nutrient we are examining (energy)

A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both

- age bands used for nutrient reference values – there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and
- a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption

#The data equivalency with the SAS macro data:

#SAS macro variable SAS R

#subject seqn RespondentID

#response add_sug IntakeAmt

#repeat drddaycd IntakeDay

#covars_amt {Age groups} AgeFactor Note: no dummies as AgeFactor is an ordered factor

#replicate_var rndw1 SampleWeight Note: used as a frequency variable (“replicate” in SAS), used as weight in R

#seq seq2 IntakeDay Note: no dummy as IntakeDay is an ordered factor

# indwt SampleWeight

#ANALYSIS OF MALE DATA – STEP 1 OF 2

#MIXTRAN

#Delete any unused RespondentID levels that are related to the female subjects

Male.Data$RespondentID <- Male.Data$RespondentID[,drop=TRUE]

#row 419 reduce the data to one record per person to calculate weights

Male.Subset <- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))

#row 446 add the total weights to the data frame

Male.Subset$TotWts <- sum(Male.Subset$SampleWeight)

#row 455 create the subgroup weights, these are all by age only

Male.Grpwt <- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)

names(Male.Grpwt)[names(Male.Grpwt)==”Male.Subset.AgeFactor”] <- "AgeFactor"

names(Male.Grpwt)[names(Male.Grpwt)==”x”] <- "GrpWts"

Male.Subset <- merge(Male.Subset,Male.Grpwt, by=c("AgeFactor"))

#row 472 get the number of persons in total and by subgroup if given

#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output

count <- function(x) {

length(na.omit(x))

}

Numsub <- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count)

names(Numsub)[names(Numsub)==”Male.Subset.AgeFactor”] <- "AgeFactor"

names(Numsub)[names(Numsub)==”x”] <- "NumSubjects"

Numsub$TotSubjects <- sum(Numsub$NumSubjects)

#row 492 merge numsub to _persons

Male.Subset <- merge(Male.Subset,Numsub, by=c("AgeFactor"))

Male.Subset <- Male.Subset[order(Male.Subset$RespondentID),]

#END OF GENERAL SET UP

#convert proc transreg, from row 803

#NOTE: CODE ONLY USED TO IDENTIFY THE LAMBDA VALUE BETWEEN 0.05 AND 1 ASSOCIATED WITH THE MAXIMUM LOG LIKELIHOOD

#fitting a linear model with a BoxCox transformation, on the data that includes replicates

library(MASS)

Male.Boxcox <- as.data.frame(with(Male.Data, {

boxcox(IntakeAmt ~ AgeFactor + IntakeDay,

lambda= seq(0.05, 1, 0.05), plotit=FALSE)

}))

#locate the row containing the largest log likelihood

which.max(Male.Boxcox$y)

Lambda.Value <- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]

#row 818 add boxcox values to Male.Data

Male.Data$BoxCoxXY <- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value

#clean up the R workspace by deleting data frames no longer required

objects()

rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)

#row 840 through nlmixed starting at row 996, do all the data preparation in one hit

#working through nlmixed lines now, prework on variable construction

#not sure why subject is fit as a nonlinear effect, fit using a linear method

#use a model with an intercept for fixed effects

library(lme4)

Male.lme1 <- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),

data = Male.Data,

weights = SampleWeight)

summary(Male.lme1)

plot(fitted(Male.lme1), residuals(Male.lme1))

Male.Data$lmefits <- fitted(Male.lme1)

#Compare with using age as a continuous variable

Male.Boxcox.Age <- as.data.frame(with(Male.Data, {

boxcox(IntakeAmt ~ Age + IntakeDay,

lambda= seq(0.05, 1, 0.05), plotit=FALSE)

}))

which.max(Male.Boxcox.Age$y)

Lambda.Value.Age <- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]

Male.Data.Age$BoxCoxXY <- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age

#Age as continuous fixed effect

Male.lme2 <- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),

data = Male.Data,

weights = SampleWeight)

summary(Male.lme2)

plot(fitted(Male.lme2), residuals(Male.lme2))

Male.Data$Agefits <- fitted(Male.lme2)

Some highlights for me:

- The same transformation was identified as the best for both analyses.
- An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:

Data: Male.Data

Models:

Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)

Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)

Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)

Male.lme2 5 9894.4 9926.4 -4942.2

Male.lme1 7 9966.2 10011.1 -4976.1 0 2 1

The residuals plot for the age band analysis is reasonably nicely behaved given there are >4000 data points, although there could be a slight upwards trend:

The output from the age band lmer analysis is:

Linear mixed model fit by REML

Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)

Data: Male.Data

AIC BIC logLik deviance REMLdev

9994 10039 -4990 9952 9980

Random effects:

Groups Name Variance Std.Dev.

RespondentID (Intercept) 0.19408 0.44055

Residual 0.37491 0.61230

Number of obs: 4498, groups: RespondentID, 2249

Fixed effects:

Estimate Std. Error t value

(Intercept) 13.98016 0.03405 410.6

AgeFactor4to8 0.50572 0.04084 12.4

AgeFactor9to13 0.94329 0.04159 22.7

AgeFactor14to18 1.30654 0.04312 30.3

IntakeDayDay2Intake -0.13871 0.01809 -7.7

Correlation of Fixed Effects:

(Intr) AgFc48 AgF913 AF1418

AgeFactr4t8 -0.775

AgeFctr9t13 -0.761 0.634

AgFctr14t18 -0.734 0.612 0.601

IntkDyDy2In -0.266 0.000 0.000 0.000

To

**leave a comment**for the author, please follow the link and comment on their blog:**R in the Antipodes**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.