**R in the Antipodes**, and kindly contributed to R-bloggers)

To recap:

- we’ve cleaned the data (see earlier posts)
- we’re up to constructing the output data that will be used to estimate the population distribution values for the particular nutrient we are examining (energy)

The code below is my implementation of the mixtran sas macro for amount, where the nutrient is consumed daily. This means that a bunch of the macro code is irrelevant for the current analysis, as we don’t need to estimate daily consumption probabilities, or apply them in the analysis, because probability of consumption = 1.

A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both

- age bands used for nutrient reference values – there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and
- a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption

The R code is below. Code lines mentioned relate to the line in the mixtran macro linked above. Note that this method is using a linear mixed effects model on the data, the SAS macro uses a nonlinear mixed effects model, but I am currently unsuccessful in duplicating that exact model into R. The data that I have is much less complex from a covariates perspective compared to the data that was analysed in SAS, so at this point I am not too worried about the change in this part of the method.

#The data equivalency with the SAS macro data:

#SAS macro variable SAS R

#subject seqn RespondentID

#response add_sug IntakeAmt

#repeat drddaycd IntakeDay

#covars_amt {Age groups} AgeFactor Note: no dummies as AgeFactor is an ordered factor

#replicate_var rndw1 SampleWeight Note: used as a frequency variable (“replicate” in SAS), used as weight in R

#seq seq2 IntakeDay Note: no dummy as IntakeDay is an ordered factor

# indwt SampleWeight

#ANALYSIS OF MALE DATA – STEP 1 OF 2

#MIXTRAN

#Delete any unused RespondentID levels that are related to the female subjects

Male.Data$RespondentID <- Male.Data$RespondentID[,drop=TRUE]

#row 419 reduce the data to one record per person to calculate weights

Male.Subset <- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))

#row 446 add the total weights to the data frame

Male.Subset$TotWts <- sum(Male.Subset$SampleWeight)

#row 455 create the subgroup weights, these are all by age only

Male.Grpwt <- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)

names(Male.Grpwt)[names(Male.Grpwt)==”Male.Subset.AgeFactor”] <- “AgeFactor”

names(Male.Grpwt)[names(Male.Grpwt)==”x”] <- “GrpWts”

Male.Subset <- merge(Male.Subset,Male.Grpwt, by=c(“AgeFactor”))

#row 472 get the number of persons in total and by subgroup if given

#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output

count <- function(x) {

length(na.omit(x))

}

Numsub <- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count)

names(Numsub)[names(Numsub)==”Male.Subset.AgeFactor”] <- “AgeFactor”

names(Numsub)[names(Numsub)==”x”] <- “NumSubjects”

Numsub$TotSubjects <- sum(Numsub$NumSubjects)

#row 492 merge numsub to _persons

Male.Subset <- merge(Male.Subset,Numsub, by=c(“AgeFactor”))

Male.Subset <- Male.Subset[order(Male.Subset$RespondentID),]

#END OF GENERAL SET UP

#convert proc transreg, from row 803

#NOTE: CODE ONLY USED TO IDENTIFY THE LAMBDA VALUE BETWEEN 0.05 AND 1 ASSOCIATED WITH THE MAXIMUM LOG LIKELIHOOD

#fitting a linear model with a BoxCox transformation, on the data that includes replicates

library(MASS)

Male.Boxcox <- as.data.frame(with(Male.Data, {

boxcox(IntakeAmt ~ AgeFactor + IntakeDay,

lambda= seq(0.05, 1, 0.05), plotit=FALSE)

}))

#locate the row containing the largest log likelihood

which.max(Male.Boxcox$y)

Lambda.Value <- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]

#row 818 add boxcox values to Male.Data

Male.Data$BoxCoxXY <- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value

#clean up the R workspace by deleting data frames no longer required

objects()

rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)

#row 840 through nlmixed starting at row 996, do all the data preparation in one hit

#working through nlmixed lines now, prework on variable construction

#not sure why subject is fit as a nonlinear effect, fit using a linear method

#use a model with an intercept for fixed effects

library(lme4)

Male.lme1 <- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),

data = Male.Data,

weights = SampleWeight)

summary(Male.lme1)

plot(fitted(Male.lme1), residuals(Male.lme1))

Male.Data$lmefits <- fitted(Male.lme1)

#Compare with using age as a continuous variable

Male.Boxcox.Age <- as.data.frame(with(Male.Data, {

boxcox(IntakeAmt ~ Age + IntakeDay,

lambda= seq(0.05, 1, 0.05), plotit=FALSE)

}))

which.max(Male.Boxcox.Age$y)

Lambda.Value.Age <- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]

Male.Data.Age$BoxCoxXY <- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age

#Age as continuous fixed effect

Male.lme2 <- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),

data = Male.Data,

weights = SampleWeight)

summary(Male.lme2)

plot(fitted(Male.lme2), residuals(Male.lme2))

Male.Data$Agefits <- fitted(Male.lme2)

Some highlights for me:

- The same transformation was identified as the best for both analyses.
- An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:

anova(Male.lme1,Male.lme2)

Data: Male.Data

Models:

Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)

Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)

Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)

Male.lme2 5 9894.4 9926.4 -4942.2

Male.lme1 7 9966.2 10011.1 -4976.1 0 2 1

The residuals plot for the age band analysis is reasonably nicely behaved given there are >4000 data points, although there could be a slight upwards trend:

The output from the age band lmer analysis is:

**leave a comment**for the author, please follow the link and comment on their blog:

**R in the Antipodes**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...