(This article was first published on

In a previous blog it was shown using literature data that liking of apples was related to juiciness. However, there were some questions**Wiekvoet**, and kindly contributed to R-bloggers)- Is the relation linear or slightly curved?
- The variation in liking around CJuiciness is large. Are more explanatory variables needed?
- So, what drives CJuiciness?

Error in average scores

The paper of Peneau et al. uses Tukey's post hoc test to examine the differences between the products. The test is performed within the weeks. First we use the data to retrieve at what difference the test shows significant differences.

library(xlsReadWrite)

library(plyr)

library(ggplot2)

library(locfit)

#get data as before

datain <- read.xls('condensed.xls')

#remove storage conditions

datain <- datain[-grep('bag|net',datain$Products,ignore.case=TRUE),]

#create week variable

datain$week <- sapply(strsplit(as.character(datain$Product),'_'),

function(x) x[[2]])

#function to extract pairwise numerical differences and significances

extract.diff <- function(descriptor) {

diffs1 <- ldply(c('W1','W2'),function(Week) {

value <-

as.numeric(gsub('[[:alpha:]]','',

datain[,descriptor]))[datain$week==Week]

sig.dif <-

gsub('[[:digit:].]','',datain[,descriptor])[datain$week==Week]

dif.mat <- outer(value,value,'-')

sig.mat <- outer(sig.dif,sig.dif,function(X,Y) {

sapply(1:length(X),function(i) {

g1 <- grep(paste('[',X[i],']'),Y[i])

length(g1>0)

})

})

data.frame(dif.val = as.vector(dif.mat), dif.sig =

as.vector(sig.mat),Week=Week,descriptor=descriptor)

})

diffs1

}

likedif <- extract.diff("CLiking")

likedif <- likedif[likedif$dif.val>=0,]

g <- ggplot(likedif, aes(dif.val,dif.sig))

g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))

The plot shows a difference of between 0.24 and 0.26 is enough to be significant. For Juiciness, the pattern is the same:

likedif <- extract.diff("CJuiciness")

likedif <- likedif[likedif$dif.val>=0,]

g <- ggplot(likedif, aes(dif.val,dif.sig))

g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))

In juiciness a difference of 0.24 is enough to be significant. Given all, a difference of 0.24 can be used for both variables.

Liking vs. Juiciness

The plot with errors is easy to make. For completeness a local fit is added.

dataval <- datain

vars <- names(dataval)[-1]

for (descriptor in vars) {

dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]','',dataval[,descriptor]))

}

l1 <- locfit(CLiking ~ lp(CJuiciness,nn=1),data=dataval)

topred <- data.frame(CJuiciness=seq(3.6,4.8,.1))

topred$CLiking <- predict(l1,topred)

g <- ggplot(dataval,aes(CJuiciness,CLiking))

g <- g + geom_point() + geom_errorbar(aes(ymin=CLiking-.24,ymax=CLiking+.24))

g <- g + geom_errorbarh(aes(xmin=CJuiciness-.24,xmax=CJuiciness+.24))

g <- g + geom_line(data=topred,colour='blue')

g

Both the local fit and the errors suggest that curvature is interesting to pursue. On top of that, a linear relation has as implication that any increase in juiciness is good. In general an optimum level is expected. Compare with sugar, if you like two lumps of sugar, two is dislike as not sweet enough, while four is too sweet. Hence, again all reason for curvature.

Regarding inclusion of extra variables, the data shows that the products Juiciness 4.1 are almost significant different. Given that this significance is Tukey HSD, a difference on the one point with much lower liking is probably relevant. On the other hand, the data is fairly well described by curvature, so one extra explaining variable should be enough.

Adding an extra explaining variable

The two prime candidates as second explaining variable are according to the previous calculation CSweetness and CMealiness. In general one would expect apples that are juicy to not be mealy so there is a reason to avoid CMealiness. Nevertheless both are investigated.

l1 <- locfit(CLiking ~ lp(CJuiciness,CSweetness,nn=1.3),data=dataval)

topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CSweetness=seq(3.2,3.9,.1))

topred$CLiking <- predict(l1,topred)

v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))

v <- v + stat_contour(aes(colour= ..level..) )

v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CSweetness))

l1 <- locfit(CLiking ~ lp(CJuiciness,CMealiness,nn=1.3),data=dataval)

topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CMealiness=seq(1.4,2.1,.1))

topred$CLiking <- predict(l1,topred)

v <- ggplot(topred, aes(CJuiciness, CMealiness, z = CLiking))

v <- v + stat_contour(aes(colour= ..level..) )

v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CMealiness))

The link between CMealiness and CJuiciness is quite strong. It is also clear that CMealiness does not explain the large difference in liking at CJuiciness 4.1. Hence CSweetness is chosen. Not the best of statistical reasons, but all in all it feels like the better model

Simplified linear model

Finally, even though I like the local model, it is more convenient to use a simple linear model. After reduction of non-significant terms, only three factors are left. CJuiciness, CJuiciness^2 and CSweetness.

l1 <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) + I(CSweetness^2),data=dataval)

summary(l1)

Call:

lm(formula = CLiking ~ CJuiciness * CSweetness + I(CJuiciness^2) +

I(CSweetness^2), data = dataval)

Residuals:

Min 1Q Median 3Q Max

-0.12451 -0.02432 0.01002 0.02591 0.06914

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -6.5326 19.8267 -0.329 0.7475

CJuiciness 6.7662 4.0414 1.674 0.1199

CSweetness -2.5410 8.7807 -0.289 0.7772

I(CJuiciness^2) -0.8638 0.3564 -2.424 0.0321 *

I(CSweetness^2) 0.1264 0.9538 0.133 0.8968

CJuiciness:CSweetness 0.3321 0.6574 0.505 0.6225

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05827 on 12 degrees of freedom

Multiple R-squared: 0.9107, Adjusted R-squared: 0.8735

F-statistic: 24.48 on 5 and 12 DF, p-value: 6.589e-06

l1 <- lm(CLiking ~ CJuiciness+CSweetness + I(CJuiciness^2) ,data=dataval)

summary(l1)

Call:

lm(formula = CLiking ~ CJuiciness + CSweetness + I(CJuiciness^2),

data = dataval)

Residuals:

Min 1Q Median 3Q Max

-0.125337 -0.023834 0.004955 0.024922 0.087851

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -14.3609 5.1169 -2.807 0.01400 *

CJuiciness 8.5997 2.3715 3.626 0.00275 **

CSweetness -0.2683 0.1104 -2.430 0.02916 *

I(CJuiciness^2) -0.9428 0.2795 -3.373 0.00455 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0547 on 14 degrees of freedom

Multiple R-squared: 0.9082, Adjusted R-squared: 0.8885

F-statistic: 46.17 on 3 and 14 DF, p-value: 1.654e-07

topred <- expand.grid(CJuiciness=seq(3.8,4.6,.05),CSweetness=seq(3.2,3.9,.05))

topred$CLiking <- predict(l1,topred)

v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))

v <- v + stat_contour(aes(colour= ..level..) )

v + geom_point(data=dataval,stat='identity',position='identity',aes(CJuiciness,CSweetness))

The resulting plot shows quite some difference with the local fit, but this is mostly at the regions without data.

Hence it is concluded that liking of apples depends mainly on Juiciness, and somewhat on Sweetness. Above a certain Juiciness no gain is to be made. Lower sweetness gives better liking

Discussion

It is a bit disappointing that the error in CJuiciness and CSweetness is not incorporated in the model. Unfortunately, this is easier said than done. The keyword here is Total Least Squares also named Deming regression. Unfortunately these are only viable if two variables are regressed. Curvature is also outside of the scope.

In addition, this leads to the question, what is error?. The 'error' between the scores consists of different parts. Differences between slices of apple, differences in sensory perception and different ways to score. Regarding the model, differences in slices of apple and sensory perception are counted in liking, while scoring error is not. Ideally, these would be split. This is rather a tall order.

- what drives CJuiciness?
- what drives CSweetness?

To

**leave a comment**for the author, please follow the link and comment on his blog:**Wiekvoet**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...