# NIT: Fatty acids study in R – Part 006

March 12, 2012

(This article was first published on NIR-Quimiometría, and kindly contributed to R-bloggers)

In the column for constituent C16_0, one sample (57) has a value of zero (we could see this in the histogram). The reason is that the laboratory did not supply this value. The PLS regression will treat the lab value as a real zero, so this sample distorts the resulting plot.
I also observed that sample 219 has a high residual in the regressions for all the constituents, so I decided to remove these two samples from the sample set before developing the models.
To remove these two samples (219 and 57), I create two subsets that skip them:
> fattyac1<-fattyac_msc[1:56,]
> fattyac2<-fattyac_msc[58:218,]
and I combine these two sets again:
> fattyac_msc1<-rbind(fattyac1,fattyac2)
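The same row removal can be sketched in a single step with negative indexing. This is only an illustration on a toy data frame: the object `toy`, its `id` column and the simulated values are assumptions, standing in for `fattyac_msc`, which is assumed here to have 219 rows.

```r
# Toy stand-in for fattyac_msc: 219 rows, hypothetical columns
set.seed(1)
toy <- data.frame(id = 1:219, C16_0 = runif(219, 18, 30))
toy$C16_0[57] <- 0   # simulate the missing lab value recorded as zero

# Locate the zero-valued sample instead of hard-coding its position
zero_rows <- which(toy$C16_0 == 0)

# Drop the zero-valued sample and the high-residual sample (219) together
drop_rows <- c(zero_rows, 219)
toy_clean <- toy[-drop_rows, ]

# Compare with the two-slice approach followed by rbind()
two_step <- rbind(toy[1:56, ], toy[58:218, ])
same_samples <- all(toy_clean$id == two_step$id)
nrow(toy_clean)
```

Negative indexing avoids having to work out the slice boundaries by hand, which becomes more useful as the list of samples to drop grows.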
Well, I can develop my regression now:
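The regression call itself is not shown in this post. As a rough sketch of what a PLS fit with cross-validation can look like, here is an example that assumes the `pls` package is installed and uses simulated data; the objects `X`, `y` and `toy` are made up for illustration and are not the NIT spectra or the lab values.

```r
library(pls)  # assumes the pls package is installed

# Simulated stand-in data: 60 samples, 20 predictor columns (not the NIT spectra)
set.seed(42)
X <- matrix(rnorm(60 * 20), nrow = 60)
y <- as.vector(X[, 1:3] %*% c(1, -0.5, 0.25) + rnorm(60, sd = 0.1))
toy <- data.frame(y = y, X = I(X))   # I() keeps the matrix as a single column

# Fit up to 12 components with leave-one-out cross-validation
fit <- plsr(y ~ X, ncomp = 12, data = toy, validation = "LOO")

# Cross-validated RMSEP for each number of components
rmsep_cv <- RMSEP(fit, estimate = "CV")
print(rmsep_cv)
```

With the real data, the response would be the constituent column (for example C16_0) and the predictor matrix would be the MSC-treated spectra.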

Now we have to decide how many terms to choose. Let's see the validation plot with 7 and 12 components (terms).
plot(C16_0,ncomp=7,which="validation")
abline(0,1,col="red")
plot(C16_0,ncomp=12,which="validation")
abline(0,1,col="red")

It is clear that choosing one model or the other will have a great influence on the predictions. We need a validation set to make a better decision, but I think that the model will work better with 12 terms.
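One way to use a validation set for this decision is to hold out some samples, fit the model on the rest, and compare the prediction error at 7 and 12 components. The sketch below assumes the `pls` package and simulated data; `toy`, `test_idx` and the helper `rmsep_at()` are hypothetical names, and the hold-out scheme (every fourth sample) is just one possible choice.

```r
library(pls)  # assumes the pls package is installed

# Simulated stand-in data (not the fatty acid set)
set.seed(7)
n <- 120
p <- 30
X <- matrix(rnorm(n * p), nrow = n)
y <- as.vector(X[, 1:5] %*% c(1, 0.8, -0.6, 0.4, -0.2) + rnorm(n, sd = 0.2))
toy <- data.frame(y = y, X = I(X))

# Hold out every fourth sample as an independent validation set
test_idx <- seq(4, n, by = 4)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- plsr(y ~ X, ncomp = 12, data = train)

# Root mean squared error of prediction on the held-out samples
rmsep_at <- function(k) {
  pred <- predict(fit, newdata = test, ncomp = k)
  sqrt(mean((test$y - as.vector(pred))^2))
}
rmsep_7  <- rmsep_at(7)
rmsep_12 <- rmsep_at(12)
c(ncomp7 = rmsep_7, ncomp12 = rmsep_12)
```

The number of terms with the lower error on the held-out samples would be the better candidate, provided the validation set covers the same range of values as the calibration set.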
It will be important, if possible, to find samples with C16:0 values below 18 and add them to our database in order to develop a better model.
Another option would be to leave this extreme sample out until we find more like it, but we can also decide to keep it in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one, and we have to take that into account.
If you want to follow this tutorial, please send me an e-mail and I'll send you the "txt" file as an attachment.
