The CDK/Metabolomics/Chemometrics Unconference results

Posted on April 7, 2008 by Egon Willighagen in Uncategorized | 0 Comments

[This article was first published on chem-bla-ics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As announced earlier, Miguel, Velitchka, Christoph and I held a small CDK/Metabolomics/Chemometrics unconference. We started late, and did not have an evening program, resulting in not overly much results. However, we did do molecular chemometrics.

We used the R statistics software together with Rajarshi’s rcdk package (an R wrapper around the CDK library) and Ron’s (my PhD supervisor) PLS package (see this paper), to predict retention indices for a number of metabolites.

We ended up with this R script:

library("rJava")
library("rcdk")
library("pls")
mols = load.molecules("data_cdk.sdf")
selection = get.desc.names()
selection = selection[-which(selection=="org.openscience.cdk.qsar.descriptors.molecular.AminoAcidCountDescriptor")]
x = eval.desc(mols, selection, verbose=TRUE)
x2 = x[,apply(x, 2, function(a) {all(!is.na(a))})]
y = read.table("data_cdk_RI")
input = data.frame(x2, y)
pls.model = plsr(V1 ~ ., 50, data=input, validation="CV")
summary(pls.model)
plot(RMSEP(pls.model))
plot(pls.model, ncomp=20)
abline(0,1, col="red")
plot(pls.model, "loadings", comps=1:2)
savehistory("finalHistory.R")

The AminoAcidCountDescriptor threw us a NullPointerException and there were a few NAs in the resulting matrix. The CV results were not so good as Velitchka’s best models, but still a good start:

No variable selection; 200 objects, 190 variables.

Questions:

Can we do this in Bioclipse2 too?
Can we improve the default CDK descriptor parameters to maximize the column count?
Rajarshi, what would be involved to write some wrapper code for atomic descriptors for rcdk?

To leave a comment for the author, please follow the link and comment on their blog: chem-bla-ics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

The CDK/Metabolomics/Chemometrics Unconference results

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)