# Converting a spreadsheet of SMILES: my first OSM contribution

June 30, 2014
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"

COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62


What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):

library(ChemmineR)
library(ChemmineOB)

mmv.sdf   <- smiles2sdf(mmv$Smiles)  That will throw a warning, since all molecules in the SDF object have the same CID; currently, no CID (empty string). We add the CID using the compound ID, then use datablock() to add properties: cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)


Now we can write out to a SDF file. We could also use a loop or an apply function to write individual files per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)


It would be nice to stay in the one R script for conversion to CML too but for now, I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:

babel -xp mmv-all.sdf mmv-all.cml


That’s it; here’s my OSMinformatics Github repository, here’s the output.

Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles