Converting a spreadsheet of SMILES: my first OSM contribution

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

I’ve long admired the work of the Open Source Malaria Project. Unfortunately, time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet 😉 We’ll clean up a column name too.

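library(XLConnect)  # provides readWorksheetFromFile()
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"  # replace the unwieldy name of column 5
head(mmv)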


  COMPOUND_ID                                                      Smiles     MW
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):


library(ChemmineR)  # provides smiles2sdf(), cid(), datablock() and write.SDF()
mmv.sdf <- smiles2sdf(mmv$Smiles)

This throws a warning because every molecule in the resulting SDF object has the same CID: an empty string. We set the CIDs from the compound IDs, then use datablock() to attach the properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)
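Since the CIDs are meant to tell molecules apart, it may be worth checking that the compound IDs are present and unique before assigning them. The following is only a base-R sketch: mmv_demo retypes the first three rows of the table above by hand, and check_ids() is a hypothetical helper of mine, not part of ChemmineR.

```r
# First three rows of the table shown earlier, retyped by hand for illustration
mmv_demo <- data.frame(
  COMPOUND_ID = c("MMV668822", "MMV668823", "MMV668824"),
  MW          = c(434.35, 452.79, 306.27),
  EC50        = c(4.01, 0.16, 10.00),
  stringsAsFactors = FALSE
)

# check_ids() is a hypothetical helper, not a ChemmineR function.
# It counts missing or empty IDs and lists any ID that occurs more than once.
check_ids <- function(ids) {
  ids <- as.character(ids)
  list(
    n_missing  = sum(is.na(ids) | ids == ""),
    duplicated = unique(ids[duplicated(ids)])
  )
}

id_report <- check_ids(mmv_demo$COMPOUND_ID)

# Stop early if the IDs would not make usable CIDs
stopifnot(id_report$n_missing == 0, length(id_report$duplicated) == 0)
```

On the real data you would pass mmv$COMPOUND_ID instead of the demo column.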

Now we can write out to an SDF file. We could also use a loop or an apply function to write one file per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)
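The one-file-per-molecule idea could look something like the sketch below. To keep it self-contained I use sdfsample, the example SDFset that ships with ChemmineR, as a stand-in for mmv.sdf; the output folder and the naming scheme (CID plus ".sdf") are my own assumptions.

```r
library(ChemmineR)

# sdfsample is the example SDFset shipped with ChemmineR; it stands in for mmv.sdf
data(sdfsample)
sdf_demo <- sdfsample[1:3]

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "per-molecule")
dir.create(out_dir, showWarnings = FALSE)

# Write each molecule to its own file, named after its CID
ids <- cid(sdf_demo)
for (i in seq_len(length(sdf_demo))) {
  out_file <- file.path(out_dir, paste0(ids[i], ".sdf"))
  write.SDF(sdf_demo[i], file = out_file, cid = TRUE)
}
```

With the real data, replace sdf_demo with mmv.sdf; the files would then be named after the MMV compound IDs.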

It would be nice to do the CML conversion in the same R script, but for now I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in the CML output:

babel -xp mmv-all.sdf mmv-all.cml
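One way to keep everything in a single R script might be to call the same command through base R's system2(). This is an untested sketch rather than what I actually ran: babel_args() is a hypothetical helper, and the code assumes a babel executable is on the PATH.

```r
# babel_args() is a hypothetical helper that only assembles the arguments
# for "babel -xp <infile> <outfile>", quoting the file names for the shell
babel_args <- function(infile, outfile) {
  c("-xp", shQuote(infile), shQuote(outfile))
}

sdf_file <- "mmv-all.sdf"
cml_file <- "mmv-all.cml"
args <- babel_args(sdf_file, cml_file)

# Only run the conversion if babel is installed and the input file exists
babel_path <- Sys.which("babel")
if (nzchar(babel_path) && file.exists(sdf_file)) {
  status <- system2(babel_path, args = args)
  if (status != 0) warning("babel exited with status ", status)
} else {
  message("babel or ", sdf_file, " not found; would have run: babel ",
          paste(args, collapse = " "))
}
```

Checking Sys.which() first means the script reports a missing Open Babel install instead of failing on the system call.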

That’s it; here’s my OSMinformatics Github repository, here’s the output.

Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles
