Science at the speed of ligth

October 15, 2013
By

(This article was first published on Marginally significant » Rstats, and kindly contributed to R-bloggers)

May be is not going that fast, but at the speed of R at least. And R is pretty quick. This has pros and cons. I think that understanding the drawbacks is key to maximize the good things of speed, so here are a few examples.

I have a really awful excel file with a dozen sheets calculating simple diversity indexes and network parameters from my dissertation. I also paste in there the output of Atmar and Patterson Nested Calculator (an .exe program!) and of the MATLAB code that Jordi Bascompte kindly send me to run his nestedness null model. I also used Pajek to plot the networks and calculate some extra indices. It took me at least a year to perform all those calculations and believe me, it will take me another year to be able to reproduce those steps again. That was only 6 years ago. Now, I can have nicer plots and calculate way more indexes than I want in less than 5 minutes using the bipartite package, and yes, is fully reproducible. On the other hand I really understood what I was doing, while running bipartite is completely black-boxy for most people.

Last year I also needed a plant phylogeny to test phylogenetic diversity among different communities. I was quite impressed to find the very useful Phylomatic webpage. I only had to prepare the data in the right format and get the tree. Importing the tree to R proved challenging for a newcomer and I had to tweak the tree in Mesquite beforehand. So yes, time-consuming and not reproducible, but I thought it was extremely fast and cool way to get phylogenies. Just one year after that, I can do all that from my R console thanks to the ropensci people (package taxize). Again, faster, easier, but I also need less knowledge on how that phylogeny is built. I am attaching the simple workflow I used below, as it may be useful. Thanks to Scott Chamberlain for advice on how to catch the errors on the retrieved family names.

library(taxize)
#vector with my species
spec <- c("Poa annua", "Abies procera", "Helianthus annuus")

#prepare the data in the right format (including retrieving family name)
names <- lapply(spec, itis_phymat_format, format='isubmit')

#I still have to do manually the ones with errors
names[grep("^na/", names, value = FALSE, perl = TRUE)]
#names[x] <- "family/genus/genus_species/" #enter those manually

#get and plot the tree
tree <- phylomatic_tree(taxa = names, get = "POST", taxnames=FALSE, parallel=FALSE)
tree$tip.label <- capwords(tree$tip.label)
plot(tree, cex = 0.5)

Finally, someone told me he found an old professor’s lab notebook with schedules of daily tasks (sorry I am terrible with details). The time slot booked to perform an ANOVA by hand was a full day! In this case, you really have to think very carefully which analysis you want to do beforehand. Nowadays speed is not an issue to perform most analysis (but our students will still laugh at our slow R code in 5 years!). Speed can help advance science, but with a great power comes great responsibility. Hence, now is more necessary than ever to understand what we do, and why we do it. I highly recommend to read about recent discussions on the use of sensible default values or the problem of increasing researcher degrees of freedom if you are interested in that topic.


To leave a comment for the author, please follow the link and comment on his blog: Marginally significant » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.