The stupidest R code ever

August 17, 2011
(This article was first published on Statistics, genetics, programming, academics » R, and kindly contributed to R-bloggers)

Let’s start this blog off right, with the stupidest R mistake I’ve ever made (I think).

In the R package that I write, R/qtl, one of the main file formats is a comma-delimited file, where the blank cells in the second row are important, as they distinguish the initial phenotype columns from the genetic marker columns.

I’d gotten some reports that if there were many phenotypes, the import of such a file could take an extremely long time. I ignored the problem (as it wasn’t a problem for me), but eventually it did become a problem for me, and when I investigated, I found the following code.

# determine number of phenotypes based on initial blanks in row 2
n <- ncol(data)
temp <- rep(FALSE,n)
for(i in 1:n) {
  temp[i] <- all(data[2,1:i]=="")
  if(!temp[i]) break
}
if(!any(temp)) # no phenotypes!
  stop("You must include at least one phenotype (e.g., an index).")
n.phe <- max((1:n)[temp])

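Here data is the input matrix, and I use a for loop over columns: at each column i, I check whether all of the cells in row 2 from column 1 through column i are empty, and I stop at the first column where they aren't. That means every pass re-examines all of the cells I'd already checked on the previous pass. If you can understand the code above, I’m sure you’ll agree that it is really stupid. I think the code was in the package for at least five years, possibly even eight.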
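As a rough illustration of how much redundant work that is, the sketch below counts the row-2 cells the loop examines for the 1500-phenotype example. The variable names are made up for this illustration, and the count covers only cell comparisons, so it says nothing definite about where the actual running time went.

```r
# Phenotype count taken from the example file described in the post.
n_phe <- 1500

# The loop runs for i = 1, ..., n_phe + 1 (it breaks at the first marker
# column) and examines cells 1..i of row 2 on each pass.
cells_loop <- sum(seq_len(n_phe + 1))

cells_loop  # 1127251
```

A single pass over row 2 would instead look at each column once.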

For a file with 200 individuals and 1500 phenotypes, it would take about 60 seconds to load; after the fix (below), it took about 2 seconds. I spent 58 seconds looking for the first non-blank cell in the second row!

In April 2009, I fixed it (see the commit at the GitHub repository) by replacing the above with the following.

if(data[2,1] != "")
  stop("You must include at least one phenotype (e.g., an index).")
n.phe <- min(which(data[2,] != ""))-1
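To see that the two versions agree, here is a minimal sketch on a tiny made-up matrix with three phenotypes. The toy data and the wrapper names n_phe_loop and n_phe_which are hypothetical and not part of R/qtl; the function bodies just repeat the two snippets from this post.

```r
# Toy stand-in for an imported file (all values invented):
# row 1 holds column names; row 2 is blank under the phenotype
# columns and non-blank under the marker columns.
data <- rbind(
  c("id", "weight", "sex", "D1M1", "D1M2"),
  c("",   "",       "",    "1",    "1"),
  c("1",  "20.1",   "f",   "A",    "H")
)

# Original approach: rescan cells 1..i of row 2 at every step.
n_phe_loop <- function(data) {
  n <- ncol(data)
  temp <- rep(FALSE, n)
  for (i in 1:n) {
    temp[i] <- all(data[2, 1:i] == "")
    if (!temp[i]) break
  }
  if (!any(temp)) # no phenotypes!
    stop("You must include at least one phenotype (e.g., an index).")
  max((1:n)[temp])
}

# Replacement: one vectorized pass over row 2.
n_phe_which <- function(data) {
  if (data[2, 1] != "")
    stop("You must include at least one phenotype (e.g., an index).")
  min(which(data[2, ] != "")) - 1
}

n_phe_loop(data)   # 3
n_phe_which(data)  # 3
```

One difference worth knowing about: if row 2 were entirely blank, which() would return an empty vector and min() would give Inf with a warning, whereas the loop would return the number of columns. Whether such a file can reach this point in the import code is not something this sketch can say.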

If you don’t quite understand what I’m talking about, here’s a picture of one of these comma-delimited files; this one has three phenotypes.

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

