I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
I use the KDD Cup 2009 data sets as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with. It has 50,000 observations of 15,000 variables. To load it into R you'll need a reasonably beefy machine. My workstation has 16GB of memory; if yours have less then use a sample of the data.
We load the data into R and propose a few ways in which we may identify the columns we need:
#!/usr/bin/Rscript ## zerovar.R  find the fastest way of eliminating observations with zero variance ## © 2010 Allan Engelhardt, http://www.cybaea.net ## Read the data file. ## We have already converted it to R format and saved it, so we can do load("train.RData") ## instead of something like # train < read.delim(file="../orange_large_train.data.bz2") ## Some suggestions for zero variance functions: zv.1 < function(x) { ## The literal approach y < var(x, na.rm = TRUE) return(is.na(y)  y == 0) } zv.2 < function(x) { ## As before, but avoiding direct comparison with zero y < var(x, na.rm = TRUE) return(is.na(y)  y < .Machine$double.eps ^ 0.5) } zv.3 < function(x) { ## Maybe it is faster to check for equality than to compute? y < x[!is.na(x)] return(all(y == y[1])) } zv.4 < function(x) { ## Taking out the special case may speed things up? ## (At least for this data set where this case is common.) z < is.na(x) if ( all(z) ) return(TRUE); y < x[!z] return(all(y == y[1])) }
Now we just have to load the very useful rbenchmark package and let the machine figure it out:
library("rbenchmark") cat("Running benchmarks:\n") benchmark( zv1 = { sapply(train, zv.1) }, zv2 = { sapply(train, zv.2) }, zv3 = { sapply(train, zv.3) }, zv4 = { sapply(train, zv.4) }, replications = 5, columns = c("test", "elapsed", "relative", "sys.self"), order = "elapsed" )
The answer (on my machine) is that it is faster to calculate than to check for equality:
Running benchmarks: test elapsed relative sys.self 1 zv1 78.619 1.000000 6.395 2 zv2 79.276 1.008357 6.586 3 zv3 113.024 1.437617 1.735 4 zv4 118.579 1.508274 1.716
The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.
Can you think of an even faster way to do it?
Jump to comments.
You may also like these posts:

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the underappreciated read.fwf function.

Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE100 constituent companies and find that the relation still holds four years later and across a continent.

A warning on the R save format
The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?

R code for Chapter 2 of NonLife Insurance Pricing with GLM
We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as NonLife Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

R code for Chapter 1 of NonLife Insurance Pricing with GLM
Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is NonLife Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing som…
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...