# bigkmeans also works well for ordinary matrix objects: The biganalytics package

May 4, 2011

bigmemory is an excellent package for handling large matrices in R. “The Bigmemory Project” provides several sister packages: biganalytics for analysis, bigtabulate for tabulation, bigalgebra for linear algebra functionality, and synchronicity for synchronization via mutexes, interprocess communication, and message passing.

biganalytics provides a few functions for analysis: linear regression models, generalized linear models, and clustering. In this post, I would like to focus on clustering, namely the bigkmeans function. There are several k-means algorithms, for example the Hartigan-Wong, Lloyd, Forgy, and MacQueen methods. bigkmeans implements the last. The authors say in the manual that bigkmeans also works for ordinary matrix objects. Where does bigkmeans outperform ordinary kmeans? I decided to run an experiment.
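For reference, base R's kmeans exposes all four of these methods through its algorithm argument. The following is a minimal sketch that runs each of them on a small synthetic matrix. The data and the object names x, algorithms, and fits are made up for illustration and are unrelated to the experiment below.

```r
# Synthetic data: two well-separated Gaussian blobs in 5 dimensions
# (illustrative only, not the Gisette data).
set.seed(1)
x <- rbind(matrix(rnorm(100 * 5, mean = 0), ncol = 5),
           matrix(rnorm(100 * 5, mean = 5), ncol = 5))

# The four algorithm names accepted by stats::kmeans.
algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

# Fit k-means with k = 2 once per algorithm.
fits <- lapply(algorithms, function(a) {
  kmeans(x, centers = 2, iter.max = 50, nstart = 5, algorithm = a)
})
names(fits) <- algorithms

# Compare the total within-cluster sum of squares across algorithms.
sapply(fits, function(f) f$tot.withinss)
```

On data this cleanly separated the algorithms should agree closely; differences tend to show up in speed and in behaviour on harder data.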

The “Gisette Data Set” was used. This dataset is well known as a handwritten digit recognition problem and was one of the datasets of the NIPS 2003 feature selection challenge. It contains 13,500 records and 5,000 features.

The experiments were conducted under two conditions:
1. kmeans with data.frame
2. bigkmeans with matrix

Here is the source code.
First of all, load the biganalytics package and set the parameters for running the k-means algorithms.

```r
library(biganalytics)

# conditions for running the k-means algorithms
size <- c(1000, 3000, 5000, 7500, 10000, 11000)
centers <- 2
iter.max <- 50
nstart <- 100
algorithm <- "MacQueen"
nsize <- length(size)
```

Second, read the dataset as a data.frame object, and also convert it to a matrix object for use in bigkmeans. Please note that the data file was generated by combining “gisette_train.data”, “gisette_test.data”, and “gisette_valid.data”.

```r
# read data
gisette.km <- read.table("../data/gisette_all.data", sep="",
                         header=FALSE)
gisette.bkm <- as.matrix(gisette.km)
```
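The post does not show how the three Gisette files were combined. One way it could be done is sketched below, assuming the files are whitespace-delimited with the same number of columns. combine_files is a hypothetical helper, and the demonstration uses two tiny temporary files in place of the real data.

```r
# Hypothetical helper: read several whitespace-delimited files
# and stack their rows into a single data.frame.
combine_files <- function(paths) {
  parts <- lapply(paths, read.table, sep = "", header = FALSE)
  do.call(rbind, parts)
}

# Demonstration with two tiny temporary files standing in for the real ones.
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("1 2 3", "4 5 6"), f1)
writeLines("7 8 9", f2)

combined <- combine_files(c(f1, f2))
dim(combined)
```

For the real data, the same helper would be called with the paths of the three files, and the result could be written out with write.table before being read back as above.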

Third, create an object to hold the calculation times, and measure the calculation time in the two cases while varying the size of the dataset.

```r
# create an object to hold the calculation times
# (one column per condition)
calc.time <- matrix(NA, nrow=nsize, ncol=2,
                    dimnames=list(size, c("kmeans with data.frame",
                                          "bigkmeans with matrix")))

# measure calculation time
for (i in 1:nsize) {
  size.i <- size[i]
  gisette.km.i <- gisette.km[1:size.i, ]
  gisette.bkm.i <- gisette.bkm[1:size.i, ]

  # 1. kmeans with data.frame
  cat("1.kmeans with data.frame", "\n")
  calc.time[i, 1] <- system.time(
                       kmeans(gisette.km.i, centers, iter.max,
                              nstart, algorithm)
                     )[3]
  rm(gisette.km.i)
  gc()

  # 2. bigkmeans with matrix
  cat("2.bigkmeans with matrix", "\n")
  calc.time[i, 2] <- system.time(
                       bigkmeans(gisette.bkm.i, centers, iter.max,
                                 nstart)
                     )[3]
  rm(gisette.bkm.i)
  gc()
}
```
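The third element of the vector returned by system.time is the elapsed wall-clock time, which is what the loop stores. As a minimal sketch of the same timing pattern using only base R, the following times kmeans on a small random dataset passed once as a data.frame and once as a matrix. time_elapsed is a hypothetical helper, and the data are random rather than Gisette.

```r
# Hypothetical helper: elapsed wall-clock seconds for an expression.
time_elapsed <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

set.seed(1)
m  <- matrix(rnorm(500 * 20), ncol = 20)  # random data, illustrative only
df <- as.data.frame(m)

# Time the same kmeans call on both representations of the data.
elapsed <- c(
  data.frame = time_elapsed(kmeans(df, centers = 2, iter.max = 50,
                                   nstart = 5, algorithm = "MacQueen")),
  matrix     = time_elapsed(kmeans(m, centers = 2, iter.max = 50,
                                   nstart = 5, algorithm = "MacQueen"))
)
elapsed
```

On data this small the timings are dominated by noise, so this only illustrates the mechanics, not the result of the experiment.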

Finally, plot the result.

```r
col <- c("blue", "red")
matplot(size, calc.time, type="l",
        col=col, lty=1, xlab="N", ylab="time[s]")
# label each line with the name of its condition
legend(2000, 6000, legend=colnames(calc.time),
       col=col, lty=1, cex=0.8)
```

The result is shown below:

The plot clearly shows that bigkmeans is faster than kmeans even for an ordinary matrix object: by a factor of 1.26 at N=5000, 1.39 at N=7500, 1.83 at N=10000, and almost 2 at N=11000.
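These factors are the ratio of the two columns of calc.time. A minimal sketch of that arithmetic follows, using placeholder timings that are not the measured results. demo.time and speedup are names introduced here for illustration.

```r
# Placeholder timings in seconds (NOT the measured values from this post).
# matrix() fills column by column: first column kmeans, second bigkmeans.
demo.time <- matrix(c(10, 30,
                       8, 20),
                    nrow = 2,
                    dimnames = list(c("1000", "3000"),
                                    c("kmeans with data.frame",
                                      "bigkmeans with matrix")))

# Speed-up factor: how many times faster bigkmeans is at each size.
speedup <- demo.time[, 1] / demo.time[, 2]
speedup
```

Applied to the real calc.time matrix, the same division gives one factor per dataset size.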

I’ll try datasets with fewer features in the near future.

The Bigmemory Project (vignette)
Big data analysis in R (sorry, in Japanese)
