(This article was first published on **sfchaos' blog**, and kindly contributed to R-bloggers)

bigmemory is an excellent package for handling large matrices in R. “The Bigmemory Project” provides several sister packages: biganalytics for analysis, bigtabulate for tabulation, bigalgebra for linear algebra, and synchronicity for synchronization via mutexes, interprocess communication, and message passing.

biganalytics provides a few functions for analysis: linear regression models, generalized linear models, and clustering. In this post, I would like to focus on clustering, namely the bigkmeans function. There are several k-means algorithms, for example the Hartigan-Wong, Lloyd, Forgy, and MacQueen methods; bigkmeans implements the last. The authors say in the manual that bigkmeans also works for ordinary matrix objects. Where does bigkmeans outperform ordinary kmeans? I decided to run an experiment.
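The four algorithms named above are the ones that base R's `kmeans()` accepts through its `algorithm` argument. As a minimal sketch on synthetic toy data (the data, `centers = 2`, and the other settings are illustrative assumptions, not part of the experiment in this post), they can be run side by side like this:

```r
# Toy data: two well-separated groups of 50 points in 2 dimensions
# (synthetic, for illustration only).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# The four algorithms accepted by stats::kmeans().
algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

# Fit the same data once per algorithm.
fits <- lapply(algorithms, function(a) {
  kmeans(x, centers = 2, iter.max = 20, nstart = 5, algorithm = a)
})
names(fits) <- algorithms

# Cluster sizes found by each algorithm (one column per algorithm).
sapply(fits, function(f) sort(f$size))
```

On data this simple the algorithms should agree; differences in speed and memory use only become visible on large inputs such as the one used below.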

The “Gisette Data Set” was used. This dataset is well known from the handwritten digit recognition problem and is one of the datasets of the NIPS 2003 feature selection challenge. It contains 13,500 records and 5,000 features.

The experiments were run under two conditions:

1. kmeans with a data.frame

2. bigkmeans with a matrix

Here is the source code:

First of all, load the biganalytics package and set the parameters for the k-means algorithms.
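The parameter values themselves are not listed here. A minimal sketch of such a setup might look like the following. The variable names are the ones used by the timing code later in the post, and the sample sizes match the N values reported in the results; the values of `centers`, `iter.max`, `nstart`, and `algorithm` are assumptions chosen for illustration.

```r
library(biganalytics)  # provides bigkmeans()

# Sample sizes: these match the N values reported in the results.
size  <- c(5000, 7500, 10000, 11000)
nsize <- length(size)

# Assumed k-means settings (placeholders, not confirmed by the post).
centers   <- 2           # number of clusters
iter.max  <- 10          # maximum number of iterations per run
nstart    <- 1           # number of random starts
algorithm <- "MacQueen"  # algorithm for kmeans(); chosen here to match bigkmeans
```

Using the same algorithm for `kmeans` as the one `bigkmeans` implements keeps the comparison about the implementation rather than the method, but that choice is an assumption as well.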


Second, read the dataset as a data.frame object, and also convert it to a matrix object for use in bigkmeans. Note that the data file was generated by combining “gisette_train.data”, “gisette_test.data”, and “gisette_valid.data”.

    # read data
    gisette.km <- read.table("../data/gisette_all.data", sep="",
                             header=FALSE)
    gisette.bkm <- as.matrix(gisette.km)
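The combined file read above is not part of the original Gisette distribution. One way it could be produced is sketched below, assuming the three files share the same whitespace-delimited layout and sit in the same `../data` directory. The helper `combine_files` is hypothetical, and the row order (train, test, valid) is an assumption taken from the order in which the files are named.

```r
# Concatenate several text files into one, preserving the order
# in which the paths are given. Returns the number of lines written.
combine_files <- function(paths, out) {
  lines <- unlist(lapply(paths, readLines))
  writeLines(lines, out)
  invisible(length(lines))
}

# Hypothetical locations; adjust to wherever the Gisette files live.
parts <- file.path("../data", c("gisette_train.data",
                                "gisette_test.data",
                                "gisette_valid.data"))

# Only build the combined file when all three inputs are present.
if (all(file.exists(parts))) {
  combine_files(parts, "../data/gisette_all.data")
}
```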


Third, create the object that stores the calculation times, and measure the calculation time in the two cases while varying the size of the dataset.

    # matrix for storing the calculation times:
    # one row per sample size, one column per method
    calc.time <- matrix(NA, nrow=nsize, ncol=2,
                        dimnames=list(size, c("kmeans with data.frame",
                                              "bigkmeans with matrix")))

    # measure calculation time
    for (i in seq_len(nsize)) {
      # take the first size.i records of each object
      size.i <- size[i]
      gisette.km.i <- gisette.km[1:size.i, ]
      gisette.bkm.i <- gisette.bkm[1:size.i, ]

      # 1. kmeans with data.frame
      cat("1.kmeans with data.frame", "\n")
      calc.time[i, 1] <- system.time(
        kmeans(gisette.km.i, centers, iter.max,
               nstart, algorithm)
      )[3]  # third element is the elapsed time
      rm(gisette.km.i)
      gc()

      # 2. bigkmeans with matrix
      cat("2.bigkmeans with matrix", "\n")
      calc.time[i, 2] <- system.time(
        bigkmeans(gisette.bkm.i, centers, iter.max,
                  nstart)
      )[3]  # third element is the elapsed time
      rm(gisette.bkm.i)
      gc()
    }


Finally, plot the result.
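The plotting code is not shown here. A minimal sketch, assuming a `calc.time` matrix with one row per sample size and one column per method, could use `matplot` as follows. The helper `plot_calc_time`, the axis labels, and the demo numbers are illustrative assumptions; the demo timings are placeholders, not measurements from this experiment.

```r
# Plot elapsed time against sample size, one line per method.
plot_calc_time <- function(size, calc.time) {
  matplot(size, calc.time, type = "b", pch = 1:2, lty = 1:2, col = 1:2,
          xlab = "number of records", ylab = "elapsed time (sec)")
  legend("topleft", legend = colnames(calc.time),
         pch = 1:2, lty = 1:2, col = 1:2)
}

# Dry run with placeholder timings (NOT measured values).
demo.size <- c(1000, 2000, 3000)
demo.time <- cbind("kmeans with data.frame" = c(1, 2, 4),
                   "bigkmeans with matrix"  = c(1, 1.5, 2.5))
plot_calc_time(demo.size, demo.time)
```

With the objects from the timing loop, the call would be `plot_calc_time(size, calc.time)`.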


The results clearly show that bigkmeans is faster than kmeans even for an ordinary matrix object: by a factor of 1.26 at N=5000, 1.39 at N=7500, 1.83 at N=10000, and almost 2 at N=11000.
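Speed-up factors of this kind are the ratio of the two columns of the timing matrix. A small sketch with hypothetical timings (the numbers below are made up for illustration and are not the measurements behind the factors above):

```r
# Hypothetical elapsed times in seconds (NOT the measurements from this post).
# matrix() fills column by column: first the kmeans column, then bigkmeans.
calc.time <- matrix(c(30, 40,
                      20, 20),
                    nrow = 2, ncol = 2,
                    dimnames = list(c("1000", "2000"),
                                    c("kmeans with data.frame",
                                      "bigkmeans with matrix")))

# Speed-up of bigkmeans over kmeans for each sample size.
speedup <- calc.time[, 1] / calc.time[, 2]
round(speedup, 2)
```

A value above 1 means bigkmeans finished faster than kmeans at that sample size.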

I'll try datasets with fewer features in the near future.

Links:

The Bigmemory Project (vignette)

Big data analysis in R (sorry, in Japanese)
