Massive online data stream mining with R

Posted on March 25, 2013 by BNOSAC - Belgium Network of Open Source Analytical Consultants in R bloggers | 0 Comments

[This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few weeks ago, the stream package has been released on CRAN. It allows to do real time analytics on data streams. This can be very usefull if you are working with large datasets which are already hard to put in RAM completely, let alone to build some statistical model on it without getting into RAM problems.

Most of the standard statistical algorithms require access to all data points and make several iterations over the data and are less suited for usage in R on big datasets.

Streaming algorithms on the other hand are characterised by

single passes over the data,
using a limited amount of storage space and RAM
work in a limited amount of time
be ready to use the model at any time

The stream package is currently focussed on clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list.

Current available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.

The stream package allows you to easily extend the use of the models with different data sources. These can be SQL sources, Hadoop, Storm, Hive, simple csv files, flat files or other connections. It is quite easy to extend it towards other connections. As an example, the following code available at this gist (https://gist.github.com/jwijffels/5239198) allows it to connect to an ffdf from the ff package. This allows to do clustering on ff objects.

Below, you can find a toy example showing streaming clustering in R based on data in an ffdf.

Load the packages & the Data Stream Data for ffdf objects

require(devtools)
require(stream)
require(ff)
source_gist("5239198")

Set up a data stream

myffdf <- as.ffdf(iris)
myffdf <- myffdf[c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
mydatastream <- DSD_FFDFstream(x = myffdf, k = 100, loop=TRUE) 
mydatastream

Build the streaming clustering model

#### Get some points from the data stream
get_points(mydatastream, n=5)
mydatastream

#### Cluster (first part)
myclusteringmodel <- DSC_CluStream(k = 100)
cluster(myclusteringmodel, mydatastream, 1000)
myclusteringmodel
plot(myclusteringmodel)

#### Cluster (second part)
kmeans <- DSC_Kmeans(3)
recluster(kmeans, myclusteringmodel)
plot(kmeans, mydatastream, n = 150, main = "Streaming model - with 3 clusters")

This approach is a standard 2-step approach which combines streaming micro clustering with macro clustering using a basic kmeans algorithm.

If you need help in understanding how your data can help you, if you need training and support on the efficient use of R, let us know how we can help you out.

To leave a comment for the author, please follow the link and comment on their blog: BNOSAC - Belgium Network of Open Source Analytical Consultants.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Massive online data stream mining with R

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)