RMOA: Massive online data stream classifications with R & MOA

Posted on May 18, 2014 by BNOSAC - Belgium Network of Open Source Analytical Consultants in R bloggers

[This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers].

For those who don't know MOA: it stands for Massive On-line Analysis, an open-source framework for building and running machine learning and data mining experiments on evolving data streams. According to the MOA website (http://moa.cms.waikato.ac.nz), it contains machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines.

For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data streams in general, have some nice features:

Uses a limited amount of memory, so building a model does not cause RAM issues
Processes one example at a time and looks at each example only once
Works incrementally, so the model is ready to be used for prediction at any point
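
To make these three properties concrete, here is a minimal base R sketch that is not taken from MOA or RMOA. It keeps a running mean and variance with Welford's algorithm, used here as a simple stand-in for a streaming learner. The function names (new_running_stats, update_running_stats, running_variance) are made up for this illustration.

```r
## Illustration only: a one-pass, constant-memory summary in base R.
## These helper names are hypothetical and not part of MOA or RMOA.

## The whole "model" is three numbers, however many examples are seen
new_running_stats <- function() {
  list(n = 0, mean = 0, m2 = 0)
}

## Welford's update: fold one new observation into the state
update_running_stats <- function(state, x) {
  n <- state$n + 1
  delta <- x - state$mean
  new_mean <- state$mean + delta / n
  m2 <- state$m2 + delta * (x - new_mean)
  list(n = n, mean = new_mean, m2 = m2)
}

## Sample variance of everything seen so far (NA until 2 observations)
running_variance <- function(state) {
  if (state$n < 2) NA_real_ else state$m2 / (state$n - 1)
}

## Feed the values one at a time; each value is looked at exactly once
state <- new_running_stats()
for (x in iris$Sepal.Length) {
  state <- update_running_stats(state, x)
  ## 'state' is a valid summary at this point and could be used right away
}

state$mean
running_variance(state)
```

The state never grows with the number of examples, which is the same idea that lets a stream classifier avoid RAM issues. The results should agree with mean() and var() computed on the full vector.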

Unfortunately, MOA is written in Java and is not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (this blog item gave an example of using ff alongside the stream package). In our day-to-day use cases, however, classification is a more common request, and the stream package only supports clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this, the RMOA package was created; it is available on GitHub (https://github.com/jwijffels/RMOA).

The current features of RMOA are:

Easy to set up data streams on data in RAM (data.frame/matrix), data in files (csv, delimited, flat table) as well as out-of-memory data in an ffdf (ff package).
Easy to set up a MOA classification model
There are 26 classification models available, in the following groups:
1. Classification Trees (AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree)
2. Bayes Rule (NaiveBayes, NaiveBayesMultinomial)
3. Ensemble learning
  - Bagging (LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT)
  - Boosting (OCBoost, OzaBoost, OzaBoostAdwin)
  - Stacking (LimAttClassifier)
  - Other (AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm)
4. Active learning (ActiveClassifier)
Familiar R interface to train the model on streaming data using a formula, as in trainMOA(model, formula, data, subset, na.action = na.exclude, ...)
Easy to score new data with the trained model, as in predict(object, newdata, type = "response", ...)
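
To give a feel for what a stream over a file means, the base R sketch below reads a CSV in fixed-size chunks through an open connection and folds each chunk into a small running state. It is only an illustration of the idea and does not show how RMOA implements its file streams. The helper fold_csv_chunks and its arguments are hypothetical names.

```r
## Illustration only: chunk-wise processing of a CSV file in base R.
## fold_csv_chunks() is a hypothetical helper, not an RMOA function.
fold_csv_chunks <- function(path, chunk_size, init, step) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)          # keep the header for every chunk
  state <- init
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    state <- step(state, chunk)            # only one chunk is in memory here
  }
  state
}

## Write iris to a temporary file so that the example is self-contained
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)

## Count the examples per class, 40 rows at a time
species_counts <- fold_csv_chunks(
  path, chunk_size = 40,
  init = c(setosa = 0, versicolor = 0, virginica = 0),
  step = function(state, chunk) {
    tab <- table(factor(chunk$Species, levels = names(state)))
    state + as.vector(tab)
  })
species_counts
```

Only chunk_size rows are held in memory at any time, so the same loop works on files much larger than RAM. A streaming model would update itself inside step instead of updating a vector of counts.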

Feel free to use it. We welcome feedback on your day-to-day experience with RMOA at https://github.com/jwijffels/RMOA so that we can improve the package. For more documentation on MOA for R users, see http://jwijffels.github.io/RMOA/

An example of R code which constructs a HoeffdingTree and a boosted set of HoeffdingTrees is shown below.

## 
## Installation from github
## 
library(devtools)
install.packages("ff")
install.packages("rJava")
install_github("jwijffels/RMOA", subdir="RMOAjars/pkg")
install_github("jwijffels/RMOA", subdir="RMOA/pkg")

## 
## HoeffdingTree example
## 
library(RMOA)
## Set up a HoeffdingTree which uses a Gaussian estimator for numeric attributes
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
hdt
## Define a stream - e.g. a stream based on a data.frame
data(iris)
iris <- factorise(iris)
irisdatastream <- datastream_dataframe(data=iris)
  
## Train the HoeffdingTree on the iris dataset
mymodel <- trainMOA(model = hdt, 
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, 
  data = irisdatastream)
## Predict using the HoeffdingTree on the iris dataset
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)

## 
## Boosted set of HoeffdingTrees
## 
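## Create a fresh stream on the iris data for the second model
irisdatastream <- datastream_dataframe(data=iris)
## Keep the untrained model definition and the trained model in separate objects
boost <- OzaBoost(baseLearner = "trees.HoeffdingTree", ensembleSize = 30)
mymodel <- trainMOA(model = boost, 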
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, 
  data = irisdatastream)
  
## Predict 
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)
