RMOA: Massive online data stream classifications with R & MOA
[This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers.]
For those of you who don't know MOA: it stands for Massive Online Analysis and is an open-source framework for building and running machine learning and data mining experiments on evolving data streams. According to the MOA website (http://moa.cms.waikato.ac.nz), it contains machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines.
For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data stream algorithms in general, have some nice features. Namely, a data stream algorithm:
 Uses a limited amount of memory, so building a model does not cause RAM issues.
 Processes one example at a time, and looks at each example only once.
 Works incrementally, so that a model is ready to be used for prediction at any point.
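To make these three properties concrete, here is a minimal sketch in base R. It is not how MOA or RMOA is implemented; it is a toy one-pass estimator (Welford's running mean and variance) that keeps a fixed-size state, sees each observation once, and can be queried at any time. The function names `new_running_stats`, `update_running_stats` and `running_variance` are made up for this illustration.

```r
# Toy illustration of stream-style learning in base R (not MOA code).
# The "model" is a fixed-size state: a count, a running mean and a running
# sum of squared deviations. Memory use does not grow with the data.
new_running_stats <- function() {
  list(n = 0, mean = 0, m2 = 0)
}

# Process exactly one observation and return the updated state
# (Welford's online update).
update_running_stats <- function(state, x) {
  n <- state$n + 1
  delta <- x - state$mean
  new_mean <- state$mean + delta / n
  m2 <- state$m2 + delta * (x - new_mean)
  list(n = n, mean = new_mean, m2 = m2)
}

# The state can be queried at any time, also halfway through the stream.
running_variance <- function(state) {
  if (state$n < 2) NA_real_ else state$m2 / (state$n - 1)
}

# Feed the observations one by one, as if they arrived from a stream.
state <- new_running_stats()
for (value in iris$Sepal.Length) {
  state <- update_running_stats(state, value)
}

state$mean               # matches mean(iris$Sepal.Length)
running_variance(state)  # matches var(iris$Sepal.Length)
```

A stream classifier such as a Hoeffding tree follows the same pattern, only with a richer state (the tree and its sufficient statistics) in place of three numbers.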
Unfortunately, MOA is written in Java and is not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (an earlier blog item gave an example of using ff alongside the stream package). In our day-to-day use cases, however, classification is a more common request, and the stream package only covers clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this, the RMOA package was created; it is available on GitHub (https://github.com/jwijffels/RMOA).
The current features of RMOA are:
 Easy to set up data streams on data in RAM (data.frame/matrix), data in files (csv, delimited, flat table) as well as out-of-memory data in an ffdf (ff package).
 Easy to set up a MOA classification model

There are 26 classification models available, in the following groups:
 Classification Trees (AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree)
 Bayes Rule (NaiveBayes, NaiveBayesMultinomial)
 Ensemble learning
  Bagging (LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT)
  Boosting (OCBoost, OzaBoost, OzaBoostAdwin)
  Stacking (LimAttClassifier)
  Other (AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm)
 Active learning (ActiveClassifier)

An easy, R-familiar formula interface to train the model on streaming data, as in
trainMOA(model, formula, data, subset, na.action = na.exclude, ...)

Easy to score new data with the trained model, as in
predict(object, newdata, type = "response", ...)
Feel free to use it. We welcome any feedback from your day-to-day RMOA usage at https://github.com/jwijffels/RMOA in order to improve the package. For more documentation on MOA for R users, see http://jwijffels.github.io/RMOA/
An example of R code which constructs a HoeffdingTree and a boosted set of HoeffdingTrees is shown below.
##
## Installation from github
##
library(devtools)
install.packages("ff")
install.packages("rJava")
install_github("jwijffels/RMOA", subdir="RMOAjars/pkg")
install_github("jwijffels/RMOA", subdir="RMOA/pkg")

##
## HoeffdingTree example
##
require(RMOA)
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
hdt

## Define a stream - e.g. a stream based on a data.frame
data(iris)
iris <- factorise(iris)
irisdatastream <- datastream_dataframe(data=iris)

## Train the HoeffdingTree on the iris dataset
mymodel <- trainMOA(model = hdt,
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length,
  data = irisdatastream)

## Predict using the HoeffdingTree on the iris dataset
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)

##
## Boosted set of HoeffdingTrees
##
irisdatastream <- datastream_dataframe(data=iris)
mymodel <- OzaBoost(baseLearner = "trees.HoeffdingTree", ensembleSize = 30)
mymodel <- trainMOA(model = mymodel,
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length,
  data = irisdatastream)

## Predict
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)