RcppMLPACK2 and the MLPACK Machine Learning Library

February 19, 2017

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

mlpack

mlpack is, to quote, “a scalable machine learning library, written in C++,
that aims to provide fast, extensible implementations of cutting-edge machine learning
algorithms”. It was written by Ryan Curtin and others, and is
described in two papers, in BigLearning (2011) and
JMLR (2013). mlpack uses
Armadillo as the underlying linear algebra library, which, thanks to
RcppArmadillo, is already a rather
well-known library in the R ecosystem.

RcppMLPACK1

Qiang Kou has created the
RcppMLPACK package on CRAN for easy-to-use
integration of mlpack with R. It integrates the
mlpack sources, and is, as a CRAN package, widely available on all
platforms.

However, this RcppMLPACK package is based on a
by-now dated version of mlpack. Quoting again: “mlpack provides these
algorithms as simple command-line programs and C++ classes which can then be integrated into
larger-scale machine learning solutions.”
Version 2 of the mlpack sources
switched to a slightly more encompassing build also requiring the Boost
libraries ‘program_options’, ‘unit_test_framework’ and ‘serialization’. Within the context of an R
package, we could condition out the first two as R provides both the direct interface (hence no need
to parse command-line options) and also the testing framework. However, it would be both difficult
and potentially undesirable to condition out the serialization which allows
mlpack to store and resume machine learning tasks.

We refer to this version now as RcppMLPACK1.

RcppMLPACK2

As of February 2017, the current version of mlpack is 2.1.1. As it
requires external linking with (some) Boost libraries as well as with
Armadillo, we have created a new package
RcppMLPACK2 inside a new
GitHub organization RcppMLPACK.

Linux

This package works fine on Linux provided mlpack,
Armadillo and Boost are installed.

OS X / macOS

For macOS / OS X, James Balamuta has tried to set up a homebrew
recipe, but there are some tricky interactions with the compiler suites used by both brew and R on
macOS.

Windows

For Windows, one could do what Jeroen Ooms has done and build
(external) libraries. Volunteers are encouraged to get in touch via the issue tickets at GitHub.

Installation from source

Releases are available from a drat repository hosted
in the GitHub organization RcppMLPACK. So

drat:::add("RcppMLPACK")         # first add the repo
install.packages("RcppMLPACK2")  # install the package
update.packages()                # or update to newer one (if one exists)

will use this. If you would rather install whatever commit is currently on GitHub,

remotes::install_github("rcppmlpack/rcppmlpack2")

will work as well.

Example: Logistic Regression

To illustrate mlpack we show a first simple example also included in the
package. As with the rest of the Rcpp Gallery, these are “live” code examples.

#include <RcppMLPACK.h>				// MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/logistic_regression/logistic_regression.hpp> 	// particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
Rcpp::List logisticRegression(const arma::mat& train,
                              const arma::irowvec& labels,
                              const Rcpp::Nullable<Rcpp::NumericMatrix>& test = R_NilValue) {
    
    // MLPACK wants Row<size_t>, which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments
    mlpack::regression::LogisticRegression<> lrc(train, labelsur);
    
    arma::vec parameters = lrc.Parameters();

    Rcpp::List return_val;
    
    if (test.isNotNull()) {
        arma::mat test2 = Rcpp::as<arma::mat>(test);
        lrc.Classify(test2, resultsur);
        arma::vec results = arma::conv_to<arma::vec>::from(resultsur);
        return_val = Rcpp::List::create(Rcpp::Named("parameters") = parameters,
                                        Rcpp::Named("results") = results);
    } else {
        return_val = Rcpp::List::create(Rcpp::Named("parameters") = parameters);
    }

    return return_val;

}
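
The TODO above about non-negative labels matters because converting a negative signed integer to an unsigned type does not fail: it silently wraps around to a very large value. A minimal sketch of such a check, written against plain standard-library containers rather than Armadillo types, might look as follows. The helper name toUnsignedLabels is hypothetical and is not part of RcppMLPACK2 or mlpack.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of RcppMLPACK2 or mlpack): convert signed
// labels, as they arrive from R, to the unsigned type mlpack expects,
// rejecting negative values instead of letting them wrap around.
std::vector<std::size_t> toUnsignedLabels(const std::vector<int>& labels) {
    std::vector<std::size_t> out;
    out.reserve(labels.size());
    for (int label : labels) {
        if (label < 0) {
            throw std::invalid_argument("labels must be non-negative");
        }
        out.push_back(static_cast<std::size_t>(label));
    }
    assert(out.size() == labels.size());  // every label was converted
    return out;
}
```

In an Rcpp function the same test could run on the arma::irowvec before the conv_to call, with the error reported through Rcpp::stop().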

We can then call logisticRegression() with the same (trivial) data set as used in the first unit
test for it:

logisticRegression(matrix(c(1, 2, 3, 1, 2, 3), nrow=2, byrow=TRUE), c(1L, 1L, 0L))
$parameters
[1]  67.9550 -13.6328 -13.6328
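
To see what these parameters mean, the sketch below applies them by hand to the three training points, which are the columns (1,1), (2,2) and (3,3) of the input matrix. It assumes that the first parameter is the intercept, that the remaining ones are per-feature weights, and that a probability of at least 0.5 maps to class 1. These assumptions are not stated in the text above, and the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic (sigmoid) function mapping a linear score to a probability.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Predict a 0/1 label for a single observation x.
// Assumption: params[0] is the intercept and params[1..] are the weights.
std::size_t predictLabel(const std::vector<double>& params,
                         const std::vector<double>& x) {
    assert(params.size() == x.size() + 1);
    double z = params[0];
    for (std::size_t i = 0; i < x.size(); ++i) {
        z += params[i + 1] * x[i];
    }
    return sigmoid(z) >= 0.5 ? 1 : 0;
}
```

Under these assumptions the linear scores for the three points are roughly 40.7, 13.4 and -13.8, so the predicted labels are 1, 1 and 0, which agrees with the labels passed to logisticRegression().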

Example: Naive Bayes Classifier

A second example shows the NaiveBayesClassifier class.

#include <RcppMLPACK.h>				// MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp> 	// particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
                                   const arma::mat& test,
                                   const arma::irowvec& labels,
                                   const int& classes) {
    
    // MLPACK wants Row<size_t>, which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments
    mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);
    
    nbc.Classify(test, resultsur);
    
    arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);
    
    return results;
}
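
For readers who want a feel for what such a classifier computes, here is a rough standard-library sketch of a Gaussian naive Bayes model: per-class feature means and variances plus class priors, with classification by the largest log-posterior. It is an illustration of the general idea under those assumptions, not mlpack's implementation, and all type and function names in it are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative types only; none of these names come from mlpack.
// Each inner vector is one observation (mlpack itself stores one per column).
using Rows = std::vector<std::vector<double>>;

struct GaussianNB {
    Rows mean;                  // mean[c][j]: mean of feature j within class c
    Rows var;                   // var[c][j]: variance of feature j within class c
    std::vector<double> prior;  // prior[c]: share of training points in class c
};

// Precondition: every label in y is smaller than `classes`.
GaussianNB fit(const Rows& X, const std::vector<std::size_t>& y, std::size_t classes) {
    assert(!X.empty() && X.size() == y.size());
    const std::size_t d = X[0].size();
    GaussianNB m{Rows(classes, std::vector<double>(d, 0.0)),
                 Rows(classes, std::vector<double>(d, 0.0)),
                 std::vector<double>(classes, 0.0)};
    std::vector<double> count(classes, 0.0);
    for (std::size_t i = 0; i < X.size(); ++i) {  // per-class sums
        count[y[i]] += 1.0;
        for (std::size_t j = 0; j < d; ++j) m.mean[y[i]][j] += X[i][j];
    }
    for (std::size_t c = 0; c < classes; ++c) {
        m.prior[c] = count[c] / static_cast<double>(X.size());
        for (std::size_t j = 0; j < d; ++j)
            if (count[c] > 0.0) m.mean[c][j] /= count[c];
    }
    for (std::size_t i = 0; i < X.size(); ++i)    // squared deviations
        for (std::size_t j = 0; j < d; ++j) {
            const double diff = X[i][j] - m.mean[y[i]][j];
            m.var[y[i]][j] += diff * diff;
        }
    for (std::size_t c = 0; c < classes; ++c)
        for (std::size_t j = 0; j < d; ++j)
            // a small floor keeps the log-density finite for constant features
            m.var[c][j] = (count[c] > 0.0 ? m.var[c][j] / count[c] : 0.0) + 1e-9;
    return m;
}

// Return the class with the largest log-prior plus Gaussian log-likelihood.
std::size_t classify(const GaussianNB& m, const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    std::size_t best = 0;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < m.prior.size(); ++c) {
        if (m.prior[c] <= 0.0) continue;  // class never seen in training
        double score = std::log(m.prior[c]);
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double diff = x[j] - m.mean[c][j];
            score -= 0.5 * (std::log(2.0 * pi * m.var[c][j]) + diff * diff / m.var[c][j]);
        }
        if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;
}
```

Working in log space avoids the numerical underflow that multiplying many small densities would cause.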

We need a quick helper function to get test data, again mimicking the unit tests:

#include <RcppMLPACK.h>				// MLPACK, Rcpp and RcppArmadillo

#include <mlpack/core.hpp> 	// mlpack core, provides mlpack::data::Load

// [[Rcpp::depends(RcppMLPACK)]]


// [[Rcpp::export]]
Rcpp::List getData(const char* trainFilename, const char* testFilename) {
    arma::mat trainData, testData;
    mlpack::data::Load(trainFilename, trainData, true); // note implicit transpose
    mlpack::data::Load(testFilename, testData, true);

    // Get the labels, then remove them from data
    arma::rowvec trainlabels = trainData.row(trainData.n_rows - 1);
    arma::rowvec testlabels = testData.row(testData.n_rows - 1);
    trainData.shed_row(trainData.n_rows - 1);
    testData.shed_row(testData.n_rows - 1);    // index by testData's own row count
    return(Rcpp::List::create(Rcpp::Named("trainData")   = Rcpp::wrap(trainData),
                              Rcpp::Named("testData")    = Rcpp::wrap(testData),
                              Rcpp::Named("trainlabels") = trainlabels,
                              Rcpp::Named("testlabels")  = testlabels));
}
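
The layout that getData() relies on can be confusing at first: a CSV file has one observation per line with the label in the last field, while after the transposing load each observation is a column and the labels form the last row. The sketch below mimics that step with standard-library containers only. The names LabelledData and splitLabels are hypothetical and serve purely as illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

struct LabelledData {
    Matrix features;             // features[j][i]: feature j of observation i
    std::vector<double> labels;  // labels[i]: label of observation i
};

// csvRows[i] holds one observation as read from a CSV line, label last.
// The result stores features transposed (one observation per column) and
// the labels separately, similar in spirit to getData() above.
LabelledData splitLabels(const Matrix& csvRows) {
    if (csvRows.empty() || csvRows[0].size() < 2) {
        throw std::invalid_argument("need at least one feature and a label");
    }
    const std::size_t n = csvRows.size();         // number of observations
    const std::size_t d = csvRows[0].size() - 1;  // number of features
    LabelledData out{Matrix(d, std::vector<double>(n)), std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        if (csvRows[i].size() != d + 1) {
            throw std::invalid_argument("all rows must have the same length");
        }
        for (std::size_t j = 0; j < d; ++j) {
            out.features[j][i] = csvRows[i][j];
        }
        out.labels[i] = csvRows[i][d];
    }
    assert(out.labels.size() == n);  // one label per observation
    return out;
}
```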

Now we can fetch the data from R and use it to call the classifier:

rl <- getData("/home/edd/git/mlpack/src/mlpack/tests/data/trainSet.csv", # should add to RcppMLPACK2
              "/home/edd/git/mlpack/src/mlpack/tests/data/testSet.csv")
trainData <- rl[["trainData"]]
testData <- rl[["testData"]]
trainlabels <- rl[["trainlabels"]]
testlabels <- rl[["testlabels"]]
res <- naiveBayesClassifier(trainData, testData, trainlabels, 2)
## res was a row vector but comes back as a 1-row matrix
all.equal(res[1,],  testlabels)
[1] TRUE

As we can see, the computed classification on the test set corresponds to the expected
classification in testlabels.
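
When predictions and expected labels do not match exactly, a single TRUE or FALSE is less informative than the share of correct predictions. A small sketch of such an accuracy measure is shown below; the function name accuracy is hypothetical and not taken from mlpack or RcppMLPACK2.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Fraction of positions at which the predicted label equals the expected one.
double accuracy(const std::vector<std::size_t>& predicted,
                const std::vector<std::size_t>& expected) {
    if (predicted.empty() || predicted.size() != expected.size()) {
        throw std::invalid_argument("need two non-empty vectors of equal length");
    }
    std::size_t hits = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == expected[i]) ++hits;
    }
    assert(hits <= predicted.size());
    return static_cast<double>(hits) / static_cast<double>(predicted.size());
}
```

A value of 1.0 corresponds to the all.equal() result of TRUE above.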
