mlr loves OpenML

September 8, 2016

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

OpenML stands for Open Machine Learning and is an
online platform that supports collaborative machine learning.
It is an Open Science project that allows its users to share data, code
and machine learning experiments.

At the time of writing this blog post I am in Eindhoven at an OpenML
workshop, where developers and scientists
meet to work on improving the project. Some of these people are R users, and they (we)
are developing an R package that
communicates with the OpenML platform.

OpenML in R

The OpenML R package can list and download data sets and machine
learning tasks (prediction challenges). In R one can run algorithms on
these data sets/tasks and
then upload the results to OpenML. After a successful upload, the website shows how well the
algorithm performs. To run an algorithm on a given task, the OpenML R package
builds on the mlr package. mlr understands
what a task is and can run learners on that task. So all the OpenML package
needs to do is convert the OpenML objects to objects mlr understands, and then
mlr deals with the learning.
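
As a rough illustration of the mlr side of this division of labour, the sketch below builds a task, a learner and a resampling description by hand and runs them. The built-in iris data and the classif.rpart learner are stand-ins chosen for this example (neither appears in the study itself); for a downloaded OpenML task, the OpenML package creates the corresponding mlr objects for you.

```r
library("mlr")

set.seed(1)  # make the cross-validation split reproducible

# An mlr task bundles a data set with its prediction target.
task = makeClassifTask(data = iris, target = "Species")

# A learner wraps an algorithm; rpart is only a stand-in here.
lrn = makeLearner("classif.rpart")

# 10-fold cross-validation, the estimation procedure used later in the study.
rdesc = makeResampleDesc("CV", iters = 10)

# Run the learner on the task and measure accuracy.
res = resample(lrn, task, rdesc, measures = acc, show.info = FALSE)
res$aggr
```

The aggregated value in res$aggr is the mean test accuracy over the ten folds.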

A small case study

We want to create a little study on the OpenML
website in which we compare different types of Support
Vector Machines. The study gets an ID assigned to it, which in our case is 27.
We use the function ksvm (with different settings of the function argument type)
from the package kernlab, which is integrated in mlr (“classif.ksvm”).

For details on installing and setting up the OpenML R package please see the
guide on GitHub.

Let’s start conducting the study:

  • Load the packages and list all tasks which have between 100 and 500
    observations.
library("OpenML")
library("mlr")
library("farff")
library("BBmisc")
dsize = c(100, 500)
taskinfo_all = listOMLTasks(number.of.instances = dsize)
  • Select all supervised classification tasks that use 10-fold cross-validation
    with predictive accuracy as the evaluation measure, have no missing values and
    have either 2 or 4 classes. To keep the study simple and fast to compute,
    select only the first three tasks.
taskinfo_10cv = subset(taskinfo_all, task.type == "Supervised Classification" & 
                    estimation.procedure == "10-fold Crossvalidation" &
                    evaluation.measures == "predictive_accuracy" &
                    number.of.missing.values == 0 &
                    number.of.classes %in% c(2, 4))
taskinfo = taskinfo_10cv[1:3, ]
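
The filter above does not by itself remove several tasks that share a data set. If one task per data set is wanted, one possible approach is to drop repeated data set IDs before taking the first three rows. The sketch below uses a small made-up data frame in place of the real listing; the column name data.id is an assumption about the listing and should be checked with names(taskinfo_10cv).

```r
# Toy stand-in for the task listing; the column names task.id and data.id
# are assumptions, and the values are made up.
taskinfo_toy = data.frame(
  task.id = c(11, 12, 13, 14, 15),
  data.id = c(1, 1, 2, 3, 3)
)

# duplicated() flags every repeat of a data.id after its first occurrence,
# so negating it keeps the first task listed for each data set.
one_per_data = taskinfo_toy[!duplicated(taskinfo_toy$data.id), ]

# Take at most the first three rows, guarding against shorter listings.
taskinfo_small = head(one_per_data, 3)
taskinfo_small$task.id
```

Using head() instead of indexing with 1:3 avoids rows of NA when fewer than three tasks survive the filter.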
  • Create the learners we want to compare.
lrn.list = list(
  makeLearner("classif.ksvm", type = "C-svc"),
  makeLearner("classif.ksvm", type = "kbb-svc"),
  makeLearner("classif.ksvm", type = "spoc-svc")
)
  • Run the learners on the three tasks.
grid = expand.grid(task.id = taskinfo$task.id, 
                   lrn.ind = seq_along(lrn.list))

runs = lapply(seq_row(grid), function(i) {
  message(i)
  task = getOMLTask(grid$task.id[i])
  ind = grid$lrn.ind[i]
  runTaskMlr(task, lrn.list[[ind]])
})
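
The grid above pairs every task with every learner, so three tasks and three learners give nine runs. The base-R sketch below shows the same pattern with placeholder task IDs and a dummy function body in place of the download and run; seq_len(nrow(grid)) is used here as a base-R equivalent of seq_row(grid).

```r
# Placeholder task IDs and learner names, for illustration only.
task.ids = c(101, 102, 103)
lrn.names = c("C-svc", "kbb-svc", "spoc-svc")

# One row per (task, learner) combination.
grid = expand.grid(task.id = task.ids, lrn.ind = seq_along(lrn.names))

# Iterate over the rows; the body only builds a label where the real
# code would download the task and run the learner.
labels = vapply(seq_len(nrow(grid)), function(i) {
  paste(grid$task.id[i], lrn.names[grid$lrn.ind[i]], sep = ":")
}, character(1))

nrow(grid)
labels[1:3]
```

Because expand.grid varies its first argument fastest, the first three rows cover all three tasks for the first learner.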
  • And finally upload the runs to OpenML. The upload function (uploadOMLRun)
    returns the ID of the uploaded run object. When uploading runs that are part
    of a certain study, tag them with study_ followed by the study ID (here study_27). After uploading, the runs appear
    on the website and can be found using the tag or via the
    study homepage.
## please do not spam the OpenML server by uploading these
## tasks. I already did that.
run.id = lapply(runs, uploadOMLRun, tags = "study_27")
  • To show the results of our study, list the run evaluations and make a nice plot.
evals = listOMLRunEvaluations(tag = "study_27")

evals$task.id = as.factor(evals$task.id)
evals$setup.id = as.factor(evals$setup.id)

library("ggplot2")
ggplot(evals, aes(x = setup.id, y = predictive.accuracy, 
                  color = data.name, group = task.id)) + 
  geom_point() + geom_line()

[Figure: predictive accuracy (y-axis) per setup ID (x-axis), one line per task, colored by data set name]
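
Besides plotting, the evaluations can also be condensed into a small table. The sketch below averages the accuracy per setup with base R. It runs on a made-up data frame whose values are placeholders and not results from the study; only the column names follow the plotting code above.

```r
# Made-up values, NOT results from the study; only the column names follow
# the evaluation listing used above.
evals_toy = data.frame(
  setup.id = factor(c("a", "a", "b", "b")),
  task.id = factor(c(1, 2, 1, 2)),
  predictive.accuracy = c(0.80, 0.90, 0.70, 0.90)
)

# Mean accuracy per setup across tasks.
mean_acc = aggregate(predictive.accuracy ~ setup.id, data = evals_toy, FUN = mean)
mean_acc
```

Applied to the real evals data frame, the same call would give one mean accuracy per SVM setup.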

Now you can go ahead and create a bigger study using the techniques you have learned.

Further information

If you are interested in more, check out the OpenML
blog, the paper and the GitHub repos.
