Some Thoughts on “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?”

November 11, 2014
By

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

Sorry for the blogging break. I’ve got a few planned for the next few weeks based on some work I’ve been doing.

In the meantime, you should check out “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” by Manuel Fernandez-Delgado at JMLR. They took a large number of classifiers and ran them against a large number of data sets from UCI. I was a reviewer on this paper (it heavily relies on caret) and have been interested in seeing peoples reaction when it was made public.

My thoughts:

  • Before reading this manuscript, take a few minutes and read this oldie by David Hand.
  • Obviously, it pays to tune your model. That is not the most profound statement to make but the paper does a good job of quantifying the impact of tuning the hyper-parameters.

  • Random forest takes the top spot on a regular basis. I was surprised by this since, in my experience, boosting does better and bagging does almost as well.

  • The authors believe that the parallel version of random forest in caret had an edge in performance. That’s hard to believe since it does’t do anything more than split the forest across different cores. That’s it. I took it out of the package for a bit because, if you are tuning the model, parallelizing the cross-validation is faster. I put it back in a few verisons ago since I knew people would want it after reading this manuscript.

  • I was hoping that the authors would take a better graphical and analytical approach to comparing the methods. Table after table numbs my soul.

  • Despite the number of models used, it would have been nice to see more emphasis on recent deep-learning models and boosting via the gbm package.

To leave a comment for the author, please follow the link and comment on their blog: Blog - Applied Predictive Modeling.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)