CACM Highlights R

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers].

The Association for Computing Machinery is the main professional organization for computer science, largely for academia but still with a broad membership. ACM publishes a number of journals, most of them for research, but its flagship publication is a magazine, the Communications of the ACM.

The current issue of the CACM includes an article, “Bringing Big Data to the Big Tent,” that is mainly about R and Spark. After discussing the wide usage that R has developed, it raises the question of whether R, specifically CRAN, is too disorganized. CMU CS professor Jim Herbsleb is quoted as saying:

There’s a lot of duplication of effort out there, a lot of missed opportunities, where one scientist has developed a tool for him or herself, and with a few tweaks, or if it conformed to a particular standard or used a particular data format, it could be useful to a much wider community.

I understand his point, but I strongly disagree. I really like the free-form way that CRAN (and Bioconductor etc.) works, and appreciate the fact that when I need some utility, not only is CRAN likely to have it, but it’s likely to have several versions, by different authors, giving me a lot of choice. Besides, the various libraries available in the CS world include a lot of duplication too, yet no one seems to mind.

I do believe that there should be more structure to CRAN. The Task Views are nice, but are often nowhere near comprehensive, and some tend to be out of date. I’ve also proposed that there be a Yelp-style review system for CRAN packages.

Speaking of CRAN and Spark, the new version of my partools package (which I informally call Snowdoop) went on CRAN a few days ago. I continue to believe that Hadoop and Spark are not appropriate for the majority of R users who work with large data, and I offer partools as one alternative. The package is much more extensive than the last version. I’ll be blogging about it in the near future (and from time to time afterward, with more news and examples), but in the meantime, I recommend reading the vignette for an introduction to using the package. See also my recent talk. Note: Although the DESCRIPTION file says that partools requires a Unix-family OS, it should work fine on Windows.
