the Art of R Programming [guest post]

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers.]

(This post is the preliminary version of a book review by Alessandra Iacobucci, to appear in CHANCE. Enjoy [both the review and the book]!)

As Rob J. Hyndman enthusiastically declares in his blog, “this is a gem of a book”. I would go even further and argue that The Art of R Programming is a whole mine of gems. The book is well constructed and coherently structured.

After an introductory chapter, where the reader gets a quick overview of R basics that allows her to work through the examples in the following chapters, the rest of the book can be divided into three main parts. In the first part (Chapters 2 to 6) the reader is introduced to the main R objects and to the functions built to handle and operate on each of them. The second part (Chapters 7 to 13) is focussed on general programming issues: R structures and object-oriented nature, I/O, string handling and manipulation, and graphics. Chapter 13 is entirely devoted to debugging. The third part deals with more advanced topics, such as speed of execution and performance issues (Chapter 14), mixing functions written in R and C (or Python), and parallel processing with R. Even though this last part is intended for more experienced programmers, the overall programming skills of the intended reader “may range anywhere from those of a professional software developer to ‘I took a programming course in college’.” (p.xxii).

With a fluent style, Matloff is able to deal with a large number of topics in a relatively limited number of pages, resulting in an astonishingly complete yet handy guide. On almost every page we discover a new command, most likely the command we had always looked for and had replaced with more or less cumbersome workarounds. As a matter of fact, there may well be a ready-made and perfectly suited R function for nearly anything that comes to mind. Users coming from compiled programming languages may find it difficult to get used to this wealth of functions, just as they may feel uncomfortable not declaring variable types, not initializing vectors and arrays, or getting rid of loops. Nevertheless, through numerous examples and a precise knowledge of the language's strengths and limitations, Matloff masterfully introduces the reader to the flexibility of R. He repeatedly underlines the functional nature of R in every part of the book and stresses from the outset how this feature has to be exploited for effective programming.

“One of the most effective ways to achieve speed in R code is to use operations that are vectorized, meaning that a function applied to a vector is actually applied individually to each element.” (p.40).
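To make the idea concrete, here is a minimal sketch of my own (not an example taken from the book) comparing an explicit loop with the equivalent vectorized call. The variable names are purely illustrative.

```r
x <- c(1, 4, 9, 16)

# Loop version: visit each element explicitly, as one would in C.
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vectorized version: sqrt() is applied to every element in a single call.
roots_vec <- sqrt(x)

# Both approaches give the same result.
same_result <- identical(roots_loop, roots_vec)
print(roots_vec)
```

The vectorized form is shorter and, on long vectors, typically faster, since the element-wise work is done by compiled code rather than by the interpreter.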

The result is so convincing that it pushes even the strictest code purist to free herself from prejudices and surrender to the pleasures of an interpreted language. This probably was the hardest challenge in writing The Art of R Programming, and the author met it brilliantly.

The climax is unquestionably attained in the final chapters, where Matloff introduces some advanced and unusual topics with remarkable clarity and briskness. Within a few pages, he manages to tackle the object-oriented side of R, to advise and instruct the reader on debugging and performance issues, to show how to deal with mixed R and C (or Python) code, and finally to open new perspectives by presenting the different approaches to parallel R. There is even a mention of GPU programming, in a short paragraph that is far from exhaustive but still instructive. To my knowledge, this is the only R handbook in which parallel programming with R is tackled in some degree of detail (I only found a hint of it in R in a Nutshell, yet no programming details are given therein). Also, the importance and prominence given to debugging are commendable, since this topic is often and mistakenly disregarded in most programming handbooks, except those explicitly written on the subject. Among the sharpest passages of the book, I definitely include the ones on scope and environment issues, to which are devoted both a long section in Chapter 7 and a tiny, simple yet enlightening example as early as page 9.

“Note carefully the role of w. The R interpreter found that there was no local variable of that name, so it ascended to the next level […] where it found a variable w with value 12. […]. It is possible (though not desirable) to deliberately allow name conflicts in this hierarchy. […] In such a situation the innermost environment is used first.” (p.153).
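The lookup rule described in this passage can be sketched in a few lines. The code below is an illustration written in the spirit of the quotation, with hypothetical function names (outer_fn, inner_fn, shadow_fn); it is not reproduced from the book.

```r
w <- 12                        # defined at the top level

outer_fn <- function(y) {
  d <- 8
  inner_fn <- function() {
    # w is local neither to inner_fn nor to outer_fn, so R ascends
    # the hierarchy of environments until it finds the top-level w.
    d * (w + y)
  }
  inner_fn()
}
res_found_above <- outer_fn(2)   # uses w = 12: 8 * (12 + 2)

shadow_fn <- function(y) {
  w <- 1                       # deliberate name conflict with the top-level w
  inner_fn <- function() w + y
  inner_fn()                   # the innermost available w (here 1) is used first
}
res_shadowed <- shadow_fn(2)     # uses w = 1: 1 + 2

print(c(res_found_above, res_shadowed))
```

In both calls the top-level w is left untouched; only the place where the name is resolved changes.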

The message is clear: know exactly what you want to implement, keep track of all your objects, and scoping will not be an issue but another tool.

“In C, we would not have functions defined within functions […]. Yet, since functions are objects, it is possible–and sometimes desirable from the point of view of the encapsulation goal of object-oriented programming—to define a function within a function; we are simply creating an object, which we can do anywhere.” (p.152-3).
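As a hedged illustration of why defining a function within a function can serve encapsulation, consider the following sketch. The names make_counter and increment are my own and do not come from the book.

```r
make_counter <- function() {
  count <- 0                   # state hidden inside the enclosing function

  # A function defined within a function: it is just another object,
  # and it keeps access to the environment where it was created.
  increment <- function() {
    count <<- count + 1        # update count in the enclosing environment
    count
  }

  increment                    # return the inner function itself
}

counter <- make_counter()
first_call  <- counter()
second_call <- counter()
print(c(first_call, second_call))
```

The variable count cannot be reached directly from outside; it can only be changed through the returned function, which is the encapsulation the quotation alludes to.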

Another little gem is Section 7.9 on recursion, a concept that Matloff presents in a very clear and intuitive way. This section ends with one of the most inspired extended examples proposed in the book, where recursion is used to implement a binary search tree. Other interesting extended examples are those about discrete-event simulation (Section 7.8.3), Markov chains (Section 8.4.2) and polynomial regression (Section 9.1.7), though these applications may be a little too challenging for readers lacking a solid background in statistics.
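For readers who have not met the idea before, here is a minimal sketch of a recursive binary search tree. It is my own simplified version built on nested lists, with hypothetical function names (bst_insert, bst_inorder); it should not be read as the implementation given in the book.

```r
# Insert a value into the tree rooted at 'node'; returns the updated tree.
bst_insert <- function(node, value) {
  if (is.null(node)) {
    # Empty spot reached: create a new leaf.
    return(list(value = value, left = NULL, right = NULL))
  }
  if (value < node$value) {
    node$left <- bst_insert(node$left, value)     # recurse into the left subtree
  } else {
    node$right <- bst_insert(node$right, value)   # recurse into the right subtree
  }
  node
}

# In-order traversal: left subtree, then the node, then right subtree.
bst_inorder <- function(node) {
  if (is.null(node)) return(numeric(0))
  c(bst_inorder(node$left), node$value, bst_inorder(node$right))
}

tree <- NULL
for (v in c(8, 5, 12, 6, 2)) {
  tree <- bst_insert(tree, v)
}
sorted_values <- bst_inorder(tree)
print(sorted_values)
```

Because R copies lists on modification, each insertion returns a new tree rather than altering the old one in place, which keeps the sketch simple at the cost of some efficiency.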

Although The Art of R programming is a book of many virtues, there are in my opinion some flaws:

The presence of lines of R code starting from the first few pages encourages the user to test her understanding straight away while reading, making The Art of R Programming a sort of plug-and-play guide through R. Unfortunately, the pleasure of real-time testing is spoiled by two things. First, the reader has to copy the code line by line. This is unquestionably useful for the many simple examples scattered throughout the book. However, it may become an inexhaustible source of typos, both pointless and annoying (not to mention time-consuming), when it comes to more complicated programs like those expounded in the many Extended Example sections. Second, the data sets are unavailable, so some applications are simply unusable (I managed to find the abalone data set for the extended examples of Sections 2.9.2 and 4.4.3, thus discovering this interesting repository, but for the rest my search was rather inconclusive). I am referring here to virtually all the extended examples in Chapters 5 and 6 on data frames, factors and tables. In particular, I find the application on aids for learning Chinese dialects (Section 5.4.3) so over-elaborate as to be nearly worthless. I would certainly suggest designing a dedicated package assembling all the necessary material for a fully profitable training with the book, like the package mcsm conceived by Robert and Casella for reproducing the results contained in their book on Monte Carlo methods with R.

In addition, surely R can handle huge databases with great ease, and maybe I am giving way to my personal preferences here, but I find that two whole chapters on data frames and factors (adding up to almost 40 pages!) are perhaps too much. Conversely, I believe that the “traditional” graphics package would have deserved more space and consideration, not only in the chapter devoted to it (Chapter 12) but generally throughout the book. Indeed, the author suggests some good handbooks on the subject by Murrell and Wickham, but these are too detailed and advanced to be used for general purposes.

Despite an overall concise style, there are some long-winded passages and repetitions, especially in the applications, where certain lines of code are definitely redundant. I was likewise puzzled by the total absence in the book of the command separator ;, which would have considerably shortened and lightened some unnecessarily long examples. Also, a separate and more detailed index of R commands and functions would be helpful.
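For readers unfamiliar with the separator, the following small sketch (my own, with illustrative variable names) shows the same three statements written one per line and then on a single line.

```r
# One statement per line.
a <- 1
b <- 2
sum_ab <- a + b

# The same kind of sequence on one line, using ; as the command separator.
p <- 3; q <- 4; sum_pq <- p + q

print(c(sum_ab, sum_pq))
```

Whether the one-line form is more readable is of course a matter of taste, but it does save vertical space in long listings.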

Finally, a minor but curious point about the assignment operator. I find the issue of <- vs. = particularly fascinating and a bit perturbing, since it leaves an ambiguity in the definition of such a fundamental operator. Still, there seem to be two main camps and no general agreement. Reading various blogs and discussion forums, I found no decisive or robust argument in favor of either. Matloff approaches the issue of <- vs. = in assignments as early as page 4. As he says, “The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not work in some special situations.” I was really eager to see these “special situations” shown in concrete examples. Unfortunately, they are nowhere to be found in the book.
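For what it is worth, one well-known situation where the two operators behave differently is inside a function call, where = names an argument instead of assigning. The sketch below is my own illustration, with a hypothetical wrapper function demo_assign; I cannot say whether these are the cases Matloff had in mind.

```r
demo_assign <- function() {
  # Inside a call, '=' binds the argument named x; no variable x is created here.
  m1 <- mean(x = c(1, 2, 3))
  created_by_equals <- exists("x", envir = environment(), inherits = FALSE)

  # Inside a call, '<-' creates x in the calling environment and passes its value.
  m2 <- mean(x <- c(4, 5, 6))
  created_by_arrow <- exists("x", envir = environment(), inherits = FALSE)

  list(m1 = m1, m2 = m2,
       created_by_equals = created_by_equals,
       created_by_arrow = created_by_arrow)
}
res <- demo_assign()
print(res)

# When the name on the left of '=' matches no formal argument, the call fails:
# system.time() has no argument called 'total', so this raises an error
# rather than assigning anything.
err <- tryCatch(system.time(total = sum(1:10)), error = function(e) e)
print(inherits(err, "error"))
```

At the top level of a script the two operators do the same job; the difference only shows up in contexts such as the argument list of a call.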

Notwithstanding these minor flaws, The Art of R Programming is enriching, enjoyable and definitely worth keeping as a reference while working with R. I highly recommend it to programmers, academic researchers and students in computational statistics who want to be quickly operational in writing R software. And it is undoubtedly a really useful read for any R user.


Filed under: Books, R, Statistics, University life Tagged: C, Norman Matloff, programming, R, software
