“Programming with Data – a Guide to the S Language” by John Chambers

June 5, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)




Another unashamedly subjective book review - but first a little more about my history as it relates to R...

A few years ago, I was doing some systems administration work at an industrial company for which IT was a cost center. I wanted to illustrate to senior management that the growth of data was significant enough that the existing hardware was going to have to be replaced. I anticipated that their response would be:

What is the current resource usage?
What is the rate of increase?
When do we have to act?

To make a long story short, I created a script that printed resource usage each day to a file. Knowing that long lists of numbers of a non-financial nature often make the eyes of management glaze over, I looked at a few charting solutions. Although I could have used Excel, I preferred open source solutions, and the server in question was running Linux. I discovered R at this time and quickly created a compelling chart.
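A minimal sketch of how the three questions above could be answered in R. This is not my original script: the data frame, the column names `date` and `used_gb`, the values, and the `capacity_gb` figure are all made up for illustration.

```r
# Hypothetical daily log: one row per day, usage in GB.
# The values are synthetic, not data from the original server.
usage <- data.frame(
  date    = as.Date("2010-01-01") + 0:29,
  used_gb = 400 + 2 * (0:29)
)
capacity_gb <- 500  # assumed total capacity

# What is the current resource usage? The most recent reading.
current <- tail(usage$used_gb, 1)

# What is the rate of increase? The slope of a linear fit, in GB per day.
fit  <- lm(used_gb ~ as.numeric(date), data = usage)
rate <- unname(coef(fit)[2])

# When do we have to act? Days until the trend reaches capacity.
days_left <- (capacity_gb - current) / rate

# The chart: usage over time with the fitted trend as a dashed line.
plot(usage$date, usage$used_gb, type = "l",
     xlab = "Date", ylab = "Disk used (GB)")
abline(fit, lty = 2)
```

A straight-line fit is the simplest possible choice here; real usage data may well call for something else.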

After mentioning this to my brother (who is also involved in software development), he sent me a copy of "Programming with Data - a Guide to the S Language" by John Chambers for my birthday. John Chambers is the creator of the S programming language and is a member of the board of the R Foundation. A guide to the language written by the author of the language seemed to be a good place to start. And in some ways this is true.

Programming with Data is not a book about statistics at all. It is about the design and use of the S (and by implication R) programming language. As such, it is a fascinating source of insight about the language from its creator. Insights into how the language was used early on (e.g. visualizing wafer production quality) and how S compares with other languages provide a historical context that can help the reader understand the strengths of the language and its intended purpose.

Some books are made to be understood through running the example code. This book is not one of them; in fact, doing so would be somewhat frustrating. Code snippets are often not presented in the context of a complete script, and the data used in the examples is often not provided. The value of the code snippets is that they give an idea of the functionality of given R/S commands, which can be investigated later.

There are a number of terms that are used in a manner that I found to be rather confusing. The first Object Oriented (OO) programming language that I used extensively in a professional setting was Java. As such, there are specific reasons that I expect an OO language to be considered useful (encapsulation, inheritance, polymorphism, code reuse, etc.). In a number of places in the book, the fact that "everything in R is an object" is presented as if it were self-evident why this is good. The reason an object is "good" in Java is different from why it is "good" in R. Basically, in R, the good thing about everything being an object is that a data set of an arbitrary structure can be encapsulated in a single referenceable entity. You don't end up with a bunch of scattered primitives at the end of a calculation, but with a collection whose structure makes sense and that you can interrogate and manipulate in a meaningful way.
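To make that concrete, here is a small sketch in present-day R. It is not an example from the book: it uses `lm()` and the `cars` data set that ships with R, and the `result` list with its field names is made up for illustration.

```r
# Fit a linear model on the built-in cars data (speed vs. stopping distance).
fit <- lm(dist ~ speed, data = cars)

# The whole result lives in one object rather than in scattered values.
class(fit)            # the kind of object returned by lm()
names(fit)            # the components stored inside it
coef(fit)             # ask the object for its coefficients
head(residuals(fit))  # ... or for its residuals

# A plain list can likewise bundle values of different shapes under one name.
# (The field names here are hypothetical.)
result <- list(label = "run 1", values = c(1.5, 2.5), table = head(cars, 3))
str(result)           # inspect the structure of the bundle
```

One name, `fit`, carries the coefficients, residuals, fitted values and more, and functions such as `coef()` and `str()` are the means of interrogating it.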

Other terms were also used in ways I did not expect. The term database generally makes my mind jump to something akin to a relational database. But in the book, database is used to describe what I think of as a "session", based upon my experience with other interactive programming environments (e.g. Ruby irb sessions), web applications (a session being one instance of a user's work between log on and log off), and databases (seen in Oracle's v$session table).
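As a rough illustration, and assuming that the closest counterpart in current R is the workspace plus the search path (my mapping, not the book's wording), the following shows where objects created in a session live and where R looks for them. The object name `x` is arbitrary.

```r
# Create an object; it is stored in the current workspace.
x <- 1:10

# List the objects held in the workspace.
ls()

# List the places R searches for objects, starting with the workspace
# (shown as ".GlobalEnv") followed by the attached packages.
search()

# Check whether an object with a given name can be found.
exists("x")
```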

The perceived semantic ambiguity suggests something about R that I think needs to be kept in mind when communicating with the wider programming community. There is a noticeable divide between the software development and scientific/statistical communities. While both communities use many of the same tools (Linux, vi, emacs, Python), there are different "traditions" that have shaped our thinking and modes of expression. This fact needs to be kept in mind when communicating about R and its capabilities.

This book was written over a decade ago. Again, it is great for getting a sense of the historical place of S and R. Because it is by John Chambers, a central figure in the S and R community, it is invaluable for getting a sense of how he designed and intended the language to be used. The challenges I had understanding the terms in the book (although somewhat bothersome at the time) are actually very helpful. They caused me to realize the differences in background of those who use R and to understand its development as somewhat independent of other programming languages. As R continues to grow, these differences need to be clearly understood to foster clear communication going forward.

