When R is brought up as a possibility for doing statistics or data mining or any sort of predictive analytics among non R users, someone will invariably point out that R has a “steep learning curve”, and the response among those gathered usually includes a significant amount of head nodding. Even those who have put in heroic efforts to help people learn R sometimes say scary things: e.g. in his introduction to Rattle, a Data Mining GUI for R, Graham Williams writes:
R offers a breadth and depth in statistical computing beyond what is available in commercial closed source products. Yet R remains, primarily, a programming language for the highly skilled statistician, and out of the reach of many. (The R Journal Vol. ½, December 2009)
Are things really that bad? Is R that difficult to learn? I think not – unless you take the statement to refer to the absolutely worst case scenario that you can think of: the effort that would be involved in learning R by a person who doesn’t have any background in statistics or data analysis and who has absolutely no interest in learning R. In this context the statement that R has a steep learning curve conveys the same truth as the assertion that Chinese or Italian or Japanese or English or any other language is difficult to learn if you don’t know anything about the people who speak these languages, don’t want to know, and wouldn’t have an opportunity to practice the language anyway. There is some truth to this, but so what? Context is everything. If you have some background in statistics and a desire to learn some more, learning R is not going to be an insurmountable problem. Once you have some version of R installed on your favorite computing platform, any book that provides carefully worked examples of the kinds of statistical analyses that interest you should be all that is required to make you productive. These days, there are dozens or maybe even hundreds of statistics books that use R. Two of my personal favorites are John Fox’s classic An R and S Plus Companion to Applied Regression (Sage Publications, 2002) and Data Analysis and Graphics Using R: An Example-Based Approach (Cambridge Series in Statistical and Probabilistic Mathematics) by John Maindonald and W. John Braun (2010). Books, of course, are only the tip of the iceberg of the resources available for the statistical cognoscenti to learn R. Have a look at the links on Inside-R as places to start on the web.
This discussion does raise the issue of what one means by learning a language. Knowing how to run the R scripts required to get some simple analysis done may not entitle you to say that you know R (significantly more is required to become an R developer) anymore than understanding enough Italian to get around Rome while eating well entitles you to say that you know Italian – but again, so what? It is possible to do some pretty impressive statistics using R without any deeper understanding of the R language than what is required to run the applicable models. On the other hand, if you don’t know any statistics and you don’t want to know more than you have to, then that might be a big problem. But, it’s the statistics part that has the steep learning curve, not R. Maybe this is the point of William’s comment: R remains the language of highly skilled statisticians because it is the language most capable of expressing what highly trained statisticians think about, and R is out of the reach of the many who just don’t care much about statistics. But, if you are reading this post this many probably doesn’t include you.
So suppose you are not a statistician but belong to some other data analysis culture, data mining, for example, and you want to learn R. Well, like any of the world’s great natural languages R approachable from many different starting points. The entry point may be different but the path to knowledge is going to be similar. Find something you care about and start formulating simple sentences. There are fewer R language guides available for people who have some data mining skills, but that is changing. Seni and Elder have recently published a very nice little book on ensemble methods: Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery) that I highly recommend. And, of course the masters Hastie, Tibshirani and Freidman authors of the classic text, Elements of Statistical Learning, have made both the text and much of the their code available. Moreover, if you are a data miner with a background in computer science you can probably code rings around the highly trained statisticians. If so, you may find the books by Chambers, Software for Data Analysis and Gentlemen, R Programming for Bioinformatics of great value.
Finally, to be perfectly fair, I should acknowledge Graham Williams’ quote in the context in which he wrote it. Like any other language, R is much easier to learn when you have access to the best learning tools, CDs, dictionaries, grammars etc. Rattle is one such high value learning tool and so is Revolution’s R Productivity Environment. But, more about these on another day.