So you want to be a Data Science superstar

[This article was first published on R – Robert Grant's stats blog, and kindly contributed to R-bloggers.]

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts.

Last summer, Mrs G and I were in that ridiculously long line for the cablecar in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business including a good dose of quantitative research and data analysis. So, we’ve stayed in touch on Twitter. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.

‘Data science’ is a term widely used in business and more computing-oriented circles, while it is not always recognised in slow-moving academia, where ‘statistics’ still holds sway. They are not the same thing. DS is a mix of skills to manipulate, analyse and interpret data, drawn from statistics, computer science and machine learning. It’s hard to be world-class at all of those, but there are probably a few really irritating people like that out there. To be autonomous and not get ripped off as a freelancer or entrepreneur, you should also know how to construct and work with databases and websites, and be able to make some data visualisations. It is probably sensible to devote little, if any, energy to Big Data. I mean, just watch a few YouTube videos about Spark and you’ll be OK.

If you want to study statistics, the route to take and resources to use are well mapped-out, but DS is not so clear. And remember that DS is only one step away from BS; there are plenty of websites promising a lot and providing little. Many of the ‘great resources’ you find online turn out to be vacuous efforts to separate you from your do$h, blatant self-promotion, or just badly-explained home-made videos. I thought it would be a nice opportunity to elicit some opinions from people I respect, even if we all end up disagreeing. So, I sent the following to anyone I thought would have an interesting view on this, including, as far as possible, people outside the classical statistics world:

Colleagues & friends,
I am writing a blog post and would love it if you would contribute just a few sentences of your views. I was asked recently what the best single resource is for teaching oneself data science (which I take to be a crossover between computer science / programming skills, classical statistics and machine learning). I am really not sure what the answer is, but I think it is a really important one and worth airing some different views. People trained initially in statistics, like me, are often negative about the concept of data science, but I think this is a mistake and we stand to up our game and learn a lot of cool tricks along the way.
It could be an online course, software to play around with, a book or anything else.
For my suggestion, I am going to lay claim to Hastie, Tibshirani & Friedman’s book “The Elements of Statistical Learning” [EoSL], combined with googling “[whatever you need] in r” and then playing around in R late into the night when you really have other things you should be getting on with.

Why specifically R? Because it has by far the biggest library of packages tackling everything from statistics to machine learning to interfacing with databases to text analysis to you name it. And it’s free.

Let’s start the replies with Bob Carpenter (Columbia), who was not a fan of ‘EoSL’:

I didn’t like Hastie et al.’s book, because I found it nearly impossible to understand from first principles. Now I find it trivially easy, of course, which is probably why they didn’t understand how hard it would be for beginners. More seriously, I would shy away from recommending a pure frequentist approach and recommend something more Bayesian.

On that Bayesian point, I have looked a bit at ‘Bayesian Reasoning and Machine Learning’ and like the look of it. I haven’t read it thoroughly, though, and I think it would make a better second or third textbook than a first. Bob continued:

For computer scientists getting into stats, I’d recommend Gelman and Hill’s book on multilevel regression. It’s too high level to teach you basic stats and probabilities, but it’s an awesome tutorial on modeling. I liked Bishop’s book [“Pattern Recognition and Machine Learning”] much better than EoSL — but then it’s more algorithm focused and gives a decent intro to probability theory. I’m a computer scientist. But it’s rather incoherent in covering so many different things that aren’t probabilistic (perceptrons, SVMs, etc.)

Well, as I see it, the mixture of probabilistic algorithms and heuristic non-probabilistic ones (particularly around unsupervised learning) is an interesting characteristic of data-science-as-useful-though-incoherent-mashup. And while we’re on the subject of tutorials in modeling, let’s not forget good old Cox & Snell, whose book is still unique and fresh in its over-the-statistician’s-shoulder view of real analysis in action, complexities, compromises and all. Mike Betancourt (Warwick), who, like Bob, came to statistics after training in another field, also came down in favour of Bishop:

Firstly I should note that I hate Elements of Statistical Learning. It’s a cookbook with lots of technical results that apply in unrealistic settings and little intuition that helps in practice. I much prefer Bishop who motivates each algorithm from a generative perspective and then ties that perspective into the examples.

Personally, I looked at both books when I wanted to learn about ML, and chose against Bishop, perhaps because unlike these two, my first degree was in math. Laurent Gatto (Cambridge) suggested some online learning:

I enjoyed the Statistical Learning Stanford Online course [1] and book [2] from the same authors you mentioned. Although I haven’t taken the course myself, I think the set of Data Science Coursera courses from Roger Peng et al. from Johns Hopkins [3] is probably quite good.
[1] http://online.stanford.edu/course/statistical-learning-winter-2014
[2] http://www-bcf.usc.edu/~gareth/ISL/
[3] https://www.coursera.org/specializations/jhu-data-science

You can always spot a true academic by the way they use proper referencing in emails. Or SMS, or Twitter…

The next theme that I got was in favour of getting your hands dirty with real data (which is the sort of thing I had in mind for tinkering late at night when you really should be doing something else). Here’s Laurent again:

I think the most crucial factor to teaching oneself data science (or programming) is a practical use case to guide the student. It’s so easy to get started with a nice resource or book and then get carried away by everyday business. I think a simple enough, yet non-trivial problem to tackle is really helpful to ground the study material in one’s real-life applications.

I think they are absolutely right that just-in-time self-taught programming for a real task and a deadline is very fast and effective. The trick is then keeping up the practice afterwards and polishing the rough edges of programming. And programming in particular is an added layer of difficulty for the novice data scientist (unless you still believe you can get by pointing and clicking in various IBM products which we do not mention on this blog). As statistician Rebecca Killick (Lancaster) put it:

My research is more and more on the borderline between classical statistics and machine learning for which I need good programming skills. I wouldn’t call myself a data scientist but many of my more theoretical colleagues probably consider me to be one. I would contribute the following book: “Machine Learning: An Algorithmic Perspective” by Stephen Marsland, again with the relevant googling of how to do things practically in R (the book gives Python examples). I also learnt much of my Python knowledge from the Appendix (and googling).

Ah yes, Python. That is also very popular in data science circles, probably more among people approaching from a web/computer science angle than a stats angle, and I’ve not got enough brain space to absorb another language, but there’s no denying its popularity, flexibility and power. It’s doubtless faster than R in most settings (though perhaps not judicious use of Rcpp, the ‘seamless’ interface between high-level R and low-level C++, which is my power tool of choice). Here’s Bob Carpenter again:

If you want one recommendation from me for statisticians getting into software, it’s Hunt and Thomas’s book, The Pragmatic Programmer. It’s too high level to teach you to program, but it’s an awesome tutorial on being a solid developer and managing projects of all scales. For domain scientists getting into both, I’d recommend both this and Bishop.

And economist Nick Latimer (Sheffield):

I’ve never really been taught much statistics (aside from little bits on economics courses) or programming. Hence I am not very good at either. However, in teaching myself how to do these things the most useful thing for me was googling Stata error messages and messing around with datasets and code until I got it to do what I wanted it to do (much as you say you did with R). Seeing the code written by other people is also very useful, mainly to show you the many different ways (usually more efficient than mine) to do the same thing.

Likewise biochemist Jon Houseley (Cambridge):

My experience with R books has not been fruitful, and I am also of the Googling “how do I do XXXX in r” school. Most texts on R seem to require more statistical and/or programming knowledge than I possess. However, our bioinformatics unit runs a series of courses for biologists needing to perform basic data analysis in R – the course materials provide step-by-step guides for simple tasks and are freely available here: http://www.bioinformatics.babraham.ac.uk/training.html

Here’s Rasmus Bååth (Lund), a statistician whose hobbies include hard programming challenges like recreating Bayesian software inside a website (for fun):

For the programming part of data science it’s relatively straightforward: there are tons of great blogs (where R-bloggers is the main pusher, http://www.r-bloggers.com/) and great tutorials (if you are completely new to R http://tryr.codeschool.com/ is one of the best!). For the stats part I found it much more difficult to find good resources online, and you’ll easily find lots of conflicting advice (p-value based statistics vs. Bayes comes to mind…). For visualization Cleveland’s old book is a gold mine (http://amzn.com/0963488406 ), and the ggplot2 book (https://github.com/hadley/ggplot2-book) and cookbook (http://www.cookbook-r.com/Graphs/) show you how to do it in practice. A great source for (practical) statistical theory is also Richard McElreath’s video lectures (https://youtu.be/WFv2vS8ESkk) and upcoming book (http://bit.ly/1NfLlsN).

Bob Carpenter pointed me to a blog post by Peter Norvig (head of research at Google): http://norvig.com/21-days.html. One quote from it that I’m going to throw into the mix here is about taking time and treating it as a serious life-changing challenge.

The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts: even Mozart, who was a musical prodigy at age 4, took 13 more years before he began to produce world-class music.

I really recommend reading this post as it has a lot of wise advice in there, and even if you (dear reader) don’t believe me when I tell you unpalatable facts about learning, you might take it from Norvig!

And here’s Con Ariti (LSHTM, ex-CapitalOne):

I recommend the ‘little’ statistical learning book with the R labs. There is also a good example book by O’Reilly publishing I think is called “Doing data science” that has some examples and is based on a course at NYU. It is good for showing how DS is done in the real world and how much could be learnt from statistics!

Con’s ‘little’ book was Laurent Gatto’s reference 2. This is Hastie & friends’ shorter and less theoretical book ‘Introduction to Statistical Learning’ – I like them both, but don’t imagine that by reading the little one you’ll escape the algebra.

Now for a word from medical statistician Charles Opondo (Oxford):

“Best single resource” – the internet! I think the best way is to start with a personal/work/task related problem that one understands well, and by understanding the complexities and limitations of available tools and solutions then one can begin to understand the subject. I think the internet as a whole is the ‘best single source’ because good books, courses and online resources are always replaced with the next best thing, and there’s always bound to be that single source that does one, just one thing, exceptionally better than any book or course ever would.

to which I replied:

Would you advise a beginner to play rather than agonise over the theoretical foundations then?

and he said:

Absolutely – one sometimes finds, upon deeper exploration, that there is no consensus or clarity on some aspects of foundation, and that it is enough to work with methods and approaches as currently understood (talking for myself and my recent exploration of causal inference).

and I couldn’t resist:

Hmmm yes! Especially frustrating for the novice because the writings of the professors give every impression of being unquestionably the final word on the subject.

Finally, there was something of a defence of statistics. Now, I don’t imagine DS is the new stats, or that stats has had its day, but Royal Statistical Society president Peter Diggle (Lancaster, ex-CSIRO) wrote on “Statistics: a data science for the 21st century” as his presidential address, and noted that stats has some crucially important stuff to offer DS:

we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty in our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

So, in conclusion, it seems there is no silver bullet but rather a selection of different approaches when people offer up materials for learning these skills. Regular readers will know I’m a fan of the American Statistical Association’s GAISE guidelines for teaching stats in a modern, evidence-based way. But even that did not foresee the approach of DS. Basically, if t-tests get mentioned in the first third of any course, video, book or website, then you are looking at a reheated statistics course. The old Snedecor 1930s syllabus just doesn’t work, because so many of the ideas it leaves you with are not going to be priorities in a DS application. How do we tackle that then, to teach statistics rigorously but leave graduates able to flex across into machine learning and programming? Here’s Peter Diggle:

Given a solid mathematical foundation, my suggested list of topics for a Master of Science degree in statistics is
(a) design,
(b) probability and stochastic processes,
(c) likelihood-based inference,
(d) computation, including numerical methods and programming,
(e) communication, including scientific writing for both technical and lay audiences, and
(f) scientific method, and the foundations of at least one substantive area of application.

I love point (e) but I am going to be more radical than Peter. I think Bayes has to be there from the start, along with getting everyone to practise proposing and justifying data-generating processes all the time, not just shoe-horned into (d). Then the exploratory stats and graphics, and any model, works to test and refine that a priori process (and not in a p<0.05 mechanistic way). Needless to say, there is no ideal material for this, but you may note that I own a web domain called bayescamp.com – now what should I do with that, do you suppose?
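If ‘proposing a data-generating process’ sounds abstract, here is a minimal sketch of what I mean, written in Python using only the standard library (the same thing is a few lines in R). Everything in it is an assumption made up for illustration: the coffee-and-reaction-time story, the numbers in the priors, and the function names `simulate_study` and `prior_predictive_means`. The idea is simply to write down how you think the data arise, simulate fake studies from that story before seeing any real data, and ask whether the results look remotely plausible.

```python
import random
import statistics


def simulate_study(n_people, rng):
    """Simulate one dataset from a proposed data-generating process.

    Hypothetical process, invented for illustration: reaction time (ms)
    is a baseline of 300 ms, shifted by some amount per cup of coffee,
    plus person-to-person noise.
    """
    # Priors: what we believe before seeing any data (all assumptions).
    effect_per_cup = rng.gauss(-10.0, 5.0)       # ms per cup
    noise_sd = abs(rng.gauss(0.0, 30.0)) + 1.0   # ms, kept positive
    rows = []
    for _ in range(n_people):
        cups = rng.randint(0, 4)                 # 0 to 4 cups, inclusive
        reaction = 300.0 + effect_per_cup * cups + rng.gauss(0.0, noise_sd)
        rows.append((cups, reaction))
    return rows


def prior_predictive_means(n_sims, n_people, seed=1):
    """Return the mean reaction time from each of n_sims simulated studies."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return [
        statistics.mean(reaction for _, reaction in simulate_study(n_people, rng))
        for _ in range(n_sims)
    ]


if __name__ == "__main__":
    means = prior_predictive_means(n_sims=200, n_people=50)
    # If this range looks absurd for human reaction times, the proposed
    # process (or its priors) needs rethinking before any data arrive.
    print(f"simulated study means range from {min(means):.0f} to {max(means):.0f} ms")
```

If the simulated means wander into territory no human could produce, that is a cue to revise the story or the priors, and you have learnt something before fitting a single model.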

A final point from me is about mathematics. Nobody raised this (except Peter Diggle in the context of a university degree course), but if you don't have confident reading and writing skills in the highly condensed and abstracted language we call algebra (including matrix algebra), you will find it hard to absorb some ideas. Pseudocode can get you some of the way, but it is probably worth setting aside a few weeks to brush up your math. Gentle's book 'Matrix Algebra' is ideal for this, I think. You will need to carve out and defend serious chunks of uninterrupted, undistracted time. The process will hurt, let's not kid ourselves, but it will pay you back for the rest of your life.

