
Getting started in applied statistics / datascience

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers].

Where to start to start?

I was recently asked by a fellow manager from another organisation what direction they could give to a staff member interested in building skills in the whole “big data” thing. A search of the web shows hundreds if not thousands of sites and blog posts aimed at budding data scientists, but most of them seem (to my admittedly very non-rigorous glance) to be collections of resources and techniques: too detailed and specific for my purposes, and aimed at people already some way into the journey. So here’s something oriented more towards someone who’s still wondering what this thing is that they might be getting into.

Why is data suddenly so sexy?

First, while a lot of the publicity you hear is about “big data”, the real revolution in recent decades is bigger than big data. It’s about data creation, storage, access, analytical techniques and tools:

  1. In the last third of the twentieth century there were big advances in applied statistics, with new methods like robust statistics, bootstrapping, a bigger range of graphics, and mixed effects and additive models to deal with a bunch of situations beyond the crude assumptions needed for the previous generations of techniques (the world of ANOVA, linear regression and t statistics which unfortunately is still the impression many people have from those one or two mandatory stats papers);
  2. Roughly overlapping with that and still ongoing, the rise in computing power has made those methods practicable and cheap;
  3. Then, the last 10 years have seen an explosion of data capture and storage, as our digital traces are increasingly being logged somewhere and more and more of our lives create those traces.

A lot of the new data is web-related, but the overall cumulative impact of those three things above is not necessarily just enormous piles of Twitter and Facebook data. So I’d be careful not to focus on “big data” from the start but instead build a core of statistics and computing skills which can then be applied to larger data. The techniques actually specific to big data are relatively few compared to the core skillset, and should be fairly easy to learn for someone with a solid grounding.
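To make the first two points in the list above a bit more concrete, here is a minimal sketch of bootstrapping in base R. It swaps a distributional assumption for brute-force resampling, which is only practical because computing is now cheap. The data are simulated, and the seed, sample size and number of resamples are arbitrary choices made for illustration.

```r
# Bootstrap a 95% percentile interval for the median of a skewed sample.
# Everything here is illustrative: simulated data, arbitrary settings.
set.seed(123)                      # arbitrary seed so the run is repeatable
x <- rexp(50, rate = 1 / 10)       # hypothetical skewed sample of 50 values

n_reps <- 2000                     # number of bootstrap resamples (assumed)

# Resample the data with replacement and recompute the median each time
boot_medians <- replicate(n_reps, median(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the resampled medians
ci <- quantile(boot_medians, c(0.025, 0.975))

print(median(x))
print(ci)
```

There is no formula for the standard error of a median in sight; the computer does the work, and the same few lines apply to almost any statistic you care to put in place of `median`.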

What’s it take to learn?

Second, I would emphasise that this stuff is hard and there’s lots of it to learn. It can’t be learnt by a few two-day courses and a brief apprenticeship, although both those things can help. To be successful you need specialist tertiary education or its equivalent, plus a commitment to continuous creative destruction of your knowledge and skills and to lifelong learning. My team for example comprises mostly people with quantitative PhDs or Masters degrees, and we have two training sessions per week at which everyone is continually learning new stuff (and teaching it to the others). We start each weekly team meeting reporting back one thing each of us learnt in the last week; often it’s some tool or technique that didn’t even exist six months ago.

My thinking is heavily influenced by Drew Conway’s data science Venn diagram, a modified version of which is below (at the bottom of this post is the R code that drew this):

Basically, a good applied statistician or datascientist (I’m not going to argue about language here) needs to combine computing, statistical and content knowledge skills. It’s the growth in computing power that’s changing capabilities in the field, but knowing stuff and techniques is important too.

However, as we’ve only got one lifetime each, developing specialist knowledge in particular domain areas is expensive. My advice on the content knowledge circle of the Venn diagram is to get good at quickly understanding issues and questions that others can bring to you, rather than try to be a domain specialist. This could be controversial; for example, I was once criticised by statisticians and others for recruiting team members on the basis of statistical and data management skills rather than domain knowledge in XXX. The reality is, we work with others who are the specialists in XXX and its policy problems, but need help in the data area. I look for people with data skills (or potential skills) who can quickly build up familiarity with the domain, rather than limit the range of an already difficult job search.

Getting started on statistical computing

That leaves hacking and statistics. In my thinking I break this down into four pragmatic areas where skills need to be developed. I say pragmatic because I don’t have some theory dictating why these four areas; it’s more that when we’re planning training or other skills development, the topics seem to fall into these categories:

This series of Johns Hopkins Coursera online courses has had good recommendations and covers the full range of things, using up-to-date tools. It’s a commitment, but the fact is there’s a lot to learn. I’d suggest at some point early in the journey getting enrolled in that or a similar course to see if you’ve got the stomach for it. If you haven’t written computer code before, for example, there’s probably a particular psychological hurdle to overcome before you decide this is for you (and you can’t handle data properly without doing it in code).

The “range of things” as I would see them (which is pretty much similar to the curriculum of that course linked above) would be:

Statistics

Learning statistics properly takes effort, and mathematics, and lots of time in front of a computer practising. One problem is that a lot of the statistics taught at university in non-statistics degrees covers techniques rather than principles, and often dated ones at that. To get an idea of what you’re getting into:

One day I’ll do a more extended post on other books-I-love dealing with topics like modelling strategies, time series, surveys, etc.

A computing tool for statistics

A choice needs to be made for a computer language in which to start learning. If you spend more than an hour thinking about R v SAS, R v Python, or R v Julia you’re wasting your time because the reality is if you’re going to get any good at this, you need to be multilingual. However, you have to start somewhere and my recommendation is for R. It’s free, easy to get, forms the lingua franca in the academy, and its open source approach means new techniques get operationalised in it quicker than in SAS, as do bindings to other languages like JavaScript (pretty much essential for fancy modern data presentation). Ideally you learn R and statistics together – R is a computer language written by statisticians for statisticians, so it helps you fall into a statistical way of thinking.
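As a small taste of what “a language written by statisticians for statisticians” looks like in practice, here is a minimal sketch using only base R and the built-in `cars` dataset (stopping distance against speed). The choice of dataset and model is mine, purely for illustration.

```r
# Fit a simple linear regression with base R and the built-in `cars` data.
# The model formula `dist ~ speed` reads as "distance explained by speed".
fit <- lm(dist ~ speed, data = cars)

# Coefficient estimates, standard errors and overall fit in one call
print(summary(fit))

# 95% confidence intervals for the intercept and slope
print(confint(fit))

# Predict stopping distance at a new speed, with a prediction interval
new_speed <- data.frame(speed = 15)
print(predict(fit, newdata = new_speed, interval = "prediction"))
```

Note how the language’s core objects are things like data frames, formulas and fitted models, with uncertainty (standard errors, intervals) reported by default rather than as an afterthought.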

Down the track you need to get familiar with other more general languages – like HTML and JavaScript for web dissemination, LaTeX for static reports and presentations, and probably a general purpose language like Python for generally doing Stuff to data. You also need to get familiar with the basics of the computer’s operating system and using a shell session to get it to do stuff. But, other than the minimum, that can wait until you’ve broken the ice (I’m assuming people are starting from non-familiarity with coding) with a statistically-oriented language.

Reproducible research, e.g. version control, making things reproducible end to end, etc.
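One small habit in this direction, sketched here in base R under my own assumptions (the seed value and the simulated numbers are arbitrary): fix the random seed so that anything involving randomness gives the same answer on every run, and record the software versions alongside the results.

```r
# Fixing the seed makes "random" results repeatable from run to run.
set.seed(42)            # arbitrary seed value
first_run <- rnorm(5)

set.seed(42)            # same seed again ...
second_run <- rnorm(5)  # ... gives exactly the same draws

print(identical(first_run, second_run))

# Record the R version and loaded packages alongside your results,
# so someone else (or future you) can recreate the environment.
print(sessionInfo())
```

Keeping a script like this under version control, rather than pointing and clicking, is what lets an analysis be rerun end to end.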

Databases, data management, SQL, tidying and cleaning data
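To give a flavour of what tidying and a SQL-style summary involve, here is a minimal base R sketch. The table, its column names and its values are all hypothetical, made up for illustration.

```r
# Hypothetical "wide" table: one row per region, one column per year.
wide <- data.frame(
  region = c("North", "South"),
  y2013  = c(10, 20),
  y2014  = c(12, NA),
  stringsAsFactors = FALSE
)

# Reshape to "long" (tidy) form: one row per region-year observation.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("y2013", "y2014"),
  v.names   = "value",
  timevar   = "year",
  times     = c(2013, 2014),
  idvar     = "region"
)
rownames(long) <- NULL

# Cleaning: drop missing observations explicitly rather than silently.
tidy <- long[!is.na(long$value), ]

# Roughly what SQL would express as:
#   SELECT year, SUM(value) FROM tidy GROUP BY year
totals <- aggregate(value ~ year, data = tidy, FUN = sum)

print(tidy)
print(totals)
```

The long form looks clumsier to a human reader but is what most modelling, graphics and database tools expect, which is why so much time in practice goes into getting data into that shape.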

More

So that’s only a beginning. It’s an exciting area. Hopefully I’ve given some indicators that might be useful for someone out there, wondering if they (or someone else) should get into this stuff, and what it will take.

Drawing that diagram

Finally, here’s the code that drew my own version of the Drew Conway data science diagram. I wanted to tweak his original

library(showtext)      # also loads sysfonts, which provides font_add_google()
library(grid)
library(RColorBrewer)

# Download Poppins from Google Fonts and register it under the family name "my".
# font_add_google() and showtext_auto() are the current names of the functions
# that older package versions called font.add.google() and showtext.auto().
font_add_google("Poppins", "my")
showtext_auto()
palette <- brewer.pal(3, "Set1")

radius <- 0.3
strokecol <- "grey50"
linewidth <- 4
fs <- 11

draw_diagram <- function(){
  grid.newpage()

  # Three overlapping, semi-transparent circles
  grid.circle(0.33, 0.67, radius, gp =
                gpar(col = strokecol,
                     fill = palette[1],
                     alpha = 0.2,
                     lwd = linewidth))

  grid.circle(0.67, 0.67, radius, gp =
                gpar(col = strokecol,
                     fill = palette[2],
                     alpha = 0.2,
                     lwd = linewidth))

  grid.circle(0.5, 0.33, radius, gp =
                gpar(col = strokecol,
                     fill = palette[3],
                     alpha = 0.2,
                     lwd = linewidth))

  # Large labels, one per circle
  grid.text("Hacking", 0.25, 0.75, rot = 45, gp =
              gpar(fontfamily = "my",
                   fontsize = fs * 2.3,
                   col = palette[1],
                   fontface = "bold"))

  grid.text("Statistics", 0.75, 0.75, rot = -45, gp =
              gpar(fontfamily = "my",
                   fontsize = fs * 2.3,
                   col = palette[2],
                   fontface = "bold"))

  grid.text("Content\nknowledge", 0.5, 0.25, rot = 0, gp =
              gpar(fontfamily = "my",
                   fontsize = fs * 2.3,
                   col = palette[3],
                   fontface = "bold"))

  # Smaller labels for the pairwise overlaps
  grid.text("Danger:\nno context", 0.5, 0.75,
            gp = gpar(fontfamily = "my",
                      fontsize = fs))

  grid.text("Danger: no\nunderstanding\nof probability", 0.32, 0.48, rot = 45,
            gp = gpar(fontfamily = "my",
                      fontsize = fs))

  grid.text("Traditional\nresearch", 0.66, 0.46, rot = -45,
            gp = gpar(fontfamily = "my",
                      fontsize = fs))

  # Label for the centre, where all three circles overlap
  grid.text("Data science /\napplied\nstatistics", 0.5, 0.55,
            gp = gpar(fontfamily = "my",
                      fontsize = fs * 1.2,
                      fontface = "bold"))
}

draw_diagram()
