It seems that the title “data science” has taken the world by storm. It’s a title that conjures up almost mystical abilities of a person garnering information from oceans of data with ease. It’s where a data scientist can wave his or her hand like a Jedi Knight and simply tell the data what it should be.
What is interesting about the field of data science is its perceived (possibly real) threat to other fields, namely statistics. It seems to me that the two fields are distinct areas. Though the two can exist separately, each is weak without the other. Hilary Mason (of Bitly) shares her definition of a data scientist, and I suppose my definition differs from hers. Statisticians need to understand the science and structure of data, and data scientists need to understand statistics. Larry Wasserman over at the Normal Deviate blog shares his thoughts on statistics and data science. There are other blogs, but these two are probably sufficient.
Data science is emerging as a field of absolutes, and that is something the general public can wrap their heads around. It’s no wonder that statisticians are feeling threatened by data scientists. Here are two (albeit extreme) examples:
If a statistician presents an estimate to a journalist and says, “Here is the point estimate of the number of people listening to a given radio station, and the margin of error is +/- 3% at a 90% confidence level,” there is almost always a follow-up discussion about the margin of error, how the standard error was calculated (simple random, stratified, or cluster sampling), and why it is a 90% confidence interval rather than a 95% confidence interval. Then someone is bound to ask what a confidence interval is anyway. Extend this even further and suppose the statistician gives the journalist a p-value. Now there is an argument between statisticians about hypothesis testing, and the terms “frequentist” and “Bayesian” start getting thrown around.
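To give a rough sense of what sits behind that “+/- 3%”, here is a minimal Python sketch. It assumes simple random sampling, the normal approximation for a proportion, and the worst-case proportion of 0.5; the function names `margin_of_error` and `sample_size_for_margin` are my own and purely illustrative. A stratified or cluster design would need a different standard error, which is exactly the kind of follow-up question the journalist ends up asking.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.90):
    """Normal-approximation margin of error for a proportion.

    Assumes simple random sampling; other designs need a different
    standard error.
    """
    # Two-sided critical value, e.g. about 1.645 for 90% and 1.960 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


def sample_size_for_margin(margin, confidence=0.90, p_hat=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)


if __name__ == "__main__":
    for level in (0.90, 0.95):
        n = sample_size_for_margin(0.03, confidence=level)
        print(f"{level:.0%} confidence, +/- 3%: n = {n}")
```

Under these assumptions, the same +/- 3% takes roughly 750 respondents at 90% confidence and a little over 1,000 at 95%, which is one practical reason a survey might report the lower level.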
It’s no wonder that people don’t want to work with statisticians. Not only are they confusing to the general public, but they can’t even agree among themselves (even if it’s a friendly disagreement) on what is correct. Now take the following data scientist example:
A data scientist looks through a small file of 50 billion records where people have listened to songs through a registration-based online radio station (e.g. Spotify, Pandora, TuneIn, etc.). This data scientist then merges and matches the records to a handful of public data sources to give the dataset a dimensionality of 10000. The data scientist then simply reports that there are X listeners in a given metro area listening for Y amount of time, and produces a great SVG graph that can be dynamically updated each week with the click of a button on a website. It is a fairly simple task, and just about everyone can understand what it means.
I feel that there will always be a need for a solid foundation in statistics. There will always exist natural variation that must be measured and accounted for. There will always be data that is so expensive that only a limited number of observations can feasibly be collected. Or suppose that a certain set of data is so difficult to obtain that only a handful of observations can be collected at all. I would conjecture that a data scientist would not have a clue what to do with that data without help from someone with a background in statistics. At the same time, if a statistician was told that there is a 50 billion by 10000 dimension dataset sitting on a Hadoop cluster, then I would also guess that many statisticians would be hard pressed to set the data up for analysis without consulting a data scientist. But a data scientist would probably also struggle if asked to take those 10000 dimensions and reduce them to a digestible and understandable set.
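As one hedged illustration of what “reducing dimensions” can mean, here is a minimal sketch of principal component analysis (PCA) using NumPy. PCA is only one of many techniques and is my choice for the example, not something the scenario above prescribes. The data are synthetic and made up for the demonstration (200 rows, 50 columns driven by 3 hidden factors), and the helper name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, k):
    """Project X onto its first k principal components.

    Returns the reduced data and the share of total variance
    that each kept component explains.
    """
    Xc = X - X.mean(axis=0)  # center each column
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return Xc @ Vt[:k].T, explained[:k]


# Synthetic data (an assumption for illustration only):
# 50 observed columns generated from 3 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

scores, explained = pca_reduce(X, 3)
print(scores.shape)      # reduced dataset: 200 rows, 3 columns
print(explained.sum())   # share of variance kept by the 3 components
```

The mechanics are short; deciding how many components to keep, and whether they mean anything to a human, is where the statistical judgment comes in.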
Take another example: genetic sequencing. A data scientist could work the data and discover that in one sequence there is something different. Then a domain expert can come in and find that the mutation is in the BRCA1 gene and that the BRCA1 gene relates to breast cancer. A statistician can then be consulted to estimate the probability that the patient will ultimately get breast cancer and the extent to which the particular mutation increases the risk of mortality.
Ultimately, the way I see it, the two disciplines need to come together and become one. I see no reason why it can’t be part of the curriculum in statistics departments to teach students how to work with real-world data. Those working in the data science and statistics fields need to have the statistical training while also having the ability to work with data regardless of its location, format, or size.