Professor Hans Rosling certainly is a remarkable figure. I recommend watching his performances; the BBC’s “Joy of Stats” is especially exemplary. Rosling sells passion for data, visual clarity and a great deal of comedy. He represents the data-driven paradigm in science. What is it? And is it as exciting and promising as the documentary suggests?
Data-driven scientists (data miners) such as Rosling believe that data can tell a story, that observation equals information, and that the best way towards scientific progress is to collect data, visualize them and analyze them (data miners are not specific about what "analyze" means exactly). When you listen to Rosling carefully, he sometimes treats data as equivalent to statistics: a scientist collects statistics. He also claims that “if we can uncover the patterns in the data then we can understand”. I know this attitude: there are massive initiatives to mobilize and integrate data, there are methods for data assimilation and data mining, and there is an enormous field of scientific data visualization. Data-driven scientists sometimes call themselves informaticians or data scientists. And they are all excited about big data: the larger the number of observations (N), the better.
Rosling is right that data are important and that science uses statistics to deal with the data. But he completely ignores the second component of statistics: the hypothesis (here equivalent to model or theory). There are two ways to define statistics, and both require data as well as hypotheses: (1) Frequentist statistics makes probabilistic statements about the data, given the hypothesis. (2) Bayesian statistics works the other way round: it makes probabilistic statements about the hypothesis, given the data. Frequentist statistics prevailed as the major discourse because it used to be computationally simpler. However, it is also less consistent with the way we think: we are nearly always ultimately curious about the Bayesian probability of the hypothesis (i.e. “how probable is it that things work a certain way, given what we see”) rather than the frequentist probability of the data (i.e. “how likely is it that we would see this if we repeated the experiment again and again and again”).
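To make the two directions concrete, here is a minimal sketch in Python. The coin-tossing setup, the two candidate hypotheses, the equal priors and all function names are my own illustrative assumptions, not anything taken from Rosling or the documentary. The frequentist function computes a probability of data given one fixed hypothesis; the Bayesian function computes probabilities of hypotheses given the observed data.

```python
from math import comb

def binomial_likelihood(k, n, p):
    """P(exactly k heads in n tosses | heads probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def tail_probability(k, n, p):
    """Frequentist direction: P(at least k heads in n tosses | p).

    A statement about data, with the hypothesis held fixed.
    """
    return sum(binomial_likelihood(i, n, p) for i in range(k, n + 1))

def posterior(k, n, hypotheses, priors):
    """Bayesian direction: P(hypothesis | k heads in n tosses) via Bayes' rule.

    A statement about each candidate hypothesis, with the data held fixed.
    """
    weighted = [binomial_likelihood(k, n, p) * prior
                for p, prior in zip(hypotheses, priors)]
    evidence = sum(weighted)
    return [w / evidence for w in weighted]

# Hypothetical example: 8 heads in 10 tosses.
heads, tosses = 8, 10
candidates = [0.5, 0.8]   # assumed hypotheses: fair coin vs. biased coin
priors = [0.5, 0.5]       # assumed equal prior belief in both

tail = tail_probability(heads, tosses, 0.5)
post = posterior(heads, tosses, candidates, priors)

print(f"P(>= {heads} heads | fair coin) = {tail:.4f}")
for p, prob in zip(candidates, post):
    print(f"P(heads probability = {p} | data) = {prob:.4f}")
```

Note that neither number can be computed without stating a hypothesis first: the tail probability needs the fair-coin assumption, and the posterior needs a set of candidate hypotheses and priors.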
In any case, data and hypothesis are two fundamental parts of both Bayesian and frequentist statistics. Emphasizing data at the expense of hypothesis means that we ignore the actual thinking and end up with trivial or arbitrary statements and spurious relationships emerging by chance, maybe even with plenty of publications, but with no real understanding. This is the ultimate and unfortunate fate of all data miners. I should note that the opposite is similarly dangerous: putting all the emphasis on hypotheses (the extreme case of hypothesis-driven science) can lead to lunatic abstractions disconnected from what we observe. Good science keeps in mind both the empirical observations (data) and the theory (hypotheses, models).
Is it any good to have large data (high N)? In other words, does a high number of observations lead to better science? It doesn’t. Data have value only when confronted with a useful theory. Theories can get strong and robust support even from relatively small data (Fig. 1a, b). Hypotheses and relationships that need very large data to be demonstrated (Fig. 1c, d) are weak hypotheses and weak relationships. Testing simple theories is more of a hassle with very large data than with small data, especially in the computationally intensive Bayesian framework. Finally, the collection, storage and handling of very large data cost a lot of effort, time and money.
Figure 1. Strong effects (the slope of the linear model y=f(x)) can get strong support even from small data (a). Collecting more data does not increase the support very much (b) and is just a waste of time, effort, storage space and money. Weak effects will find no support in small data (c) and will be supported only by very large datasets (d). In the case of (d) there is such a large amount of unexplained variability, and the effect is so weak, that the hypothesis that y=f(x) does not seem very interesting; there is probably some not yet imagined cause of the variability. Note that as a Bayesian I can afford to speak about direct support for hypotheses (unlike frequentists, who can only reject them).
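The four panels can be imitated with a small simulation. This is a sketch under my own assumptions, not the code behind Figure 1: the slopes (5.0 and 0.5), the sample sizes (20 and 5000), the noise level and the function names are all invented for illustration. For simplicity it uses ordinary least squares and reports the slope with its standard error as a rough stand-in for strength of support; a Bayesian fit with flat priors should lead to a similar picture for the slope.

```python
import math
import random

def simulate(n, slope, noise_sd=1.0, seed=0):
    """Draw n points from y = slope * x + Gaussian noise, with x uniform on [0, 1]."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y

def fit_line(x, y):
    """Ordinary least squares; returns (slope, standard error of slope, R squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, std_err, r_squared

# Assumed scenarios loosely mirroring panels (a) to (d): (sample size, true slope).
scenarios = {
    "a": (20, 5.0),     # strong effect, small data
    "b": (5000, 5.0),   # strong effect, large data
    "c": (20, 0.5),     # weak effect, small data
    "d": (5000, 0.5),   # weak effect, large data
}

results = {}
for label, (n, true_slope) in scenarios.items():
    x, y = simulate(n, true_slope, seed=42)
    results[label] = fit_line(x, y)
    est, se, r2 = results[label]
    print(f"({label}) N={n:5d}  slope={est:6.2f} +/- {se:.2f}  R^2={r2:.3f}")
```

In (b) the slope estimate becomes more precise than in (a), but the qualitative conclusion that the effect exists is already available from the small sample. In (d) the slope is distinguishable from zero only thanks to the large N, while R squared stays close to zero: almost all of the variability remains unexplained.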
My final argument is that data are not always an accurate representation of what they try to measure. Especially in the life sciences and social sciences (the “messy” fields), data are regularly contaminated by measurement errors, subjective biases, incomplete coverage, non-independence, detectability problems, aggregation problems, poor metadata, nomenclature problems and so on. Collecting more data may amplify such problems and can lead to spurious patterns. On the other hand, if the theory-driven approach is adopted, these biases can be made an integral part of the model, fitted to the data, and accounted for. What is then visualized are not the raw biased data but the (hopefully) unbiased model predictions of the real process of interest.
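As one illustration of building a bias into the model, consider the detectability problem. The sketch below is my own assumed example, not an analysis from any real dataset: sites are occupied by a species with some probability, each site is visited several times, and the species is detected on a visit only with some probability even when it is present. The raw fraction of sites with at least one detection underestimates occupancy; a simple model that includes the detection process, fitted here by a crude grid search over the likelihood, can recover it. All names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_surveys(n_sites, n_visits, occupancy, detection, seed=0):
    """Return the number of detections per site under imperfect detection."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sites):
        occupied = rng.random() < occupancy
        detections = 0
        if occupied:
            detections = sum(rng.random() < detection for _ in range(n_visits))
        counts.append(detections)
    return counts

def log_likelihood(histogram, n_visits, occupancy, detection):
    """Log-likelihood of detection counts, given occupancy and detection probabilities."""
    total = 0.0
    for d, n_sites in histogram.items():
        if d > 0:
            # Detected at least once: the site must be occupied.
            site_prob = (occupancy * math.comb(n_visits, d)
                         * detection ** d * (1.0 - detection) ** (n_visits - d))
        else:
            # Never detected: occupied but always missed, or truly unoccupied.
            site_prob = occupancy * (1.0 - detection) ** n_visits + (1.0 - occupancy)
        total += n_sites * math.log(site_prob)
    return total

def fit_occupancy(counts, n_visits, grid_size=99):
    """Grid-search maximum likelihood estimate of (occupancy, detection)."""
    histogram = Counter(counts)
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(((psi, p) for psi in grid for p in grid),
               key=lambda pair: log_likelihood(histogram, n_visits, pair[0], pair[1]))

# Assumed true values: 60% of sites occupied, 30% detection chance per visit.
visits = 4
counts = simulate_surveys(n_sites=2000, n_visits=visits,
                          occupancy=0.6, detection=0.3, seed=7)

naive = sum(d > 0 for d in counts) / len(counts)
psi_hat, p_hat = fit_occupancy(counts, visits)

print(f"naive occupancy (raw data): {naive:.2f}")
print(f"model-based occupancy:      {psi_hat:.2f} (detection {p_hat:.2f})")
```

Collecting more sites would not fix the naive estimate; it would only make the biased number more precise. The correction comes from the model of how the data arise, not from N.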
So why do many scientists find data-driven research and large data exciting? It has nothing to do with science. The desire to have datasets as large as possible and to create giant data mines is driven by our instinctive craving for plenty (richness), and by a boyish tendency to have a “bigger” toy (car, gun, house, pirate ship, database) than anyone else. And whoever guards the vaults of data holds power over all of the other scientists who crave the data.
But most importantly, data-driven science is less intellectually demanding than hypothesis-driven science. Data mining is sweet; anyone can do it. Plotting multivariate data, maps, “relationships” and colorful visualizations is hip and catchy, and everybody can understand it. By contrast, thinking about theory can be a pain, and it requires a rare commodity: imagination.