The Art of Exploratory Data Analysis

Ron Pearson (aka TheNoodleDoodler)

11 years ago

[This article was first published on ExploringDataBlog, and kindly contributed to R-bloggers].

This blog is about the art of exploratory data analysis, which is also the subject of my new book, Exploring Data in Engineering, the Sciences, and Medicine (http://www.oup.com/us/ExploringData). This art is appropriate in situations where you are faced with an existing dataset that you want to understand better. As Stanford University statistics professor Persi Diaconis describes it in “Theories of data analysis: From magical thinking through classical statistics” (Chapter 1 of Exploring Data Tables, Trends, and Shapes, D.C. Hoaglin, F. Mosteller, and J.W. Tukey, eds., Wiley, 1985), this art involves the following activities:

We look at numbers and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.

Later posts will discuss these ideas in more detail, but the objective of this first post is to set the stage for those discussions by emphasizing three points. First, data analysis is an exercise in approximation rather than a purely mathematical activity that leads to definite, correct answers. In mathematics, any answer other than “4” to the question, “how much is 2 + 2?” is simply wrong (barring tricks, of course, like casting the problem in base 3 arithmetic). Similarly, other mathematical questions like “what is the square root of 3?” or “what is the probability of drawing a sample from a Gaussian distribution that lies 2 or more standard deviations away from the mean?” also have clear, correct answers. In contrast, data analysis questions like, “how is the waiting time between eruptions related to the duration of eruptions for the Old Faithful geyser in Yellowstone National Park?” do not have similarly well-defined answers. Certainly, we can provide quantitative characterizations that attempt to describe this relationship (for example, the product-moment correlation coefficient between these values computed from a set of 272 observations is approximately 0.901), but this can usually be done in more than one way (e.g., the Spearman rank correlation coefficient between these same numbers is somewhat smaller, at 0.778), and however we compute these numbers they are generally only partial characterizations. The practical consequence of this difference between data analysis problems and purely mathematical problems is the necessity of somehow dealing with unexplainable variation in the data.
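To make the point about competing characterizations concrete, here is a minimal sketch in Python (standard library only, rather than the R used for the results in this post). It computes the product-moment and Spearman rank correlations for a small made-up dataset; the data values and the helper names pearson, ranks, and spearman are my own illustrative choices, not the Old Faithful measurements or any particular library's API.

```python
import math


def pearson(x, y):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data (NOT the Old Faithful measurements): y grows exponentially
# in x, so the relationship is perfectly monotone but far from linear.
x = [0.5 * i for i in range(9)]
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

In this made-up case the rank correlation is the larger of the two, the reverse of the Old Faithful result quoted above, which is exactly the point: both numbers are legitimate summaries of "how related" two variables are, and neither is the single correct answer.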

This need for approximation in analyzing real-world data has been widely recognized, and it motivates the widespread use of probability and statistics in analyzing data. This is not the only possible approach (set-theoretic methods based on the “unknown but bounded” uncertainty model are also possible, for example, although they are not nearly as popular or as well developed as statistical methods), and there are even those who are unwilling to accept uncertainty in describing real-world data. One colorful example is Alfred William Lawson’s Principle of Zig-Zag-and-Swirl, discussed briefly in my book and in more detail in L.D. Henry’s biography of Lawson, Zig-Zag-and-Swirl (see the Other Interesting Books section of The Exploring Data Store at the end of this post for details). Lawson dabbled in many things, from playing and managing minor league baseball to writing a Utopian novel (the late Martin Gardner characterized it as “the worst work of fiction ever published”). He is credited with introducing the term “aircraft” into general use not long after the Wright brothers’ first flight, and he obtained the first U.S. airmail contract before his company fell into receivership. He is of interest here because he attempted to develop his own laws of physics, advocating the development of something he termed “Supreme Mathematics” to deal with the complexity of the real world. As an illustrative example, he asked us to consider “a germ, moving across a blood corpuscle in the body of a man who is walking down the aisle of a flying airplane.” To describe the true complexity of the germ’s path, including the man’s motion in the airplane, the airplane’s flight over the earth, the earth’s orbit around the sun, and so forth, Lawson argued that ordinary mathematics was inadequate and needed to be replaced by his Supreme Mathematics.

The fundamental difficulty is that, if we are really attempting to describe anything interesting about the germ, we probably don’t care at all about whatever eighteenth-order influence Jupiter’s gravity may have on the germ’s motion. If we are willing to live with approximations, we can avoid the necessity of Supreme Mathematics.

To proceed, then, we need to make working assumptions about the nature of our approximations, providing practical descriptions of the uncertainty inherent in our data. As noted, this data approximation problem can be approached in a number of different ways, but the most popular approach is via probability theory, describing our data uncertainty with some sort of random variable model. To make this idea useful, we need to adopt a specific random variable model, which brings us to the second key aspect of data analysis considered here: exactly how do we describe data uncertainty? Historically, the Gaussian distribution has been adopted so extensively as a data characterization that it has frequently been simply assumed without question. In fact, a contemporary of Poincaré’s once observed that:

Experimentalists tend to regard it as a mathematical result that data values obey a Gaussian distribution, whereas mathematicians tend to regard it as an experimental result.

In practice, this assumption sometimes represents a reasonable approximation of reality, but sometimes it is horribly inadequate. This point is important because adopting the Gaussian distribution as a working assumption leads to a lot of very useful data analysis techniques (ordinary least squares regression methods, to name only one example), but in cases where the Gaussian distribution is a poor approximation, the results obtained using these methods can be extremely misleading.

The third key point of this post is that simple graphical tools for assessing the reasonableness of popular working assumptions like this one are extremely useful in exploratory data analysis, as are alternative analysis approaches that perform better when these assumptions are inappropriate. Future posts will describe some of these tools and alternatives in greater detail, but the following example provides a useful illustration of both how we might assess real data distributions and the fact that these distributions can sometimes be very non-Gaussian. Both of these points are illustrated in the four plots in the figure below. The upper two plots show nonparametric density estimates, computed from two different datasets. The curves in these plots are estimates of the probability density function appropriate to these datasets, if we regard each record in the dataset as a statistically independent random sample, drawn from some unknown probability distribution. (If you are familiar with histograms, think of these curves as “smoothed histograms.”)
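For readers who want to see what a “smoothed histogram” amounts to computationally, here is a minimal sketch of a Gaussian kernel density estimate in Python (standard library only). Several things here are my assumptions rather than anything stated in this post: the Gaussian kernel, the use of Silverman's rule of thumb for the bandwidth, the function names, and above all the data, which is a made-up two-component sample and not one of the datasets behind the figure.

```python
import math
import random
import statistics


def rule_of_thumb_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(statistics.stdev(data), (q3 - q1) / 1.34)
    return 0.9 * spread * len(data) ** (-0.2)


def kernel_density(data, bandwidth):
    """Return a function t -> estimated density at t, using a Gaussian kernel."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        # Each observation contributes one small Gaussian bump centered on it.
        return scale * sum(math.exp(-0.5 * ((t - v) / bandwidth) ** 2) for v in data)

    return density


# Made-up bimodal sample (NOT real measurements): a mixture of two
# Gaussian components centered at 54 and 80.
rng = random.Random(1)
sample = [rng.gauss(54, 5) for _ in range(100)] + [rng.gauss(80, 5) for _ in range(170)]

bandwidth = rule_of_thumb_bandwidth(sample)
density = kernel_density(sample, bandwidth)

for t in (54, 67, 80):
    print(f"estimated density at {t}: {density(t):.4f}")

# A density estimate should integrate to about 1 (simple rectangle rule here).
step = 0.5
area = sum(density(20 + step * k) for k in range(241)) * step
print(f"area under the curve: {area:.3f}")
```

Evaluating the returned function on a fine grid and plotting the result gives a curve of the kind described above. The bandwidth controls how much smoothing is applied, so it is worth trying a few values before reading much into the number of peaks.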

The upper left plot shows the density estimated from the measurements of the carapace length of 200 crabs, from the crabs dataset included in the MASS package that is part of the open-source R programming language discussed briefly at the end of this post. In this case, the estimated probability density appears reasonably similar to the “bell-shaped curve” that characterizes the Gaussian probability distribution, so this working assumption may be fairly reasonable here. This assumption is further supported by the bottom left plot, which shows the normal quantile-quantile (Q-Q) plot constructed from these 200 measurements. The construction and interpretation of these plots are described in detail in my book and in various other places (see, for example, the Wikipedia entry on Q-Q plots at http://en.wikipedia.org/wiki/Q-Q_plot), but the key point here is that they represent an informal graphical test of the assumption that the Gaussian distribution is a reasonable approximation. If the data values fall approximately on the reference line, as they do in this plot, the Gaussian assumption may be a reasonable basis for analysis. Conversely, the plot on the lower right shows marked deviations from this reference line, suggesting that the Gaussian distribution may not be reasonable there. This plot was constructed from 272 measurements of the waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park, and the nonparametric density estimate shown in the upper right plot was also constructed from this data sequence. (These numbers were taken from the faithful dataset that is included as part of the base installation of the R programming language.) This upper plot shows two pronounced peaks, one at approximately 50 minutes and the other at approximately 80 minutes, suggesting that the distribution of waiting times is bimodal, consistent with the “kink” seen in the Q-Q plot below it.

The point here is that, while the Gaussian distribution may be reasonable as an approximate description of the crab carapace length data, it is not at all reasonable for the Old Faithful waiting time data.
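The construction behind a normal Q-Q plot is simple enough to sketch in a few lines of Python (standard library only). The sketch below pairs the sorted data values with standard normal quantiles and then summarizes how straight the resulting points are with a correlation coefficient. The plotting positions (i - 0.5) / n, the straightness summary, the function names, and both samples are my own assumptions for illustration; the samples are made up and are not the crabs or Old Faithful data.

```python
import math
import random
from statistics import NormalDist, mean


def normal_qq_points(data):
    """Pair each sorted data value with the matching standard normal quantile."""
    n = len(data)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common choice; others exist.
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(data)


def straightness(theoretical, observed):
    """Correlation between the two axes of the Q-Q plot (1.0 = perfectly straight)."""
    mean_t, mean_o = mean(theoretical), mean(observed)
    sto = sum((t - mean_t) * (o - mean_o) for t, o in zip(theoretical, observed))
    stt = sum((t - mean_t) ** 2 for t in theoretical)
    soo = sum((o - mean_o) ** 2 for o in observed)
    return sto / math.sqrt(stt * soo)


# Made-up samples (NOT the crabs or Old Faithful data): one drawn from a
# single Gaussian, one from a mixture of two well-separated Gaussians.
rng = random.Random(2)
unimodal = [rng.gauss(32, 7) for _ in range(200)]
bimodal = [rng.gauss(54, 5) for _ in range(100)] + [rng.gauss(80, 5) for _ in range(172)]

r_unimodal = straightness(*normal_qq_points(unimodal))
r_bimodal = straightness(*normal_qq_points(bimodal))
print(f"unimodal sample: {r_unimodal:.4f}")
print(f"bimodal sample:  {r_bimodal:.4f}")
```

Plotting the two lists returned by normal_qq_points against each other gives the Q-Q plot itself. The single-number summary is only a convenience: the plot shows where the points leave the reference line, which the correlation alone cannot.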

In future posts, I plan to discuss some of these ideas in more detail, illustrating them with examples. In no particular order, a few of the topics that appear on my radar are:

Metadata (the information describing the contents of a dataset that is all too often missing, incomplete, or incorrect);
Boxplots, modified boxplots, violinplots, and beanplots – useful tools for characterizing the range of variation of a numerical variable over different data subgroups;
Various types of data anomalies, including outliers, inliers, missing data, and misalignment errors: what they are, how to detect them, and what to do about them;
Interestingness measures as useful characterizations of categorical variables;
Data transformations and the things they can do, both expected and unexpected, sometimes good and sometimes very bad;
And anything else related to exploratory data analysis that strikes me as interesting along the way.

The results presented here were obtained using the open-source software package R, which I like a lot because it is both freely available and fabulous in the range of computational tools it provides. For details, documentation, and download instructions, go to the Comprehensive R Archive Network (CRAN) at the following Web site:

http://cran.r-project.org/

I should emphasize, however, that this blog is not about the R programming language per se, but rather about the kinds of data analysis that R supports extremely well.

I hope you find this blog useful and interesting, and I welcome your comments and questions. I can’t promise to respond right away, but I do promise to carefully consider any questions and comments that come my way and to respond to them when and how I can.
