(This article was first published on **ExploringDataBlog**, and kindly contributed to R-bloggers)

*Exploring Data in Engineering, the Sciences, and Medicine* (http://www.oup.com/us/ExploringData). This art is appropriate in situations where you are faced with an existing dataset that you want to understand better. As

*Exploring Data Tables, Trends, and Shapes*, D.C. Hoaglin, F. Mosteller, and J.W. Tukey, eds., Wiley, 1985), this art involves the following activities:

We look at numbers and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.

This need for approximation in analyzing real-world data has been widely recognized, and it motivates the widespread use of probability and statistics in analyzing data. This is not the only possible approach – set-theoretic methods based on the “unknown but bounded” uncertainty model are also possible, for example, although they are not nearly as popular or as well developed as statistical methods – and there are even those who are unwilling to accept uncertainty in describing real-world data. One colorful example is Alfred William Lawson’s Principle of Zig-Zag-and-Swirl, discussed briefly in my book and in more detail in L.D. Henry’s biography of Lawson, *Zig-Zag-and-Swirl* (see the *Other Interesting Books* section of *The Exploring Data Store* at the end of this post for details). Lawson dabbled in many things, from playing and managing minor league baseball to writing a Utopian novel (the late Martin Gardner characterized it as “the worst work of fiction ever published”). He is credited with introducing the term “aircraft” into general use not long after the Wright brothers’ first flight, and he obtained the first

*specific* random variable model, which brings us to the second key aspect of data analysis considered here: exactly how do we describe data uncertainty? Historically, the Gaussian distribution has been adopted so extensively as a data characterization that it has frequently been simply assumed without question. In fact, a contemporary of Poincaré’s once observed that:

Experimentalists tend to regard it as a mathematical result that data values obey a Gaussian distribution, whereas mathematicians tend to regard it as an experimental result.

*sometimes* represents a reasonable approximation of reality, but sometimes it is horribly inadequate. This point is important because adopting the Gaussian distribution as a working assumption leads to a lot of very useful data analysis techniques (ordinary least squares regression methods, to name only one example), but in cases where the Gaussian distribution is a poor approximation, the results obtained using these methods can be extremely misleading.

**crabs** dataset included in the **MASS** package that is part of the open-source *R* programming language discussed briefly at the end of this post. In this case, the estimated probability density appears reasonably similar to the “bell-shaped curve” that characterizes the Gaussian probability distribution, so this working assumption may be fairly reasonable here. This assumption is further supported by the bottom left plot, which shows the normal quantile-quantile (Q-Q) plot constructed from these 200 measurements. The construction and interpretation of these plots are described in detail in my book and in various other places (see, for example, the Wikipedia entry on Q-Q plots at http://en.wikipedia.org/wiki/Q-Q_plot), but the key point here is that they represent an informal graphical test of the assumption that the Gaussian distribution is a reasonable approximation. If the data values fall approximately on the reference line – as in this plot – the Gaussian assumption may be a reasonable basis for analysis. Conversely, the plot on the lower right shows marked deviations from this reference line, suggesting that the Gaussian distribution may not be reasonable there. This plot was constructed from 272 measurements of the waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park (the **faithful** dataset that is included as part of the base installation of the *R* programming language). The upper plot for these data shows two pronounced peaks – one at approximately 50 minutes and the other at approximately 80 minutes – suggesting that the distribution of waiting times is bimodal, consistent with the “kink” seen in the Q-Q plot below it. The point here is that, while the Gaussian distribution may be reasonable as an approximate description of the crab carapace length data, it is not at all reasonable for the Old Faithful waiting time data.
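For readers who want to reproduce plots of this kind, the following is a minimal sketch in R. It assumes that the carapace length is the `CL` column of `crabs` and the waiting time is the `waiting` column of `faithful`. The plotting choices (base `density()` with its default bandwidth, `qqnorm()`, and `qqline()`) are my assumptions for illustration and are not necessarily the functions used to produce the original figure.

```r
# Sketch only: density estimates (top row) and normal Q-Q plots (bottom row)
# for the two datasets discussed above. Function choices are assumptions.
library(MASS)                    # provides the crabs data frame

carapace <- crabs$CL             # carapace length (assumed column: CL)
waiting  <- faithful$waiting     # waiting time between eruptions, in minutes

op <- par(mfrow = c(2, 2))       # 2 x 2 grid of plots, filled row by row

# Top row: nonparametric density estimates with the default bandwidth
plot(density(carapace), main = "Crab carapace length")
plot(density(waiting),  main = "Old Faithful waiting time")

# Bottom row: normal Q-Q plots with a reference line through the quartiles
qqnorm(carapace, main = "Carapace length: normal Q-Q")
qqline(carapace)
qqnorm(waiting, main = "Waiting time: normal Q-Q")
qqline(waiting)

par(op)                          # restore the previous plotting settings
```

Note that `qqline()` draws its reference line through the first and third quartiles, so points that bend away from it in the middle of the plot, as the waiting times do, are the kind of deviation described above.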

- Metadata (the information describing the contents of a dataset that is all too often missing, incomplete, or incorrect);
- Boxplots, modified boxplots, violinplots, and beanplots – useful tools for characterizing the range of variation of a numerical variable over different data subgroups;
- Various types of data anomalies, including outliers, inliers, missing data, and misalignment errors: what they are, how to detect them, and what to do about them;
- Interestingness measures as useful characterizations of categorical variables;
- Data transformations and the things they can do, both expected and unexpected, sometimes good and sometimes very bad;
- And anything else related to exploratory data analysis that strikes me as interesting along the way.

*R*, which I like a lot because it is both freely available and fabulous in the range of computational tools it provides. For details, documentation, and download instructions, go to the Comprehensive R Archive Network (CRAN) at the following Web site:

*not* about the R programming language *per se*, but rather about the kinds of data analysis that R supports extremely well.
