In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization. One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by: p(x) = a k^a/...

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions. The post generated three interesting comments that ...
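As a rough illustration of the first two situations, here is a minimal Python sketch. The numbers are made up for this example and do not come from the post, and the variable names are purely illustrative. It compares the average with the median for a sequence containing one severe outlier, and shows an average that falls between the two clusters of a bimodal sequence.

```python
import statistics

# Hypothetical data: nine typical values plus one severe outlier.
with_outlier = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]
outlier_mean = statistics.mean(with_outlier)      # pulled far above the bulk of the data
outlier_median = statistics.median(with_outlier)  # stays with the bulk of the data

# Hypothetical bimodal data: two clusters, near 1-2 and near 9-10.
bimodal = [1, 1, 2, 2, 9, 9, 10, 10]
bimodal_mean = statistics.mean(bimodal)  # lands between the clusters

print(outlier_mean, outlier_median, bimodal_mean)
```

In the first sequence the single outlier drags the average well away from every "typical" value, while in the second the average sits in a region where no observations occur at all.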

Of all possible single-number characterizations of a data sequence, the average is probably the best known. It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers. It is not the only such “typical value,” however, nor ...

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful. Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine. One important topic I haven’t covered is how ...

In response to my last post, Chris had the following comment: “I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work. You seem to say mixture models apply to a small set of models – namely regression models.” This comment suggests that ...

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly. An extremely useful idea in working with datasets like ...
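One well-known way such estimators can break down is when more than half of the observations share a single value. The sketch below is only an illustration under stated assumptions: the values are invented (they are not the rainfall dataset from the post), the helper name `madm` is hypothetical, and the factor 1.4826 is the conventional constant that makes the estimator agree with the standard deviation for Gaussian data.

```python
import statistics

def madm(values):
    """Scaled median absolute deviation from the median (illustrative helper)."""
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    return 1.4826 * statistics.median(deviations)

# Hypothetical measurements in which most observations are exactly zero.
mostly_zero = [0, 0, 0, 0, 0, 0, 0.2, 1.5, 3.1]

median_value = statistics.median(mostly_zero)  # zero, since most values are zero
madm_scale = madm(mostly_zero)                 # also zero: the scale estimate collapses
std_dev = statistics.stdev(mostly_zero)        # positive, for comparison

print(median_value, madm_scale, std_dev)
```

Because the median and more than half of the deviations from it are all zero, the MADM scale estimate is zero even though the data clearly vary.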

Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine. That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition, that ...

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.” He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment:
“Our belief is that a useful measure of interestingness ...

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset. I did not, however, describe those co...

In my last two posts, I have used the UCI mushroom dataset to illustrate two things. The first was the use of interestingness measures to characterize categorical variables, and the second was the use of binary confidence intervals...

In my last post, I considered the UCI mushroom dataset and characterized the variables included there using four different interestingness measures. When I began drafting this post, my intention was to consider the question of how the different m...

Probably because I first encountered them somewhat late in my professional life, I am fascinated by categorical data types. Without question, my favorite book on the subject is Alan Agresti’s Categorical Data Analysis (Wiley Series in Probabili...

My last four posts have dealt with boxplots and some useful variations on that theme. Just after I finished the series, Tal Galili, who maintains the R-bloggers website, pointed me to a variant I hadn’t seen before. It's called a bee...

This post is the last in a series of four on boxplots and some of their extensions. Previous posts in this series have discussed basic boxplots, modified boxplots based on a robust asymmetry measure, and violin plots, an alternative that essentia...

This post is the third in a series of four on boxplots and closely related data visualization techniques for comparing subsets of a dataset, or comparing different datasets that we hope or expect to be similarly distributed. The previous two post...

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R. As I noted in that post, the basic boxplot is both useful...

Boxplots are a simple and reasonably popular way of summarizing the range of variation of a real-valued variable across different subsets of data. Typical examples might include diastolic blood pressure across a group of patients, broken dow...

This blog is about the art of exploratory data analysis, which is also the subject of my new book, Exploring Data in Engineering, the Sciences, and Medicine (http://www.oup.com/us/ExploringData). This art is appropriate in situations where y...

