Blog Archives

The Long Tail of the Pareto Distribution

In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.  One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:                        p(x) = aka/xa+1defined for all x > k, where k and a...

Some Additional Thoughts on Useless Averages

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions.  The post generated three interesting comments that I want to respond to here.First and foremost, I...

When are averages useless?

Of all possible single-number characterizations of a data sequence, the average is probably the best known.  It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers.  It is not the only such “typical value,” however, nor is it always the most useful one: two other...

Fitting mixture distributions with the R package mixtools

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.  Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  One important topic I haven’t covered is how to fit mixture models to datasets like the Old Faithful geyser...

Mixture distributions and models: a clarification

In response to my last post, Chris had the following comment:           I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work.  You seem to say mixture models apply to a small set of models – namely regression models.This comment suggests that my caution about the difference between mixed-effect models and mixture distributions...

A Brief Introduction to Mixture Distributions

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.  An extremely useful idea in working with datasets like this one is that of mixture distributions,...

The pros and cons of robust data characterizations

Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine.  That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition

The distribution of interestingness

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.”  He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment: “Our belief is that a useful measure of interestingness should generate index values that are reasonably distributed throughout...

The distribution of interestingness

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.”  He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment: “Our belief is that a useful measure of interestingness should generate index values that are reasonably distributed throughout...

Computing Odds Ratios in R

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset.  I did not, however, describe those co...