# Articles by Ron Pearson (aka TheNoodleDoodler)

### The Long Tail of the Pareto Distribution

September 17, 2011 |

In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.  One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by: p(x) = a k^a / ...

### Some Additional Thoughts on Useless Averages

August 27, 2011 |

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions.  The post generated three interesting comments that ...

### When are averages useless?

August 20, 2011 |

Of all possible single-number characterizations of a data sequence, the average is probably the best known.  It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers.  It is not the only such “typical value,” however, nor ...

### Fitting mixture distributions with the R package mixtools

August 6, 2011 |

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.  Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  One important topic I haven’t covered is how ...

### Mixture distributions and models: a clarification

July 16, 2011 |

In response to my last post, Chris had the following comment: “I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work.  You seem to say mixture models apply to a small set of models – namely regression models.”  This comment suggests that ...

### A Brief Introduction to Mixture Distributions

June 18, 2011 |

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.  An extremely useful idea in working with datasets like ...

### The pros and cons of robust data characterizations

June 6, 2011 |

Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine.  That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition, that ...

### The distribution of interestingness

May 21, 2011 |

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.”  He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment: “Our belief is that a useful measure of interestingness ...

### Computing Odds Ratios in R

May 7, 2011 |

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset.  I did not, however, describe those co...

### Measuring association using odds ratios

April 23, 2011 |

In my last two posts, I have used the UCI mushroom dataset to illustrate two things.  The first was the use of interestingness measures to characterize categorical variables, and the second was the use of binary confidence intervals...

### Screening for predictive characteristics … and a mea culpa

April 12, 2011 |

In my last post, I considered the UCI mushroom dataset and characterized the variables included there using four different interestingness measures.  When I began drafting this post, my intention was to consider the question of how the different m...

### Interestingness Measures

April 3, 2011 |

Probably because I first encountered them somewhat late in my professional life, I am fascinated by categorical data types.  Without question, my favorite book on the subject is Alan Agresti’s Categorical Data Analysis (Wiley Series in Probabili...

### The Many Uses of Q-Q Plots

March 23, 2011 |

My last four posts have dealt with boxplots and some useful variations on that theme.  Just after I finished the series, Tal Galili, who maintains the R-bloggers website, pointed me to a variant I hadn’t seen before.  It's called a bee...

### Boxplots & Beyond IV: Beanplots

March 5, 2011 |

This post is the last in a series of four on boxplots and some of their extensions.  Previous posts in this series have discussed basic boxplots, modified boxplots based on a robust asymmetry measure, and violin plots, an alternative that essentia...

### Boxplots and Beyond III: Violin Plots

February 15, 2011 |

This post is the third in a series of four on boxplots and closely related data visualization techniques for comparing subsets of a dataset, or comparing different datasets that we hope or expect to be similarly distributed.  The previous two post...

### Boxplots and Beyond – Part II: Asymmetry

February 6, 2011 |

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R.  As I noted in that post, the basic boxplot is both useful...

### Boxplots and Beyond – Part I

January 29, 2011 |

Boxplots are a simple and reasonably popular way of summarizing the range of variation of a real-valued variable across different subsets of data.  Typical examples might include diastolic blood pressure across a group of patients, broken dow...