Blog Archives

Classifying the UCI mushrooms

In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems.  Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the UCI mushroom dataset.  The proposed variable selection strategy was to compare these values when computed from only edible mushrooms...

Interestingness comparisons

In three previous posts (April 3, 2011, April 12, 2011, and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables.  Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences and Medicine: the Bray measure, the Gini measure, the Shannon measure, and the Simpson measure.  All four of...

David Olive’s median confidence interval

As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence.  Relative to the better-known mean, the primary advantage of the median is its much reduced outlier sensitivity.  This post briefly describes a simple confidence interval for the median that is discussed in a paper...

Gastwirth’s location estimator

The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data.  The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of an otherwise reasonable data analysis.  This outlier-sensitivity...

Measuring associations between non-numeric variables

It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated?  In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century.  For variables that are ordered but not necessarily numeric...

Moving window filters and the pracma package

In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the R package pracma.  In this post, I briefly discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the filter’s local outlier detection threshold, and the question of...

Cleaning time-series and other data streams

The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.  Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations.  As a specific example, the figure below shows four data sequences:...

Harmonic means, reciprocals, and ratios of random variables

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian.  For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.  Since the harmonic...

The Zipf and Zipf-Mandelbrot distributions

In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data.  Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as...

Is the “Long Tail” a Useless Concept?

In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment: “Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto and Zipf apart; even the negative exponential and Gaussian...
