Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Probably because I first encountered them somewhat late in my professional life, I am fascinated by categorical data types.  Without question, my favorite book on the subject is Alan Agresti’s Categorical Data Analysis (Wiley Series in Probability and Statistics), which provides a well-integrated, comprehensive treatment of the analysis of categorical variables.  One of the topics that Agresti does not discuss, however, is that of interestingness measures, a useful quantitative characterization of categorical variables that comes from the computer science literature rather than the statistics literature.  As I discuss in Chapter 3 of Exploring Data in Engineering, the Sciences, and Medicine, interestingness measures are essentially numerical characterizations of the extent to which a categorical variable is uniformly distributed over its range of possible values: variables where the levels are equally represented are deemed “less interesting” than those whose distribution varies widely across these levels.  Many different interestingness measures have been proposed, and Hilderman and Hamilton give an excellent survey, describing 13 different measures in detail.  (I have been unable to find a PDF version on-line, but the reference is R.J. Hilderman and H.J. Hamilton, “Evaluation of interestingness measures for ranking discovered knowledge,” in Proceedings of the 5th Asia-Pacific Conference on Knowledge Discovery and Data Mining, D. Chueng, G.J. Williams, and Q. Li, eds., Hong Kong, April, 2001, pages 247 to 259.)  In addition, the authors present five behavioral axioms for characterizing interestingness measures.  In Exploring Data, I consider four normalized interestingness measures that satisfy the following three of Hilderman and Hamilton’s axioms:

1.   The minimum value principle: the measure exhibits its minimum value when the variable is uniformly distributed over its range of possible values.  For the normalized measures considered here, this means the measure takes the value 0.
2.   The maximum value principle: the measure exhibits its maximum value when the variable is completely concentrated on one of its several possible values.  For the normalized measures considered here, this means the measure takes the value 1.
3.   The permutation-invariance principle: re-labeling the levels of the variable does not change the value of the interestingness measure.

To compute the interestingness measures considered here, it is necessary to first compute the empirical probabilities that a variable assumes each of its possible values.  To be specific, assume that the variable x can take any one of M possible values, and let pi denote the fraction of the N observed samples of x that assume the ith possible value.  All of the interestingness measures considered here attempt to characterize the extent to which these empirical probabilities are constant, i.e. the extent to which pi is approximately equal to 1/M for all i.  Probably the best known of the four interestingness measures I consider is the Shannon entropy from information theory, which is based on the average of the product pi log pi over all i.  A second measure is the normalized version of Gini’s mean difference from statistics, which is the average distance that pi lies from pj for all i distinct from j, and a third – Simpson’s measure – is a normalized version of the variance of the pi values.  The fourth characterization considered in Exploring Data is Bray’s measure, which comes from ecology and is based on the average of the smaller of pi and 1/M.  The key point here is that, because these measures are computed in different ways, they are sensitive to different aspects of the distributional heterogeneity of a categorical variable over its range of possible values.  Specifically, since all four of these measures assume the value 0 for uniformly distributed variables and 1 for variables completely concentrated on a single value, they can only differ for intermediate degrees of heterogeneity.

The R procedures bray.proc, gini.proc, shannon.proc, and simpson.proc are all available from the Exploring Data  companion website, each implementing the corresponding interestingness measure.  To illustrate the use of these procedures, they are applied here to the UCI Machine Learning Repository Mushroom dataset, which gives 23 categorical characterizations of 8,124 different species of mushrooms, taken from The Audubon Society Field Guide to North American Mushrooms (G.H. Lincoff, Pres., published by Alfred A. Knopf, New York, 1981).  A typical characterization is gill color, which exhibits 12 distinct values, each corresponding to a one-character color code (e.g., “p” for pink, “u” for purple, etc.).  To evaluate the four interestingness measures for any of these attributes, it is necessary to first compute its associated empirical probability vector.  This is easily done with the following R function:

ComputePvalues <- function(x){
xtab <- table(x)
pvect <- xtab/sum(xtab)
pvect
}

Given this empirical probability vector, the four R procedures listed above can be used to compute the corresponding interestingness measure.  As a specific example, the following sequence gives the values for the four interestingness measures for the seven-level variable “Habitat” from the mushroom dataset:

>      x <- mushroom$Habitat > pvect <- ComputePvector(x) > bray.proc(pvect) > 0.427212 > gini.proc(pvect) > 0.548006 > shannon.proc(pvect) > 0.189719 > simpson.proc(pvect) > 0.129993 These results illustrate the point noted above, that while all of these measures span the same range – from 0 for completely uniform level distributions to 1 for variables completely concentrated on one level – in general, the values of the four measures are different. These differences arise from the fact that the different interestingness measures weight specific types of deviations from uniformity differently. To see this, note that the seven levels of the habitat variable considered here have the following distribution: > table(mushroom$Habitat)
>            d         g           l       m         p          u       w
>   3148   2148   832   292   1144   368   192

It is clear from looking at these numbers that the seven different habitat levels are not all equally represented in this dataset, with the most common level (“d”) occurring about 15 times as often as the rarest level (“w”).  The average representation is approximately 1160, so the two most populous levels occur much more frequently than average, one level occurs with about average frequency (“p”), and the other four levels occur anywhere between half as often as average and one tenth as often.  It is clear that the Gini measure is the most sensitive to these deviations from homogeneity, at least for this example, while the Simpson measure is the least sensitive.  These observations raise the following question: to what extent is this behavior typical, and to what extent is it specific to this particular example?  The rest of this post examines this question further.

The four plots shown above compare these different interestingness measures for all of the variables included in the UCI mushroom dataset.  The upper left plot shows the Gini measure plotted against the Shannon measure for all 23 of these categorical variables.  The dashed line in the plot is the equality reference line: if both measures were identical, all points would lie along this line.  The fact that all points lie above this line means that – at least for this dataset – the Gini measure is always larger than the Shannon measure.  The upper right plot shows the Simpson measure plotted against the Shannon measure, and here it is clear that the Simpson measure is generally smaller than the Shannon measure, but there are exceptions.  Further, it is also clear from this plot that the differences between the Shannon and Simpson measures are substantially less than those between the Shannon and Gini measures, again, at least for this dataset.  The lower left plot shows the Bray measure plotted against the Shannon measure, and this plot has the same general shape as the one above for the Gini versus Shannon measures.  The differences between the Bray and Shannon measures appear slightly less than those between the Gini and Shannon measures, suggesting that the Bray measure is slightly smaller than the Gini measure.  The lower right plot of the Bray versus the Gini measures shows that this is generally true: some of the points on this plot appear to fall exactly on the reference line, while the others fall slightly below this line, implying that the Bray measure is generally approximately equal to the Gini measure, but where they differ, the Bray measure is consistently smaller.  Again, it is important to emphasize that the results presented here are based on a single dataset, and they may not hold in general, but these observations do what a good exploratory analysis should: they raise the question.

Another question raised by these results is whether there is anything special about the few points that violate the general rule that the Simpson measure is less than the Shannon measure.  The above figure repeats the upper right plot from the four shown earlier, with the Simpson measure plotted against the Shannon measure.  Here, the six points that correspond to binary characterizations are represented as solid circles, and they correspond to six of the eight points that fall above the line, implying that the Simpson measure is greater than the Shannon measure for these points.  The fact that these are the only binary variables in the dataset suggests – but does not prove – that the Simpson measure tends to be greater than the Shannon measure for binary variables.  Of the other two points above the reference line, one represents the only three-level characterization in the mushroom dataset (specifically, the point corresponding to “RingNum,” a categorical variable based on the number of rings on the mushroom, labeled on the plot).  The last remaining point above the equality reference line corresponds to one of four four-level characterizations in the dataset; the other three four-level characterizations fall below the reference line, as do all of those with more than four levels.  Taken together, these results suggest that the number of levels in a categorical variable influences the Shannon and Simpson measures differently.  In particular, it appears that those cases where the Simpson measure exceeds the Shannon measure tend to be variables with relatively few levels.

In fact, these observations raise a subtle but extremely important point in dealing with categorical variables.  The metadata describing the mushroom dataset indicates that the variable VeilType is binary, with levels universal (“u”) and partial (“p”), but the type listed for all 8,124 mushrooms in the dataset is “p.”  As a consequence, VeilType appears in this analysis as a one-level variable.  Since the basic numerical expressions for all four of the interestingness measures considered here all become indeterminate for one-level variables, this case is tested for explicitly in the R procedures used here, and it is assigned a value of 0: this seems reasonable, representing a classification of “fully homogeneous, evenly spread over the range of possible values,” as it is difficult to imagine declaring a one-level variable to be strongly heterogeneous or “interesting.”  Here, this result is a consequence of defining the number of levels for this categorical variable from the data alone, neglecting the possibility of other values that could be present but are not.  In particular, if we regard this variable as binary – in agreement with the metadata – all four of the interestingness measures would yield the value 1, corresponding to the fact that the observed values are fully concentrated in one of two possible levels.  This example illustrates the difference between internal categorization – i.e., determination of the number of levels for a categorical variable from the observed data alone – and external categorization, where the number of levels is specified by the metadata.  As this example illustrates, the difference can have important consequences for both numerical data characterizations and their interpretation.

Finally, the above figure shows the Gini measure plotted against the Shannon measure for all 23 of the mushroom characteristics, corresponding to the upper left plot in the first figure, but with three additions.  First, the solid curve represents the result of applying the lowess nonparametric smoothing procedure to the scatterplot of Gini versus Shanon measures.  Essentially, this curve fills in the line that your eye suggests is present, approximating the nonlinear relationship between these two measures that is satisfied by most of the points on the plot.  Second, the solid circle represents the variable whose results lie farthest from this curve, which is CapSurf, a four-level categorical variable describing the surface of the mushroom cap as fibrous (“f”), grooved (“g”), scaly (“y”), or smooth (“s”), distributed as follows: 2,320 “f,” 4 “g,” 2,556 “y,” and 3224 “s.”  Note that if the rare cagetory “g” were omitted from consideration, this variable would be almost uniformly distributed over the remaining three values, and this modified result (i.e., with the rare level “g” omitted from the analysis) corresponds to the solid square point that falls on the lowess curve at the lower left end.  Thus, the original CapSurf result appears to represent a specific type of deviation from uniformity that violates the curvilinear relationship that seems to generally hold – at least approximately – between the Gini and Shannon interestingness measures.

The real point of these examples has been two-fold.  First, I wanted to introduce the notion of an interestingness measure and illustrate the application of these numerical summaries to categorical variables, something that is not covered in typical introductory statistics or data analysis courses.  Second, I wanted to show that – in common with most other summary statistics – different data characteristics influence alternative measures differently.  Thus, while it appears that the Simpson measure is generally slightly smaller than the Shannon measure – at least for the examples considered here – this behavior depends strongly on the number of levels the categorical variable can assume, with binary variables consistently violating this rule of thumb (again, at least for the examples considered here).  Similarly, while it appears that the Gini measure always exceeds the Shannon measure, often substantially so, there appear to be certain types of heterogeneous data distributions for which this difference is substantially smaller than normal.  Thus, as is often the case, it can be extremely useful to compute similar characterizations of the same type and comapre them, since unexpected differences can illuminate features of the data that are not initially apparent and which may turn out to be interesting.