(This article was first published on **ExploringDataBlog**, and kindly contributed to R-bloggers)

*interestingness measures*, a useful quantitative characterization of categorical variables that comes from the computer science literature rather than the statistics literature. As I discuss in Chapter 3 of *Exploring Data in Engineering, the Sciences, and Medicine*, interestingness measures are essentially numerical characterizations of the extent to which a categorical variable is uniformly distributed over its range of possible values: variables where the levels are equally represented are deemed “less interesting” than those whose distribution varies widely across these levels. Many different interestingness measures have been proposed, and Hilderman and Hamilton give an excellent survey, describing 13 different measures in detail. (I have been unable to find a PDF version on-line, but the reference is R.J. Hilderman and H.J. Hamilton, “Evaluation of interestingness measures for ranking discovered knowledge,” in *Proceedings of the 5th Asia-Pacific Conference on Knowledge Discovery and Data Mining*, D. Cheung, G.J. Williams, and Q. Li, eds., Hong Kong, April, 2001, pages 247 to 259.) In addition, the authors present five behavioral axioms for characterizing interestingness measures. In *Exploring Data*, I consider four normalized interestingness measures that satisfy three of Hilderman and Hamilton’s axioms.

Let $p_i$ denote the fraction of the $N$ observed samples of $x$ that assume the $i$-th of its $M$ possible values. All of the interestingness measures considered here attempt to characterize the extent to which these empirical probabilities are constant, i.e. the extent to which $p_i$ is approximately equal to $1/M$ for all $i$. Probably the best known of the four interestingness measures I consider is the Shannon entropy measure, which is based on the sum of $p_i \log p_i$ over all $i$. A second measure is the normalized version of Gini’s mean difference from statistics, which is the average distance that $p_i$ lies from $p_j$ for all $i$ distinct from $j$, and a third, Simpson’s measure, is a normalized version of the variance of the $p_i$ values. The fourth characterization considered in *Exploring Data* is Bray’s measure, which comes from ecology and is based on the average of the smaller of $p_i$ and $1/M$. The key point here is that, because these measures are computed in different ways, they are sensitive to different aspects of the distributional heterogeneity of a categorical variable over its range of possible values. Specifically, since all four of these measures assume the value 0 for uniformly distributed variables and 1 for variables completely concentrated on a single value, they can only differ for intermediate degrees of heterogeneity.
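To make the verbal descriptions above concrete, here is a minimal sketch of the four measures in *R*. The function names and the specific normalizations are my assumptions: each is scaled so that it returns 0 for a uniform probability vector and 1 for a fully concentrated one, which matches the behavior described above, but these are not the procedures from the *Exploring Data* companion website and may differ from them in detail.

```r
# Hypothetical re-implementations for illustration only; these are NOT
# bray.proc, gini.proc, shannon.proc, or simpson.proc.
# Each takes a probability vector p (non-negative, summing to 1) over M levels.
# Assumed normalization: 0 for uniform p, 1 for p concentrated on one level.

shannon_sketch <- function(p) {
  M <- length(p)
  if (M < 2) return(0)                    # one-level case treated as homogeneous
  terms <- ifelse(p > 0, p * log(p), 0)   # convention: 0 * log(0) = 0
  1 + sum(terms) / log(M)
}

gini_sketch <- function(p) {
  M <- length(p)
  if (M < 2) return(0)
  # sum of |p_i - p_j| over all ordered pairs, scaled by its maximum 2 * (M - 1)
  sum(abs(outer(p, p, "-"))) / (2 * (M - 1))
}

simpson_sketch <- function(p) {
  M <- length(p)
  if (M < 2) return(0)
  # sum of squared deviations from 1/M, scaled by its maximum (1 - 1/M)
  (M * sum(p^2) - 1) / (M - 1)
}

bray_sketch <- function(p) {
  M <- length(p)
  if (M < 2) return(0)
  # one minus the total of min(p_i, 1/M), scaled by its maximum (1 - 1/M)
  (M / (M - 1)) * (1 - sum(pmin(p, 1 / M)))
}

# Made-up probability vector, intermediate between uniform and concentrated
p <- c(0.5, 0.3, 0.2)
c(bray = bray_sketch(p), gini = gini_sketch(p),
  shannon = shannon_sketch(p), simpson = simpson_sketch(p))
```

Under these assumed normalizations the four functions agree at the two extremes and differ only in between, which is the point made in the paragraph above.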

The *R* procedures **bray.proc**, **gini.proc**, **shannon.proc**, and **simpson.proc** are all available from the *Exploring Data* companion website, each implementing the corresponding interestingness measure. To illustrate the use of these procedures, they are applied here to the UCI Machine Learning Repository Mushroom dataset, which gives 23 categorical characterizations of 8,124 mushroom samples, taken from *The Audubon Society Field Guide to North American Mushrooms* (G.H. Lincoff, Pres., published by Alfred A. Knopf, New York, 1981). A typical characterization is gill color, which exhibits 12 distinct values, each corresponding to a one-character color code (e.g., “p” for pink, “u” for purple, etc.). To evaluate the four interestingness measures for any of these attributes, it is necessary to first compute its associated empirical probability vector. This is easily done with the following *R* function:
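The original function is not reproduced in this copy of the post, so what follows is only a minimal sketch of one way to compute an empirical probability vector. The name `prob_vector` and the sample values are my assumptions, not the author's code or data from the mushroom dataset.

```r
# Hypothetical helper (not the author's original function):
# returns the fraction of observations falling in each distinct observed value.
prob_vector <- function(x) {
  counts <- table(x)                     # frequency of each distinct value
  p <- as.numeric(counts) / sum(counts)  # convert counts to fractions
  names(p) <- names(counts)              # keep the level labels
  p
}

# Made-up one-character codes, in the style of the gill color attribute
gill_color <- c("p", "u", "p", "n", "p", "u")
prob_vector(gill_color)
```

Note that `table` counts only the values it actually sees in a character vector, so the number of levels here is determined by the data alone.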

Given this empirical probability vector, the *R* procedures listed above can be used to compute the corresponding interestingness measures. As a specific example, the following sequence gives the values for the four interestingness measures for the seven-level variable “Habitat” from the mushroom dataset:

These numbers show that the seven habitat levels are not equally represented in this dataset, with the most common level (“d”) occurring about 15 times as often as the rarest level (“w”). The average representation is approximately 1,160 records per level, so the two most populous levels occur much more frequently than average, one level (“p”) occurs with about average frequency, and the other four levels occur anywhere between half as often as average and one tenth as often. For this example, at least, the Gini measure is the most sensitive to these deviations from homogeneity, while the Simpson measure is the least sensitive. These observations raise the following question: to what extent is this behavior typical, and to what extent is it specific to this particular example? The rest of this post examines this question further.

*R* procedures used here, and it is assigned a value of 0: this seems reasonable, representing a classification of “fully homogeneous, evenly spread over the range of possible values,” as it is difficult to imagine declaring a one-level variable to be strongly heterogeneous or “interesting.” Here, this result is a consequence of defining the number of levels for this categorical variable from the data alone, neglecting the possibility of other values that could be present but are not. In particular, if we regard this variable as binary, in agreement with the metadata, all four of the interestingness measures would yield the value 1, corresponding to the fact that the observed values are fully concentrated in one of two possible levels. This example illustrates the difference between *internal categorization* (i.e., determination of the number of levels for a categorical variable from the observed data alone) and *external categorization*, where the number of levels is specified by the metadata. As this example illustrates, the difference can have important consequences for both numerical data characterizations and their interpretation.
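The distinction can be sketched in a few lines of *R*. Everything here is illustrative: the helper names, the level codes “a” and “b”, and the particular Simpson-style normalization are my assumptions rather than the procedures used above, and the data are made up.

```r
# Hypothetical helper: with levels = NULL the levels come from the data
# (internal categorization); otherwise the declared levels are used
# (external categorization), so unobserved levels get probability 0.
prob_vector_levels <- function(x, levels = NULL) {
  if (is.null(levels)) levels <- sort(unique(x))
  counts <- table(factor(x, levels = levels))
  as.numeric(counts) / sum(counts)
}

# Assumed normalized Simpson-style measure: 0 for uniform, 1 for concentrated.
simpson_sketch <- function(p) {
  M <- length(p)
  if (M < 2) return(0)   # one-level special case, assigned 0 as in the text
  (M * sum(p^2) - 1) / (M - 1)
}

x_obs <- rep("a", 10)    # made-up data: only one value is ever observed

# Internal categorization: M = 1, so the measure is 0
simpson_sketch(prob_vector_levels(x_obs))

# External categorization: metadata declares two levels, so p = (1, 0)
# and the measure is 1
simpson_sketch(prob_vector_levels(x_obs, levels = c("a", "b")))
```

The same observations therefore score as fully homogeneous or fully concentrated depending only on where the list of levels comes from.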
