(This article was first published on **ExploringDataBlog**, and kindly contributed to R-bloggers)

Numerically-coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data). In addition, numerically coded values can be effectively categorical, either ordered, or unordered. A specific example that illustrates the range of distributional behavior often seen in a collection of numerical variables is the Boston housing dataframe (**Boston**) from the **MASS** package in *R*. This dataframe includes 14 numerical variables that characterize 506 suburban housing tracts in the Boston area: 12 of these variables have class “numeric” and the remaining two have class “integer”. The integer variable **chas** is in fact a binary flag, taking the value 1 if the tract bounds the Charles river and 0 otherwise, and the integer variable **rad** is described as “an index of accessibility to radial highways,” assuming one of nine values: the integers 1 through 8, and 24. The other 12 variables assume anywhere between 26 unique values (for the zoning variable **zn**) to 504 unique values (for the per capita crime rate **crim**). The figure below shows nonparametric density estimates for four of these variables: the per-capita crime rate (**crim**, upper left plot), the percentage of the population designated “lower status” by the researchers who provided the data (**lstat**, upper right plot), the average number of rooms per dwelling (**rm**, lower left plot), and the zoning variable (**zn**, lower right plot). Comparing the appearances of these density estimates, considerable variability is evident: the distribution of **crim** is very asymmetric with an extremely heavy right tail, the distribution of **lstat** is also clearly asymmetric but far less so, while the distribution of **rm** appears to be almost Gaussian. Finally, the distribution of **zn** appears to be tri-modal, mostly concentrated around zero, but with clear secondary peaks at around 20 and 80.

Each of these four plots also includes some additional information about the corresponding variable: three vertical reference lines at the mean (the solid line) and the mean offset by plus or minus three standard deviations (the dotted lines), and the value of the normalized Shannon entropy, listed in the title of each plot. This normalized entropy value is discussed in detail in Chapter 3 of *Exploring Data in Engineering, the Sciences, and Medicine* and in two of my previous posts (April 3, 2011 and May 21, 2011), and it forms the basis for the spacing measure described below.

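The per-variable facts quoted above (class, number of unique values, and the mean plus or minus three standard deviations used for the reference lines) can be tabulated with a few lines of base *R*. The following is only a minimal sketch: the helper name `summarize_columns` is hypothetical and is not part of the original post or of any package, and the usage step assumes the **MASS** package is installed.

```r
# Hypothetical helper (not from the original post): a per-column screening
# summary for a dataframe whose columns are all numeric or integer.
summarize_columns <- function(df) {
  data.frame(
    variable = names(df),
    class    = vapply(df, function(v) class(v)[1], character(1)),
    n_unique = vapply(df, function(v) length(unique(v)), integer(1)),
    mean     = vapply(df, mean, numeric(1)),
    # Positions of the dotted reference lines: mean -/+ 3 standard deviations
    lower    = vapply(df, function(v) mean(v) - 3 * sd(v), numeric(1)),
    upper    = vapply(df, function(v) mean(v) + 3 * sd(v), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage, assuming MASS (which supplies the Boston dataframe) is available.
if (requireNamespace("MASS", quietly = TRUE)) {
  boston_summary <- summarize_columns(MASS::Boston)
  keep <- boston_summary$variable %in% c("crim", "lstat", "rm", "zn")
  print(boston_summary[keep, ])
}
```

A negative `lower` value for a variable that cannot be negative (a percentage or a rate, for example) is a quick numerical hint that the Gaussian working assumption is a poor fit for that variable.
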
First, however, the reason for including the three vertical reference lines on the density plots is to illustrate that, while popular “Gaussian expectations” for data are approximately met for some numerical variables (the **rm** variable is a case in point here), often these expectations are violated so much that they are useless. Specifically, note that under approximately Gaussian working assumptions, most of the observed values for the data sequence should fall between the two dotted reference lines, which should correspond approximately to the smallest and largest values seen in the dataset. This description is reasonably accurate for the variable **rm**, and the upper limit appears fairly reasonable for the variable **lstat**, but the lower limit is substantially negative here, which is not reasonable for this variable since it is defined as a percentage. These reference lines appear even more divergent from the general shapes of the distributions for the **crim** and **zn** data, where again, the lower reference lines are substantially negative, infeasible values for both of these variables.

The reason the reference values defined by these lines are not particularly representative is the extremely heterogeneous nature of the data distributions, particularly for the variables **crim** – where the distribution exhibits a very long right tail – and **zn** – where the distribution exhibits multiple modes. For categorical variables, distributional heterogeneity can be assessed by measures like the normalized Shannon entropy, which varies between 0 and 1, taking the value zero when all levels of the variable are equally represented, and taking the value 1 when only one of several possible values is present. This measure is easily computed and, while it is intended for use with categorical variables, the procedures used to compute it will return results for numerical variables as well. These values are shown in the figure captions of each of the above four plots, and it is clear from these results that the Shannon measure does not give a reliable indication of distributional heterogeneity here. In particular, note that the Shannon measure for the **crim** variable is zero to three decimal places, suggesting a very homogeneous distribution, while the variables **lstat** and **rm** – both arguably less heterogeneous than **crim** – exhibit slightly larger values of 0.006 and 0.007, respectively. Further, the variable **zn**, whose density estimate resembles that of **crim** more than that of either of the other two variables, exhibits the much larger Shannon entropy value of 0.585.

The basic difficulty here is that all observations of a continuously distributed random variable *should* be unique. The normalized Shannon entropy – along with the other heterogeneity measures discussed in Chapter 3 of *Exploring Data* – effectively treats variables as categorical, returning a value that is computed from the fractions of total observations assigned to each possible value for the variable. Thus, for an ideal continuously-distributed variable, every possible value appears once and only once, so these fractions should be 1/N for each of the N distinct values observed for the variable. This means that the normalized Shannon measure – along with all of the alternative measures just noted – should be identically zero for this case, regardless of whether the continuous distribution in question is Gaussian, Cauchy, Pareto, uniform, or anything else. In fact, the **crim** variable considered here almost meets this ideal requirement: in 506 observations, **crim** exhibits 504 unique values, which is why its normalized Shannon entropy value is zero to three decimal places. In marked contrast, the variable **zn** exhibits only 26 distinct values, meaning that each of these values occurs, on average, just over 19 times. However, this average behavior is not representative of the data in this case, since the smallest possible value (0) occurs 372 times, while the largest possible value (100) occurs only once. It is because of the discrete character of this distribution that the normalized Shannon entropy is much larger here, accurately reflecting the pronounced distributional heterogeneity of this variable.

Taken together, these observations suggest a simple extension of the normalized Shannon entropy that can give us a more adequate characterization of distributional differences for numerical variables. Specifically, the idea is this: begin by dividing the total range of a numerical variable *x* into M equal intervals. Then, count the number of observations that fall into each of these intervals and divide by the total number of observations N to obtain the fraction of observations falling into each group. By doing this, we have effectively converted the original numerical variable into an M-level categorical variable, to which we can apply heterogeneity measures like the normalized Shannon entropy. The four plots below illustrate this basic idea for the four Boston housing variables considered above. Specifically, each plot shows the fraction of observations falling into each of 10 equally spaced intervals, spanning the range from the smallest observed value of the variable to the largest.

As a specific example, consider the results shown in the upper left plot for the variable **crim**, which varies from a minimum of 0.00632 to a maximum of 89.0. Almost 87% of the observations fall into the smallest 10% of this range, from 0.00632 to 8.9, while the next two groups account for almost all of the remaining observations. In fact, none of the other groups (4 through 10) accounts for more than 1% of the observations, and one of these groups – group 7 – is completely empty. Computing the normalized Shannon entropy from this ten-level categorical variable yields 0.767, as indicated in the title of the upper left plot. In contrast, the corresponding plot for the **lstat** variable, shown in the upper right, is much more uniform, with the first five groups exhibiting roughly the same fractional occupation. As a consequence, the normalized Shannon entropy for this grouped variable is much smaller than that for the more heterogeneously distributed **crim** variable: 0.138 versus 0.767. Because the distribution is more sharply peaked for the **rm** variable than for **lstat**, the occupation fractions for the grouped version of this variable (lower left plot) are less homogeneous, and the normalized Shannon entropy is correspondingly larger, at 0.272. Finally, for the **zn** variable (lower right plot), the grouped distribution appears similar to that for the **crim** variable, and the normalized Shannon entropy values are also similar: 0.525 versus 0.767.

The key point here is that, in contrast to the normalized Shannon entropy applied directly to the numerical variables in the **Boston** dataframe, grouping these values into 10 equally-spaced intervals and then computing the normalized Shannon entropy gives a number that seems to be more consistent with the distributional differences between these variables that can be seen clearly in their density plots. Motivation for this numerical measure (i.e., why not just look at the density plots?) comes from the fact that we are sometimes faced with the task of characterizing a new dataset that we have not seen before. While we can – and should – examine graphical representations of these variables, in cases where we have *many* such variables, it is desirable to have a few, easily computed numerical measures to use as screening tools, guiding us in deciding which variables to look at first, and which techniques to apply to them. The spacing measure described here – i.e., the normalized Shannon entropy measure applied to a grouped version of the numerical variable – appears to be a potentially useful measure for this type of preliminary data characterization. For this reason, I am including it – along with a few other numerical characterizations – in the **DataFrameSummary** procedure I am implementing as part of the **ExploringData** package, which I will describe in a later post. Next time, however, I will explore two obvious extensions of the procedure described here: different choices of the heterogeneity measure, and different choices of the number of grouping levels. In particular, as I have shown in previous posts on interestingness measures, the normalized Bray, Gini, and Simpson measures all behave somewhat differently than the Shannon measure considered here, raising the question of which one would be most effective in this application. In addition, the choice of 10 grouping levels considered here was arbitrary, and it is by no means clear that this choice is the best one. In my next post, I will explore how sensitive the Boston housing results are to changes in these two key design parameters.

Finally, it is worth saying something about how the grouping used here was implemented. The *R* code listed below is the function I used to convert a numerical variable *x* into the grouped variable from which I computed the normalized Shannon entropy. The three key components of this function are the **classIntervals** function from the *R* package **classInt** (which must be loaded before use; hence, the “library(classInt)” statement at the beginning of the function), and the **cut** and **table** functions from base *R*. The **classIntervals** function generates a two-element list with components **var**, which contains the original observations, and **brks**, which contains the M+1 boundary values for the M groups to be generated. Note that the **style = "equal"** argument is important here, since we want M equal-width groups. The **cut** function then takes these results and converts them into an M-level categorical variable, assigning each original data value to the interval into which it falls. The **table** function counts the number of times each of the M possible levels occurs for this categorical variable. Dividing this vector by the sum of all entries then gives the fraction of observations falling into each group. Plotting the results obtained from this function and reformatting the results slightly yields the four plots shown in the second figure above, and applying the **shannon.proc** procedure available from the OUP companion website for *Exploring Data* yields the Shannon entropy values listed in the figure titles.

    UniformSpacingFunction <- function(x, nLvls = 10){
      # classInt supplies classIntervals(); the package must be installed
      library(classInt)
      # Compute the nLvls + 1 equal-width boundaries spanning the range of x
      xsum = classIntervals(x, n = nLvls, style = "equal")
      # Assign each observation to its interval (the minimum is included)
      xcut = cut(xsum$var, breaks = xsum$brks, include.lowest = TRUE)
      # Count observations per interval and convert counts to fractions
      xtbl = table(xcut)
      pvec = xtbl/sum(xtbl)
      pvec
    }
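
The **shannon.proc** procedure itself is not reproduced in this post, so the sketch below substitutes an assumed definition of the normalized Shannon measure, 1 - H/log(M), where H is the Shannon entropy of the M occupation fractions. This form is chosen only because it matches the behavior described above (zero when all levels are equally represented, one when a single level holds every observation), and it may differ in detail from the author's procedure. The names `normalized_shannon` and `equal_width_fractions` are hypothetical, and the second function is a base-*R* stand-in for the grouping step that assumes *x* has no missing values and a nonzero range.

```r
# Assumed definition (not the author's shannon.proc): 1 - H / log(M), where
# H is the Shannon entropy of the M fractions and 0 * log(0) is taken as 0.
normalized_shannon <- function(pvec) {
  p <- as.numeric(pvec)
  M <- length(p)                 # number of levels, including empty ones
  if (M < 2) return(NA_real_)    # undefined for a single level
  pos <- p[p > 0]                # empty levels contribute nothing to H
  H <- -sum(pos * log(pos))
  1 - H / log(M)
}

# Base-R stand-in for the equal-width grouping (no classInt required).
equal_width_fractions <- function(x, n_levels = 10) {
  stopifnot(max(x) > min(x))     # a constant x has no intervals to form
  brks <- seq(min(x), max(x), length.out = n_levels + 1)
  counts <- table(cut(x, breaks = brks, include.lowest = TRUE))
  counts / sum(counts)
}

# Usage on synthetic data: a long-tailed sample versus a uniform one.
set.seed(1)
long_tailed <- rexp(500)^2
uniform     <- runif(500)
normalized_shannon(equal_width_fractions(long_tailed))
normalized_shannon(equal_width_fractions(uniform))
```

Counting empty intervals in M matters here: an interval such as group 7 of the grouped **crim** variable holds no observations but is still one of the ten levels, so dropping it from M would change the normalization.
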
