by Antony Unwin, University of Augsburg, Germany
There are many different methods for identifying outliers and a lot of them are available in R. But are outliers a matter of opinion? Do all methods give the same results?
Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with `real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?
The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the OutliersO3 package and was presented at last year’s useR! in Brussels. Six methods from other R packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).
The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.
Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.
Trying another method with tolerance level=0.05 (mvBACON from robustX) identifies 5 outliers, all ones found for more than one variable combination by HDoutliers. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where HDoutliers finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?
There are four other methods available in OutliersO3 and using all six methods on stackloss a tolerance level of 0.05 identifies the following numbers of outliers:
## HDo PCS BAC adjOut DDC MCD ## 14 4 5 0 6 5
Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with R, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. OutliersO3 transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why
adjOutlyingness finds few or no outliers (results of this method are mildly random). The default value, according to
adjOutlyingness’s help page, is an alpha of 0.25.
The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?
There are other outlier methods available in R and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.
[Find the R code for generating the above plots here: OutliersUnwin.Rmd]