Compare outlier detection methods with the OutliersO3 package


by Antony Unwin, University of Augsburg, Germany

There are many different methods for identifying outliers and a lot of them are available in R. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes on data simulated according to theory, better still on ‘real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the OutliersO3 package and was presented at last year’s useR! in Brussels. Six methods from other R packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).
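
For anyone who wants to follow along, the package is on CRAN. A minimal setup sketch (the plots in this post come from the OutliersUnwin.Rmd file linked at the end):

install.packages("OutliersO3")   # install once from CRAN
library(OutliersO3)
browseVignettes("OutliersO3")    # the vignettes explain the plots in detail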

[Figure 1: An O3 plot of the stackloss dataset. There is one row for each variable combination (defined by the columns on the left) for which outliers were found, and one column for each case identified as an outlier (the columns on the right).]

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.
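
A plot like the one above can be produced along the following lines. This is a sketch based on the package vignette: the object names are mine, and the O3prep arguments (method, tols, boxplotLimits) may differ slightly between package versions.

library(OutliersO3)
# Prepare the outlier analysis for Wilkinson's method ("HDo" = HDoutliers)
a0 <- O3prep(stackloss, method = "HDo", tols = 0.05, boxplotLimits = 6)
# Draw the single-method O3 plot (a ggplot object)
a1 <- O3plotT(a0)
a1$gO3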

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, across various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 variable combinations because no outliers are found for them. If a tolerance level of 0.05 seems a little lax, using 0.01 finds no outliers at all for any variable combination.
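
That sensitivity to the tolerance level is easy to check by rerunning the analysis at the stricter setting (again a sketch; how O3plotT behaves when nothing at all is flagged may vary by version):

# At a tolerance level of 0.01, HDoutliers flags no cases for any combination
s0 <- O3prep(stackloss, method = "HDo", tols = 0.01, boxplotLimits = 6)
s1 <- O3plotT(s0)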

[Figure 2: An O3 plot comparing outliers identified by HDoutliers and mvBACON in the stackloss dataset.]

Trying another method at a tolerance level of 0.05 (mvBACON from the robustX package) identifies 5 outliers, all of them cases that HDoutliers found for more than one variable combination. However, mvBACON finds no outliers for the whole dataset, and only one of the three variable combinations for which it finds outliers is a combination where HDoutliers finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?
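
In OutliersO3 such a pairwise comparison is drawn with O3plotM, the multi-method version of the plot. A sketch with my own object names, using the package’s codes "HDo" and "BAC" for HDoutliers and mvBACON:

# Compare two methods at the same tolerance level
p0 <- O3prep(stackloss, method = c("HDo", "BAC"), tols = 0.05, boxplotLimits = 6)
p1 <- O3plotM(p0)
p1$gO3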

There are four other methods available in OutliersO3. Using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:

##    HDo    PCS    BAC adjOut    DDC    MCD 
##     14      4      5      0      6      5
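
Counts like these can be produced along the following lines (a sketch following the vignette; I am assuming the counts are returned in a component called nOut, which is how they are labelled above):

m0 <- O3prep(stackloss,
             method = c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD"),
             tols = 0.05, boxplotLimits = 6)
m1 <- O3plotM(m0)
m1$nOut   # named vector of outlier counts per method
m1$gO3    # the six-method O3 plot shown below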
[Figure 3: An O3 plot of stackloss using the methods HDoutliers, FastPCS, mvBACON, adjOutlyingness, DetectDeviatingCells, and covMcd. The darker the cell, the more methods agree. If all the methods agree, the cell is coloured red; if all but one agree, orange. No case is identified by all the methods as an outlier for any combination of variables when the tolerance level is set at 0.05 for all.]

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with R, you start wishing for a little more consistency, even at the expense of a little individuality. OutliersO3 transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why adjOutlyingness finds few or no outliers (the results of this method are mildly random). The default value, according to adjOutlyingness’s help page, is an alpha of 0.25.
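
One way to probe this would be to give each method its own tolerance level. The sketch below assumes that O3prep accepts tols as a vector matched to the methods, which you should check against ?O3prep in your installed version:

# Assumption: one tolerance per method, in the same order as 'method';
# adjOutlyingness ("adjOut") gets the 0.25 default from its help page
t0 <- O3prep(stackloss,
             method = c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD"),
             tols = c(0.05, 0.05, 0.05, 0.25, 0.05, 0.05),
             boxplotLimits = 6)
t1 <- O3plotM(t0)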

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in R and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]
