(By Andrea Venturini)
Imagine you have many time series, possibly short ones, covering many different measures, and very little time to find outliers. You need something not too sophisticated that sorts out the mess quickly. This is, in brief, the typical situation in which you can use the washer.AV() function in R. The linked document (washer) contains the function and an example of an actual application in R: a data.frame (dati) with temperature and rain measurements (phen), their values (value), 4 time periods (time) and 20 geographical zones (zone), for a total of 20*4*2=160 arbitrary observations.
    phen        time zone value
1   Temperature    1  a01   2.0
2   Temperature    1  a02  20.0
...
160 Rain           4  a20   8.5
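A data frame with this layout can be built as in the minimal sketch below. The values are random placeholders, not the measurements of the linked example; only the column names and the row ordering (zone varying fastest, then time, then phen) follow the listing above.

```r
# Build a long-format data.frame with the same layout as `dati`.
# The values are random placeholders (assumption), not the real measurements.
set.seed(1)

zones <- sprintf("a%02d", 1:20)        # "a01" ... "a20"

dati <- expand.grid(
  zone = zones,                        # varies fastest
  time = 1:4,
  phen = c("Temperature", "Rain"),     # varies slowest
  stringsAsFactors = FALSE
)
dati$value <- round(runif(nrow(dati), min = 0, max = 25), 1)

# Reorder the columns to match the listing: phen, time, zone, value
dati <- dati[, c("phen", "time", "zone", "value")]

head(dati, 2)
nrow(dati)   # 20 zones * 4 times * 2 phenomena
```

The long format (one row per phenomenon, time and zone) keeps any number of measures in a single object, which suits the "many series, many measures" situation described above.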
The example of 20 meteorological stations measuring rainfall and temperature helps to show when the washer() methodology applies. The methodology considers only 3 consecutive observations at a time across a group of time series, for instance all 20 triplets between times 2 and 4. If their shapes are similar to one another, no outlier is detected; otherwise, as happens to the orange time series in the Rain graph above (at times 2, 3 and 4), a non-parametric test (the Sprent test) flushes out the outlier. Look at the graphs above: while the dynamics of temperature are fairly linear, rain has a more fluctuating behaviour. A markedly different shape, in the sense of how far the 3 points depart from linearity, is a strong hint that an outlier is present. Let’s look at the washer output:
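One simple way to quantify "distance from linearity" for three points is the second difference 2*y2 - y1 - y3, which is zero when the points lie on a straight line. This is an illustrative sketch only: the AV index used by washer is defined in the paper and is not necessarily this quantity, and the function name and sample values here are hypothetical.

```r
# Deviation from linearity of a triplet (y1, y2, y3).
# Hypothetical helper for illustration; not the washer AV index itself.
second_diff <- function(y1, y2, y3) 2 * y2 - y1 - y3

second_diff(10, 12, 14)   # points on a straight line: 0
second_diff(10, 12, 20)   # strong growth, modest bend: -6
second_diff(10, 20, 11)   # spike in the middle: 19
```

The second call shows that a steep trend alone does not produce a large value; it is the bend in the middle point that matters.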
 phenomenon: 1
 phenomenon: 2
   fen         t.2 series  y.1  y.2  y.3 test.AV    AV  n median.AV mad.AV madindex.AV
18 Rain          2    a18  5.5  6.3 17.0    5.43 -22.2 20     7.580   5.49       36.58
38 Rain          3    a18  6.3 17.0  5.9   24.25  47.2 20    -4.978   2.15       14.34
59 Temperature   2    a19 22.0 21.0  9.0    5.25  10.7 20     0.000   2.04       13.63
79 Temperature   3    a19 21.0  9.0 18.0   14.92 -21.2 20    -0.917   1.36        9.07
The Sprent test identifies an outlier if test.AV is greater than 5. In the output, t.2 is the time of the second observation; series identifies the time series; y.i (i=1,2,3) are the three observations; AV is an index that summarizes the shape of the 3 observations (the median and MAD of AV within the group are reported in median.AV and mad.AV); n is the group cardinality; madindex.AV is an attempt to indicate whether the shape behaviour inside the group is broadly the same or completely random (see below for insights). In the rainfall example the anomalous observation is the value 17 at time 3, which is recognized with test.AV=24.25; the preceding triplet (times 1, 2, 3) also gives a hint of anomaly, though a weaker one (test.AV=5.43). It is important to understand that even though the trend of those 3 observations is strongly growing, their shape, in the sense of distance from linearity, is much less anomalous at t.2=2 than at t.2=3.
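The printed columns are consistent with test.AV being the absolute deviation of AV from the group median, divided by the group MAD. The sketch below recomputes it from the rounded values in the output above. The helper name is hypothetical, and small differences in the second decimal come from the rounding of the printed inputs.

```r
# Recompute the test statistic from the printed (rounded) columns.
# `sprent_stat` is a hypothetical helper, not part of washer.
sprent_stat <- function(av, med, mad) abs(av - med) / mad

out <- data.frame(
  fen       = c("Rain", "Rain", "Temperature", "Temperature"),
  t.2       = c(2, 3, 2, 3),
  AV        = c(-22.2, 47.2, 10.7, -21.2),
  median.AV = c(7.580, -4.978, 0.000, -0.917),
  mad.AV    = c(5.49, 2.15, 2.04, 1.36),
  test.AV   = c(5.43, 24.25, 5.25, 14.92)
)

out$recomputed <- sprent_stat(out$AV, out$median.AV, out$mad.AV)
out$flagged    <- out$recomputed > 5   # threshold stated in the text

out[, c("fen", "t.2", "test.AV", "recomputed", "flagged")]
```

On a full group the median and MAD would be computed from all n values of AV, for example with median() and mad(). Note that R's mad() multiplies by a consistency constant of 1.4826 by default; whether washer relies on that default is not stated here.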
So, in conclusion:
1. You need a group of more than 10 time series.
2. You need at least 3 observations in the time domain.
3. The time series must not have completely random trajectories; they should behave similarly, in the sense shown in the example above.
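The first two conditions can be checked mechanically before running the analysis. The sketch below is a hypothetical helper, not part of washer, and assumes the column names phen, time and zone used in the example data; the third condition (similar behaviour across series) is a matter of judgement and is not checked.

```r
# Hypothetical pre-check of conditions 1 and 2 for each phenomenon.
# Assumes a long-format data.frame with columns phen, time, zone.
check_washer_input <- function(dati) {
  by_phen <- split(dati, dati$phen)
  res <- data.frame(
    phen     = names(by_phen),
    n_series = vapply(by_phen, function(d) length(unique(d$zone)), integer(1)),
    n_times  = vapply(by_phen, function(d) length(unique(d$time)), integer(1)),
    row.names = NULL
  )
  # More than 10 series and at least 3 observations in time
  res$ok <- res$n_series > 10 & res$n_times >= 3
  res
}

# Toy data with the same structure as the example (placeholder values)
toy <- expand.grid(
  zone = sprintf("a%02d", 1:20),
  time = 1:4,
  phen = c("Temperature", "Rain"),
  stringsAsFactors = FALSE
)
toy$value <- 0   # only the structure matters for this check

check_washer_input(toy)
```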
The methodology is explained in more detail here: Andrea Venturini. The paper, “Time series outlier detection: a new non parametric methodology (washer)” (Statistica, University of Bologna, 2011, Vol. 71, pp. 329-344), can be downloaded here: Time series outlier detection: a new non parametric methodology (washer).