Outliers are the extreme values of a variable. Depending on the model or requirement, it may be necessary to treat them, either by transforming them or by deleting them.
Variable “Income” distribution
This is the main variable in this example; it represents each customer's income in dollars. There are a few cases with very high values, while most cases have low or mid values.
If we choose to delete them…
A common question is: “How many cases do we have to leave out?” If we choose to leave out the highest 1%, we obtain:
The distribution now looks very similar to the previous one, except that it reaches $300,000 instead of $500,000.
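To make the step concrete, here is a minimal sketch of dropping the highest 1% of cases. It is written in Python with the standard library only, and it runs on synthetic log-normal data rather than the article's data set. The helper name `drop_top_fraction` and the distribution parameters are assumptions made for illustration.

```python
import random


def drop_top_fraction(values, fraction=0.01):
    """Return a sorted copy of `values` without its highest `fraction` of cases."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * fraction)  # number of cases to leave out
    return ordered[:len(ordered) - n_drop]


# Synthetic, right-skewed "income" values (an assumption, not the article's data).
random.seed(42)
income = [random.lognormvariate(10, 1) for _ in range(10_000)]

trimmed = drop_top_fraction(income, 0.01)
print(f"cases kept: {len(trimmed)} of {len(income)}")
print(f"max before: {max(income):,.0f}  max after: {max(trimmed):,.0f}")
```

The maximum drops sharply after trimming, while 99% of the cases are untouched.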
If we repeat this process iteratively (deleting the highest 1%, then deleting the highest 1% of that result, and so on, 10 times in total), we are in effect trying different cut-off values for leaving out extreme values. The result is curious: the silhouette always remains similar to:
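The iterative process can be sketched in the same way. The example below again uses synthetic log-normal data and a hypothetical helper, `drop_top_fraction`, so it is not the author's implementation. It trims the highest 1% ten times and prints the maximum, mean and median after each round. The mean staying above the median is used here as a rough sign that the right-skewed shape persists.

```python
import random
import statistics


def drop_top_fraction(values, fraction=0.01):
    """Return a sorted copy of `values` without its highest `fraction` of cases."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * fraction)
    return ordered[:len(ordered) - n_drop]


# Synthetic, right-skewed "income" values (an assumption, not the article's data).
random.seed(42)
income = [random.lognormvariate(10, 1) for _ in range(10_000)]

current = income
history = []  # one tuple per round: (round, cases, max, mean, median)
for round_no in range(1, 11):
    # Each round removes the highest 1% of whatever is left from the previous round.
    current = drop_top_fraction(current, 0.01)
    history.append((round_no, len(current), max(current),
                    statistics.mean(current), statistics.median(current)))

for round_no, n, top, mean, median in history:
    print(f"round {round_no:2d}: n={n:5d}  max={top:>10,.0f}  "
          f"mean={mean:>9,.0f}  median={median:>9,.0f}")
```

Each round lowers the cut-off, but the mean remains above the median, which matches the observation that the silhouette keeps its shape.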
Animating the example
The following animation shows this iterative deletion process in action:
As we leave out the highest 1%, the silhouette keeps a similar shape:
In other words, there are always many people with low or mid income and only a few cases with high income, because of the nature of the distribution.
Axis values change with each iteration.
If we replace the histogram with a density plot, the result looks more like zooming in on the left side of the data:
When we delete the lowest or highest values of any variable, we are effectively zooming in on the area where most cases are.
In this particular case, we could choose to leave out the highest 0.5% or 1% of the data. However, deleting all outliers is not always recommended: sometimes they carry valuable information, such as fraud, a machine failure, or any other event that deserves further inspection.
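One way to act on this caution is to set the extreme cases aside for inspection instead of discarding them. The sketch below is only an illustration on synthetic data. The helper `split_top_fraction` is a hypothetical name, and the 0.5% cut-off simply mirrors the figure mentioned above.

```python
import random


def split_top_fraction(values, fraction=0.005):
    """Split `values` into (kept, flagged); `flagged` holds the highest `fraction`."""
    ordered = sorted(values)
    n_flag = int(len(ordered) * fraction)
    cut = len(ordered) - n_flag
    return ordered[:cut], ordered[cut:]


# Synthetic, right-skewed "income" values (an assumption, not the article's data).
random.seed(42)
income = [random.lognormvariate(10, 1) for _ in range(10_000)]

kept, flagged = split_top_fraction(income, 0.005)
print(f"kept for modelling: {len(kept)} cases, up to {max(kept):,.0f}")
print(f"set aside for review: {len(flagged)} cases, from {min(flagged):,.0f}")
```

The flagged cases stay available for a separate review, for example a fraud check, while the model is trained on the remaining values.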
Made by Pablo C. from Data Science Heroes.
R code and data are available on GitHub.