Dynamic analysis on outliers

Pablo C.

7 years ago

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

< !DOCTYPE html>

< http-equiv="Content-Type" content="text/html; charset=utf-8">

Treating outliers

Introduction

Outliers are the extreme values that a variable has, depending on the model or requirement, it could be necessary to treat them, either transforming or deleting.

Variable “Income” distribution

This is going to be our main variable in this example, which represents customer's income in $. We can observe how there are a few cases with very high values, while on the other hand, there are lots of cases with low/mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?”, we can choose to leave out highest 1%, so we will obtain:

Now the distribution looks very similar to last one, except now it reaches $300.000 instead of $500.000.

If we do this process iteratively -deleting highest 1%, and then to that result, we delete again highest 1%, and so on, repeating this process 10 times- we're analyzing different cut-off values in order to leave out extreme values. We obtain a curious result, silhouette remains always similar to:

Animating the example

The following animation shows in action this iterative deleting process: As we leave out the highest 1%, silhouette keeps a similar aspect to:

In other words, there are always lots of people with low/mid income, and just a few number of cases with high income -because of distribution nature-. Axis values change within each iteration.

If we change the histogram plot, by a density one, the result is more similar to zoom on the data left side:

When we delete the lowest or highest values of any variable, what we are doing is a “zoom” to the area where most cases are.

Final thoughts

In this particular case, we could choose to leave out highest 0.5 or 1% of data. However it is not always recommended to delete all outliers, sometimes they represent valuable information such as fraud or a machine failure, or any other event which deserves further inspection.

Contact

Made by Pablo C. from Data Science Heroes.

R code and data available on github

Outliers issues and data treatment are topics of the e-learning course: Data Science with R (just email for a free demo:info@datascienceheroes.com)

To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.