Dynamic analysis on outliers

April 24, 2015
By

(This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers)




Treating outliers

Introduction

Outliers are the extreme values that a variable has, depending on the model or requirement, it could be necessary to treat them, either transforming or deleting.

Variable “Income” distribution

01_income_distribution

This is going to be our main variable in this example, which represents customer's income in $. We can observe how there are a few cases with very high values, while on the other hand, there are lots of cases with low/mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?”, we can choose to leave out highest 1%, so we will obtain:

02_income_p99.JPG

Now the distribution looks very similar to last one, except now it reaches $300.000 instead of $500.000.

If we do this process iteratively -deleting highest 1%, and then to that result, we delete again highest 1%, and so on, repeating this process 10 times- we're analyzing different cut-off values in order to leave out extreme values. We obtain a curious result, silhouette remains always similar to:

03_density_simple.JPG

Animating the example

The following animation shows in action this iterative deleting process:
As we leave out the highest 1%, silhouette keeps a similar aspect to:

04_fractal_outliers.gif

In other words, there are always lots of people with low/mid income, and just a few number of cases with high income -because of distribution nature-.
Axis values change within each iteration.

If we change the histogram plot, by a density one, the result is more similar to zoom on the data left side:

05_fractal_outliers_density.gif

When we delete the lowest or highest values of any variable, what we are doing is a “zoom” to the area where most cases are.

Final thoughts

In this particular case, we could choose to leave out highest 0.5 or 1% of data. However it is not always recommended to delete all outliers, sometimes they represent valuable information such as fraud or a machine failure, or any other event which deserves further inspection.

Contact

Made by Pablo C. from Data Science Heroes.

R code and data available on github

Outliers issues and data treatment are topics of the e-learning course: Data Science with R (just email for a free demo:[email protected])

To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)