Descriptive Analytics - Part 3: Outlier Treatment

[This article was first published on R-exercises, and kindly contributed to R-bloggers.]

Descriptive analytics is the examination of data or content, usually performed manually, to answer the question “What happened?”.

To solve this set of exercises you should have solved part 0, part 1, and part 2 of this series. You should also run this script, which contains some more data cleaning. In case you haven’t solved the earlier parts, run this script on your machine; it contains the lines of code we used to modify our data set.

This is the fourth set in a series of exercises that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

Outlier treatment is a vital part of descriptive analytics, since outliers can lead to misleading conclusions about our data, so it is an important skill to have. The following exercises demonstrate some basic and fairly simplistic methods of treating outliers. For more sophisticated methods of dealing with outliers, check out this. But keep in mind that many people claim that ‘eyes beat maths’ when it comes to outliers.

Before proceeding, it might be helpful to look over the help pages for table, subset, boxplot.stats, %in%, ifelse, rp.outlier, and scores.

For this set of exercises you will need to install and load the packages rapportools and outliers.


Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Print the summary statistics and the structure of the data set to see the types of the variables and their extreme values, and judge whether those values make sense.

Exercise 2
For categorical variables, outliers are considered to be the values whose relative frequency is less than 10%. Make a barplot of flights$UniqueCarrier and of flights$CancellationCode. What do you think? There are more categorical variables, so I encourage you to try them out as well.

Exercise 3
Remove the outliers that you noticed in the barplots of the previous exercise. Consider using the subset function.
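
As a rough sketch of the idea behind Exercises 2 and 3 (not the official solution), the snippet below applies the 10% rule to a tiny made-up data frame. The name flights_toy and its values are invented for illustration; on the real data you would work with flights directly.

```r
# Toy stand-in for the flights data (hypothetical values, for illustration only).
flights_toy <- data.frame(
  UniqueCarrier = c(rep("WN", 12), rep("AA", 7), "AQ"),
  stringsAsFactors = FALSE
)

# Relative frequency of each carrier.
carrier_share <- prop.table(table(flights_toy$UniqueCarrier))

# Categories whose share falls below the 10% threshold.
rare_carriers <- names(carrier_share)[carrier_share < 0.10]

# Keep only the rows whose carrier is not in the rare set.
flights_kept <- subset(flights_toy, !(UniqueCarrier %in% rare_carriers))
```

Computing the shares with prop.table first makes the threshold explicit, instead of reading it off the barplot by eye.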

Exercise 4
A good way of detecting outliers in numerical variables is a boxplot. Make one with flights$ActualElapsedTime.

Exercise 5
Remove the outliers of flights$ActualElapsedTime using boxplot.stats.
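
One possible way to approach Exercise 5, sketched on made-up numbers: boxplot.stats returns the points that lie beyond the whiskers in its $out component, and those can be filtered away with %in%. The data frame flights_toy is a hypothetical stand-in for flights.

```r
# Toy stand-in for the flights data (hypothetical values, for illustration only).
flights_toy <- data.frame(
  ActualElapsedTime = c(60, 65, 70, 72, 75, 78, 80, 85, 90, 600)
)

# Values beyond the boxplot whiskers (default coef = 1.5).
elapsed_outliers <- boxplot.stats(flights_toy$ActualElapsedTime)$out

# Drop the rows whose elapsed time is one of those values.
flights_kept <- subset(flights_toy, !(ActualElapsedTime %in% elapsed_outliers))
```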

Exercise 6
Remove outliers from flights using the subset function, keeping only the rows where TaxiIn is greater than 0 and less than 120.

Exercise 7
Remove outliers from flights using the subset function, keeping only the rows where TaxiOut is greater than 0 and less than 50.
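
Exercises 6 and 7 both come down to a range filter. Here is a minimal sketch on invented values, assuming the intent is to keep the rows inside the stated ranges; flights_toy is a hypothetical stand-in for flights.

```r
# Toy stand-in for the flights data (hypothetical values, for illustration only).
flights_toy <- data.frame(
  TaxiIn  = c(0, 4, 7, 150, 12, 9),
  TaxiOut = c(10, 0, 55, 20, 15, 49)
)

# Keep rows with 0 < TaxiIn < 120.
flights_kept <- subset(flights_toy, TaxiIn > 0 & TaxiIn < 120)

# Then keep rows with 0 < TaxiOut < 50.
flights_kept <- subset(flights_kept, TaxiOut > 0 & TaxiOut < 50)
```

Note that subset treats a missing condition as FALSE, so rows with NA in the filtered column are dropped as well.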

Exercise 8
Assign NA to every value of flights_exp$ArrDelay that is an outlier, using the ifelse function.
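
The exercise does not say which rule defines an outlier here, so the sketch below assumes the boxplot rule from boxplot.stats. The data frame flights_exp_toy and its values are made up for illustration.

```r
# Toy stand-in for flights_exp (hypothetical values, for illustration only).
flights_exp_toy <- data.frame(
  ArrDelay = c(-5, 0, 3, 8, 10, 12, 15, 20, 25, 400)
)

# Assumption: an outlier is a value beyond the boxplot whiskers.
delay_outliers <- boxplot.stats(flights_exp_toy$ArrDelay)$out

# Replace outliers with NA and leave every other value unchanged.
flights_exp_toy$ArrDelay <- ifelse(
  flights_exp_toy$ArrDelay %in% delay_outliers,
  NA,
  flights_exp_toy$ArrDelay
)
```

Unlike the subset approach, this keeps every row, which can matter when the other columns of those rows are still useful.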

Exercise 9
Use rp.outlier from the rapportools package to detect the outliers of flights_exp$Distance with the Lund test, and then remove them.

Exercise 10
Find the 2% most extreme values of flights$CRSElapsedTime using the scores function with the chi-square method.
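
One reading of Exercise 10, sketched in base R on made-up numbers: score each value by its squared deviation from the mean divided by the variance, then flag the values whose score exceeds the 0.98 quantile of a chi-square distribution with one degree of freedom. To my understanding this is what scores(x, type = "chisq", prob = 0.98) from the outliers package returns, but treat that equivalence as an assumption and check the help page. The vector crs_elapsed is a hypothetical stand-in for flights$CRSElapsedTime.

```r
# Toy stand-in for flights$CRSElapsedTime (hypothetical values).
crs_elapsed <- c(60, 62, 65, 68, 70, 72, 75, 78, 80, 300)

# Chi-square style score: squared deviation from the mean over the variance.
chisq_score <- (crs_elapsed - mean(crs_elapsed))^2 / var(crs_elapsed)

# Flag values whose score lies beyond the 98th percentile of chi-square(1).
is_extreme <- pchisq(chisq_score, df = 1) > 0.98
extreme_values <- crs_elapsed[is_extreme]

# Optional cross-check with the outliers package, if it is installed
# (assumed to give the same logical vector as is_extreme).
if (requireNamespace("outliers", quietly = TRUE)) {
  is_extreme_pkg <- outliers::scores(crs_elapsed, type = "chisq", prob = 0.98)
}
```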
