Descriptive Analytics, Part 3: Outlier Treatment
Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.
To solve this set of exercises you should have completed Part 0, Part 1, and Part 2 of this series, and you should also run this script on your machine. It contains some additional data cleaning: the lines of code we used to modify our data set. This is the fourth set in a series of exercises that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.
Outlier treatment is a vital part of descriptive analytics, since outliers can lead to misleading conclusions about the data, so it is an important skill to have. The following exercises demonstrate some basic and fairly simplistic methods of treating outliers. For more sophisticated methods of dealing with outliers, check out this. But keep in mind that many people claim that ‘eyes beat maths’ when it comes to outliers.
Before proceeding, it might be helpful to look over the help pages for table, subset, boxplot.stats, %in%, ifelse, rp.outlier, and scores.
For this set of exercises you will need to install and load the packages rapportools and outliers.
install.packages('rapportools')
library(rapportools)
install.packages('outliers')
library(outliers)
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Print the summary statistics and the structure of the data set to see the types of the variables and their extreme values, and judge whether those values make sense.
Exercise 2
When it comes to categorical variables, values whose relative frequency is less than 10% are considered outliers. Make a barplot of flights$UniqueCarrier and of flights$CancellationCode. What do you think? There are more categorical variables, so I encourage you to try them out as well.
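As a rough illustration of the 10% rule, the sketch below computes relative frequencies with table and prop.table and marks the threshold on a barplot. The vector carrier and its counts are made-up stand-ins for a column such as flights$UniqueCarrier, not values from the 2008 data set.

```r
# Toy stand-in for a categorical column (hypothetical counts)
carrier <- c(rep("AA", 50), rep("WN", 45), rep("HA", 5))

# Relative frequency of each level
share <- prop.table(table(carrier))

# Levels whose share falls below the 10% threshold
rare_levels <- names(share)[share < 0.10]
print(rare_levels)

# Bar chart of the shares, with the 10% threshold as a dashed line
barplot(share, ylab = "Relative frequency")
abline(h = 0.10, lty = 2)
```

Plotting shares rather than raw counts makes the threshold line directly comparable across variables with different numbers of rows.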
Exercise 3
Remove the outliers that you noticed in the barplots of the previous exercise. Consider using the subset function.
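One possible shape for this step, shown on toy data: find the rare levels, then keep only the rows whose level is not among them with subset and %in%. The data frame flights_toy and its columns are hypothetical stand-ins, and the 10% cut-off is the rule of thumb from the previous exercise.

```r
# Toy data frame standing in for flights (hypothetical values)
flights_toy <- data.frame(
  UniqueCarrier = c(rep("AA", 50), rep("WN", 45), rep("HA", 5)),
  DepDelay = seq_len(100),
  stringsAsFactors = TRUE
)

# Levels with a relative frequency below 10%
share <- prop.table(table(flights_toy$UniqueCarrier))
rare_levels <- names(share)[share < 0.10]

# Keep only the rows whose carrier is not a rare level
flights_clean <- subset(flights_toy, !(UniqueCarrier %in% rare_levels))

# Drop the now-empty factor levels so later tables and plots ignore them
flights_clean$UniqueCarrier <- droplevels(flights_clean$UniqueCarrier)
```

Without droplevels, the removed carrier would still show up as a zero-height bar in a later barplot.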
Exercise 4
A good way to detect outliers in numerical variables is a boxplot. Make one for flights$ActualElapsedTime.
Exercise 5
Remove the outliers of flights$ActualElapsedTime using boxplot.stats.
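A minimal sketch of the boxplot approach on a toy vector: boxplot.stats returns the points beyond the whiskers in its out component, and those values can then be filtered away with %in%. The vector elapsed is invented for illustration. On a data frame the same logical index would be applied to the rows instead.

```r
# Toy stand-in for flights$ActualElapsedTime (hypothetical minutes)
elapsed <- c(95, 100, 102, 105, 110, 112, 115, 120, 125, 600)

# Visual check: points drawn beyond the whiskers are the candidates
boxplot(elapsed, horizontal = TRUE)

# The same points, extracted numerically (default whisker coefficient 1.5)
out_vals <- boxplot.stats(elapsed)$out
print(out_vals)

# Keep everything that is not among the flagged values
elapsed_clean <- elapsed[!(elapsed %in% out_vals)]
```

The coef argument of boxplot.stats controls how far the whiskers extend, so a larger value flags fewer points.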
Exercise 6
Remove outliers from flights using the subset function, keeping the rows where TaxiIn is greater than 0 and less than 120.
Exercise 7
Remove outliers from flights using the subset function, keeping the rows where TaxiOut is greater than 0 and less than 50.
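The two range filters above can be sketched on toy data as follows. The data frame flights_toy and its values are hypothetical. Note that subset treats a missing condition as false, so rows with NA in the filtered column are dropped as well.

```r
# Toy data frame standing in for flights (hypothetical values)
flights_toy <- data.frame(
  TaxiIn  = c(0, 4, 7, 12, 150, NA),
  TaxiOut = c(10, 0, 15, 49, 20, 30)
)

# Keep rows with 0 < TaxiIn < 120 (the NA row is dropped too)
kept_in <- subset(flights_toy, TaxiIn > 0 & TaxiIn < 120)

# Then keep rows with 0 < TaxiOut < 50
kept_both <- subset(kept_in, TaxiOut > 0 & TaxiOut < 50)
print(kept_both)
```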
Exercise 8
Using the ifelse function, assign NA to each value of flights_exp$ArrDelay that is an outlier.
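A small sketch of replacing outliers with NA instead of dropping rows. The exercise does not say how an outlier is defined, so this example assumes the boxplot rule from boxplot.stats. The vector arr_delay is a made-up stand-in for flights_exp$ArrDelay.

```r
# Toy stand-in for flights_exp$ArrDelay (hypothetical minutes)
arr_delay <- c(-5, 0, 3, 8, 10, 12, 15, 20, 25, 400)

# Assumption: "outlier" means a point beyond the boxplot whiskers
out_vals <- boxplot.stats(arr_delay)$out

# Replace flagged values with NA, keep the rest unchanged
arr_delay_na <- ifelse(arr_delay %in% out_vals, NA, arr_delay)
print(arr_delay_na)
```

Setting values to NA keeps the row count intact, which is convenient when other columns of the same rows are still trustworthy.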
Exercise 9
Use rp.outlier from the rapportools package to detect and remove the outliers of flights_exp$Distance using the Lund test.
Exercise 10
Find the 2% most extreme values of flights$CRSElapsedTime using the scores function with the chi-square method.
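To give an idea of what a chi-square score looks like, here is a base R sketch that does not call the outliers package. It computes the squared deviation from the mean divided by the variance and compares it with the 0.98 quantile of a chi-square distribution with one degree of freedom. That this matches scores with type = "chisq" and prob = 0.98 is an assumption, so check the help page of scores before relying on it. The vector crs_elapsed is invented for illustration.

```r
# Toy stand-in for flights$CRSElapsedTime (hypothetical minutes)
crs_elapsed <- c(seq(60, 180, by = 5), 600)

# Chi-square style score: squared deviation from the mean over the variance
chisq_score <- (crs_elapsed - mean(crs_elapsed))^2 / var(crs_elapsed)

# Flag scores beyond the 98th percentile of chi-square with 1 df
is_extreme <- chisq_score > qchisq(0.98, df = 1)

extreme_vals <- crs_elapsed[is_extreme]
print(extreme_vals)
```

A threshold of this kind flags values that are extreme relative to the reference distribution, so the flagged share of rows need not be exactly 2%.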