# Descriptive Analytics Part 3: Outlier Treatment

This article was first published on **R-exercises**, and kindly contributed to R-bloggers.

Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.

To solve this set of exercises you should have completed part 0, part 1, and part 2 of this series. You should also run the accompanying script on your machine, which contains the additional data-cleaning code we used to modify our data set.

This is the fourth set in a series of exercises that aims to provide a descriptive analytics solution to the ‘2008’ data set. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

Outlier treatment is a vital part of descriptive analytics, since outliers can lead to misleading conclusions about our data, so it is an important skill to have. The following exercises demonstrate some basic and fairly simplistic methods of treating outliers. More sophisticated methods exist, but keep in mind that many people claim that ‘eyes beat maths’ when it comes to outliers.

Before proceeding, it might be helpful to look over the help pages for `table`, `subset`, `boxplot.stats`, `%in%`, `ifelse`, `rp.outlier`, and `scores`.

For this set of exercises you will need to install and load the packages `rapportools` and `outliers`.

`install.packages('rapportools')`

`library(rapportools)`

`install.packages('outliers')`

`library(outliers)`

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Print the summary statistics and the structure of the data set to see the types of the variables and their extreme values, and check whether those values make sense.

**Exercise 2**

For categorical variables, outliers are considered to be the values whose frequency is less than 10%. Make a `barplot` of `flights$UniqueCarrier` and `flights$CancellationCode`. What do you think? There are more categorical variables, so I encourage you to try them out as well.
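As a minimal sketch of the 10% rule described above, the share of each category can be computed with `table` and `prop.table`. The vector `carrier` below is hypothetical toy data, not the flights data set, and the variable names are made up for illustration.

```r
# Hypothetical toy data standing in for a categorical column
# such as flights$UniqueCarrier (values are made up).
carrier <- c(rep("AA", 50), rep("DL", 40), rep("XX", 5), rep("YY", 5))

# Relative frequency (share) of each category.
freq <- prop.table(table(carrier))

# Categories whose share falls below the 10% threshold.
rare <- names(freq)[as.vector(freq) < 0.10]
print(rare)
```

Calling `barplot(freq)` on the same table would draw the bars, which makes the rare categories easy to spot by eye.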

**Exercise 3**

Remove the outliers that you noticed in the barplots of the previous exercise; consider using the function `subset`.
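One possible way to drop rare categories is to combine `subset` with `%in%`. The sketch below uses a hypothetical data frame `toy` with made-up values; only the column name is borrowed from the exercise, and the 10% cut-off is the rule of thumb from the previous exercise.

```r
# Hypothetical toy data frame; the values are made up.
toy <- data.frame(
  UniqueCarrier = c(rep("AA", 50), rep("DL", 40), rep("XX", 5), rep("YY", 5)),
  stringsAsFactors = TRUE
)

# Categories that make up at least 10% of the rows.
share <- prop.table(table(toy$UniqueCarrier))
keep  <- names(share)[as.vector(share) >= 0.10]

# Keep only the rows whose category is in the "keep" set.
toy_clean <- subset(toy, UniqueCarrier %in% keep)

# Drop factor levels that no longer occur, so later tables
# and barplots do not show empty categories.
toy_clean$UniqueCarrier <- droplevels(toy_clean$UniqueCarrier)

print(table(toy_clean$UniqueCarrier))
```

The `droplevels` step is optional, but without it a factor column keeps the removed categories as empty levels.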

**Exercise 4**

A good way of detecting outliers in numerical variables is a `boxplot`; make one for `flights$ActualElapsedTime`.

**Exercise 5**

Remove the outliers of `flights$ActualElapsedTime` using `boxplot.stats`.
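A hedged sketch of the idea: `boxplot.stats` returns the points that a boxplot would draw as outliers in its `$out` component (by default, values more than 1.5 times the hinge spread beyond the hinges), and those values can then be filtered out. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy values; 300 lies far from the rest.
toy <- data.frame(
  ActualElapsedTime = c(60, 62, 65, 70, 72, 75, 80, 85, 90, 300)
)

# Values flagged as outliers by the boxplot rule.
out_vals <- boxplot.stats(toy$ActualElapsedTime)$out
print(out_vals)

# Keep the rows whose value is not among the flagged ones.
toy_clean <- subset(toy, !(ActualElapsedTime %in% out_vals))
print(nrow(toy_clean))
```

Because `%in%` returns `FALSE` for a missing value here, rows with `NA` in the column would be kept by this filter rather than silently dropped.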

**Exercise 6**

Remove outliers from `flights` using the `subset` function, keeping only the rows where `TaxiIn` is greater than 0 and less than 120.
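Reading this exercise as "keep the rows whose value lies strictly between the two bounds", a minimal sketch on hypothetical toy data looks like this. The data frame `toy` and its values are made up; only the column name and the bounds come from the exercise.

```r
# Hypothetical toy data, including boundary values and a missing value.
toy <- data.frame(TaxiIn = c(-1, 0, 5, 12, 119, 120, 400, NA))

# Keep rows strictly inside the (0, 120) range.
# subset() treats a missing condition as FALSE, so the NA row is dropped too.
toy_clean <- subset(toy, TaxiIn > 0 & TaxiIn < 120)
print(toy_clean$TaxiIn)
```

The same pattern applies to any numeric column with fixed plausibility bounds; only the column name and the limits change.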

**Exercise 7**

Remove outliers from `flights` using the `subset` function, keeping only the rows where `TaxiOut` is greater than 0 and less than 50.

**Exercise 8**

Using the `ifelse` function, assign `NA` to every value of `flights_exp$ArrDelay` that is an outlier.
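The exercise does not say which outlier definition to use. The sketch below assumes the boxplot rule from `boxplot.stats` and replaces flagged values with `NA` instead of deleting rows. The data frame `toy_exp` and its values are hypothetical.

```r
# Hypothetical toy data; 600 is far from the rest.
toy_exp <- data.frame(ArrDelay = c(-5, 0, 3, 8, 10, 12, 15, 20, 25, 600))

# Assumption: "outlier" means a point flagged by the boxplot rule.
out_vals <- boxplot.stats(toy_exp$ArrDelay)$out

# Replace flagged values with NA and leave the others unchanged.
toy_exp$ArrDelay <- ifelse(toy_exp$ArrDelay %in% out_vals,
                           NA_real_,
                           toy_exp$ArrDelay)
print(toy_exp$ArrDelay)
```

Setting values to `NA` keeps the number of rows intact, which is useful when the other columns of a flagged row are still trustworthy.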

**Exercise 9**

Use `rp.outlier` from the `rapportools` package to detect and remove the outliers of `flights_exp$Distance` using the Lund test.

**Exercise 10**

Find the 2% most extreme values of `flights$CRSElapsedTime` using the `scores` function with the chi-square method.
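To show what a chi-square score measures, here is a base R sketch that does not call the `outliers` package. It assumes the documented definition of chi-squared scores (squared difference from the mean divided by the variance) and compares them with the 98% quantile of a chi-square distribution with one degree of freedom. Treating this as equivalent to `scores(x, type = "chisq", prob = 0.98)` is an assumption, and the vector `x` is hypothetical toy data.

```r
# Hypothetical toy values; 400 is far from the rest.
x <- c(100, 105, 98, 110, 102, 95, 107, 99, 103, 400)

# Chi-square score: squared deviation from the mean, scaled by the variance.
chisq_score <- (x - mean(x))^2 / var(x)

# Flag values whose score exceeds the 98% quantile of chi-square(1),
# i.e. the (roughly) 2% most extreme values under that reference distribution.
extreme <- chisq_score > qchisq(0.98, df = 1)

print(x[extreme])
```

On real data the flagged share will only be close to 2% if the variable is roughly normal, so it is worth checking the flagged values against a boxplot as well.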
