How to Remove Outliers in R

[This article was first published on finnstats and kindly contributed to R-bloggers.]

What is an outlier? It is an observation that differs significantly from the other values in a data set. Outliers can skew results and lead to misleading conclusions.

We’ll go over how to eliminate outliers from a dataset in this section.

How to Remove Outliers in R

To begin, we must first identify the outliers in a dataset. Two methods are commonly used: the interquartile range and z-scores.

1. Interquartile range.

In a dataset, it is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

The interquartile range (IQR) is a measurement of the spread of values in the middle 50%.

An observation is considered an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).

Outlier = Observations > Q3 + 1.5*IQR  or < Q1 – 1.5*IQR
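As a minimal sketch of this rule, the snippet below flags outliers in a small vector. The values in x are made up for illustration and are not data from this article.

```r
# Hypothetical example values; 40 is deliberately extreme
x <- c(10, 12, 11, 13, 12, 11, 40)

q1  <- quantile(x, 0.25)   # first quartile (Q1)
q3  <- quantile(x, 0.75)   # third quartile (Q3)
iqr <- q3 - q1             # interquartile range

# TRUE where a value falls outside the 1.5 * IQR fences
is_outlier <- x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr)
x[is_outlier]
```

Here Q1 is 11 and Q3 is 12.5, so the fences sit at 8.75 and 14.75 and only the value 40 is flagged.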

2. Use z-scores.

The z-score indicates the number of standard deviations a given value deviates from the mean. A z-score is calculated using the following formula:

z = (X – μ) / σ

where:

X is a single raw data value

μ is the population mean

σ is the population standard deviation

If an observation’s z-score is less than -3 or larger than 3, it’s considered an outlier.

Outlier = values with z-scores > 3 or < -3
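The z-score rule can be sketched the same way. The vector x below is a made-up example, and sd() returns the sample standard deviation, which is used here as an estimate of σ.

```r
# Hypothetical example values; 30 is deliberately extreme
x <- c(rep(c(9, 10, 11), times = 6), 10, 30)

# sd() is the sample standard deviation, used as an estimate of sigma
z <- (x - mean(x)) / sd(x)

# Values more than 3 standard deviations away from the mean
x[abs(z) > 3]
```

With this vector only the value 30 is flagged. Keep in mind that the rule needs a reasonably large sample: when the sample standard deviation is used, no z-score can exceed (n - 1) / sqrt(n) in absolute value, so with 10 or fewer observations nothing can ever be flagged.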

How to Remove Outliers in R

Once you have decided what counts as an outlier, you can find and remove such observations from a dataset. We’ll use the following data frame to demonstrate how to do so:

set.seed(123)
data <- data.frame(Apperance=rnorm(100, mean=8, sd=4),
Thickness=rnorm(100, mean=15, sd=2.3),
Softness=rnorm(100, mean=29, sd=2.5))
head(data)
  Apperance Thickness Softness
1 16.037138  15.53825 30.37900
2 10.553282  17.06525 25.33928
3  7.857956  21.40918 30.56349
4 12.113920  12.98547 29.77937
5  4.156319  17.92857 29.54414
6 10.438148  13.15290 31.03692

Method 1: Z-score

The code below demonstrates how to calculate the z-score of each value in each column in the data frame, then eliminate rows having at least one z-score with an absolute value greater than 3.

# Absolute z-score of every value, computed column by column
z_scores <- as.data.frame(sapply(data, function(x) abs(x - mean(x)) / sd(x)))

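Only rows in which every z-score is 3 or less are kept. The filter is built from z_scores but applied to the original data, so no_outliers holds the raw values. The output below shows the z-scores of the first rows that are kept.

keep <- rowSums(z_scores > 3) == 0   # TRUE for rows with no z-score above 3
no_outliers <- data[keep, ]
head(z_scores[keep, ])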
  Apperance Thickness   Softness
1 0.3614132 0.1129102 0.04407156
2 1.4740501 1.1390075 2.02913381
3 0.6701501 0.6016034 0.26996386
4 1.6611551 0.2010902 0.32123480
5 0.1314868 0.5332800 1.35252878
6 1.1568042 1.3903598 1.46030266

Let’s check the dimensions of both data frames.

dim(data)
100   3
dim(no_outliers)
99  3

One row contained an outlier and was removed, leaving 99 rows for further analysis.

Method 2: Interquartile Range

The code below shows how to remove rows from the data frame whose value in column ‘Apperance’ lies more than 1.5 times the interquartile range below the first quartile (Q1) or more than 1.5 times the interquartile range above the third quartile (Q3).

Q1 <- quantile(data$Apperance, .25)
Q3 <- quantile(data$Apperance, .75)
# Named iqr so the variable does not mask the IQR() function
iqr <- IQR(data$Apperance)

Now we keep the values that lie within 1.5*IQR of Q1 and Q3.

no_outliers <- subset(data, Apperance > (Q1 - 1.5*iqr) & Apperance < (Q3 + 1.5*iqr))
dim(no_outliers)
99   3

Comparing these 99 rows with the original 100 shows that the ‘Apperance’ column contains one outlier.
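The same rule can be applied to every column at once. The sketch below assumes all columns are numeric; is_iqr_outlier, flags and no_outliers_all are hypothetical names introduced here for illustration.

```r
# Recreate the example data frame so the snippet runs on its own
set.seed(123)
data <- data.frame(Apperance = rnorm(100, mean = 8, sd = 4),
                   Thickness = rnorm(100, mean = 15, sd = 2.3),
                   Softness  = rnorm(100, mean = 29, sd = 2.5))

# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}

# Logical matrix with one column of flags per variable
flags <- sapply(data, is_iqr_outlier)

# Keep only the rows in which no column is flagged
no_outliers_all <- data[rowSums(flags) == 0, ]
dim(no_outliers_all)
```

Filtering on all columns is stricter than filtering on ‘Apperance’ alone, so it can remove more rows.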

For a graphical representation, you can use the code below.

boxplot(data)
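To list the points that a boxplot would draw as outliers, base R’s boxplot.stats() returns them in its out component. The snippet below is a sketch using a made-up vector. Note that boxplot.stats() takes its box edges from hinges rather than from quantile(), so its results can differ slightly from the Q1/Q3 rule above.

```r
# Hypothetical example values; 40 is deliberately extreme
x <- c(10, 12, 11, 13, 12, 11, 40)

# $out holds the values lying beyond the whiskers (coef = 1.5 by default)
boxplot.stats(x)$out
```

For this vector the only value reported is 40.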
