How to Remove Outliers in R: what is an outlier? An outlier is an observation that differs significantly from the other values in a data set. Outliers can skew results and lead to misleading conclusions.
We’ll go over how to eliminate outliers from a dataset in this section.
How to Remove Outliers in R
To begin, we must first identify the outliers in a dataset. Two methods are commonly used: the interquartile range and z-scores.
1. Interquartile range.
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset.
It measures the spread of the middle 50% of the values.
An observation is considered an outlier if it lies more than 1.5 times the IQR above the third quartile (Q3) or more than 1.5 times the IQR below the first quartile (Q1).
Outlier = Observations > Q3 + 1.5*IQR or < Q1 – 1.5*IQR
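As a minimal sketch of this rule, the snippet below applies it to a small made-up vector. The values in x are an assumption chosen for illustration; quantile() is used with its default settings.

```r
# Toy data (assumed for illustration): seven ordinary values and one extreme value
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

q1 <- quantile(x, 0.25)   # first quartile
q3 <- quantile(x, 0.75)   # third quartile
iqr <- q3 - q1            # interquartile range

lower <- q1 - 1.5 * iqr   # lower fence
upper <- q3 + 1.5 * iqr   # upper fence

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers
```

Here only the value 30 falls outside the fences, so it is the only observation flagged.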
2. Z-scores.
The z-score indicates the number of standard deviations a given value deviates from the mean. A z-score is calculated using the following formula:
z = (X – μ) / σ
X is a single raw data value
μ is the population mean
σ is the population standard deviation
If an observation’s z-score is less than -3 or larger than 3, it’s considered an outlier.
Outlier = values with z-scores > 3 or < -3
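The formula can be sketched in R as follows. The vector x is made up for illustration. Note that sd() returns the sample standard deviation, which is used here as an estimate of σ.

```r
# Toy data (assumed for illustration): fifteen similar values and one extreme value
x <- c(10, 12, 11, 13, 9, 10, 12, 11, 10, 13, 9, 11, 12, 10, 11, 50)

# z-score of every value: distance from the mean in standard deviations
z <- (x - mean(x)) / sd(x)

# Values whose z-score is above 3 or below -3 are flagged as outliers
outliers <- x[abs(z) > 3]
outliers
```

With very small samples no z-score can exceed 3 in absolute value, so this rule is only useful once there are enough observations.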
How to Remove Outliers in R
Once you've decided what you consider to be an outlier, you can find and remove outliers from a dataset. We'll use the following data frame to demonstrate how to do so:
set.seed(123)
data <- data.frame(Apperance=rnorm(100, mean=8, sd=4),
                   Thickness=rnorm(100, mean=15, sd=2.3),
                   Softness=rnorm(100, mean=29, sd=2.5))
head(data)
  Apperance Thickness Softness
1 16.037138  15.53825 30.37900
2 10.553282  17.06525 25.33928
3  7.857956  21.40918 30.56349
4 12.113920  12.98547 29.77937
5  4.156319  17.92857 29.54414
6 10.438148  13.15290 31.03692
Method 1: Z-score
The code below demonstrates how to calculate the z-score of each value in each column in the data frame, then eliminate rows having at least one z-score with an absolute value greater than 3.
# Absolute z-score of every value, computed column by column
z_scores <- as.data.frame(sapply(data, function(x) abs(x - mean(x)) / sd(x)))
Only rows in the data frame with all z-scores less than 3 are kept.
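# Keep rows where no column has an absolute z-score above 3.
# Note: indexing z_scores returns the z-scores of the retained rows;
# to get the original values instead, index data with the same condition:
#   no_outliers <- data[!rowSums(z_scores>3), ]
no_outliers <- z_scores[!rowSums(z_scores>3), ]
head(no_outliers)
  Apperance Thickness   Softness
1 0.3614132 0.1129102 0.04407156
2 1.4740501 1.1390075 2.02913381
3 0.6701501 0.6016034 0.26996386
4 1.6611551 0.2010902 0.32123480
5 0.1314868 0.5332800 1.35252878
6 1.1568042 1.3903598 1.46030266
Let's check the dimensions of both data frames.
dim(data)
[1] 100   3
dim(no_outliers)
[1] 99  3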
One row contained an outlier, and it was removed before further analysis.
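The z-score steps can also be wrapped in a small function that returns the original values rather than the z-scores. This is only a sketch: remove_z_outliers and the toy data frame are hypothetical names introduced for illustration, and scale() is used to compute the column-wise z-scores.

```r
# Hypothetical helper: drop rows where any column has |z| above the threshold.
# Assumes every column of df is numeric.
remove_z_outliers <- function(df, threshold = 3) {
  z <- abs(scale(df))                  # column-wise (x - mean) / sd, as a matrix
  keep <- rowSums(z > threshold) == 0  # TRUE when no column exceeds the threshold
  df[keep, , drop = FALSE]
}

# Toy data (assumed): column a has one extreme value in its last row
toy <- data.frame(
  a = c(5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 7, 6, 40),
  b = 1:13
)

clean <- remove_z_outliers(toy)
nrow(clean)
```

Only the row holding the value 40 is dropped, so clean keeps 12 of the 13 rows with their original values.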
Method 2: Interquartile Range
The code below shows how to remove rows from the data frame whose value in the 'Apperance' column is more than 1.5 times the interquartile range below the first quartile (Q1) or more than 1.5 times the interquartile range above the third quartile (Q3).
Q1 <- quantile(data$Apperance, .25)
Q3 <- quantile(data$Apperance, .75)
# Stored as iqr so the variable does not mask the IQR() function
iqr <- IQR(data$Apperance)
Now we keep the values within 1.5*IQR of Q1 and Q3.
no_outliers <- subset(data, Apperance > (Q1 - 1.5*iqr) & Apperance < (Q3 + 1.5*iqr))
dim(no_outliers)
[1] 99  3
One row was removed, which means the 'Apperance' column contained one outlier.
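The example above filters on a single column. If the same rule should apply to every column, one possible approach is sketched below. The helper within_fences and the toy data frame are hypothetical and not part of the original example.

```r
# Hypothetical helper: TRUE where a value lies strictly inside the k*IQR fences
within_fences <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  spread <- q[2] - q[1]                      # interquartile range
  x > (q[1] - k * spread) & x < (q[2] + k * spread)
}

# Toy data (assumed): one extreme value per column, in different rows
df <- data.frame(
  a = c(2, 4, 5, 6, 7, 8, 9, 30),
  b = c(-40, 11, 12, 13, 14, 15, 16, 17)
)

# A row is kept only if every column passes the check
keep <- Reduce(`&`, lapply(df, within_fences))
clean <- df[keep, ]
dim(clean)
```

Rows 1 and 8 each contain one value outside the fences, so clean keeps the six rows in between.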
For a graphical representation, you can use the code below.