Use box plots to assess the distribution and to identify the outliers in your dataset

August 14, 2015

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

After you check the distribution of the data by plotting a histogram, the second thing to do is to look for outliers. Identifying outliers is important because an association you find in your analysis may be explained entirely by their presence.

The best tool to identify outliers is the box plot. A box plot shows the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum of a continuous variable. The function to build a box plot is boxplot().

Let's look at an example:

# load data
data(iris)

# names of variables
names(iris)
# "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# build the box plot
boxplot(iris$Sepal.Length, main="Box plot", ylab="Sepal Length")

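This will generate the following box plot. Reading from the bottom, the horizontal lines show the minimum, lower quartile, median, upper quartile, and maximum value of Sepal.Length. When outliers are present, R draws them as separate points, and the outer lines stop at the most extreme values that are not outliers.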
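To see the numbers behind those five lines, here is a minimal sketch using boxplot.stats() from base R, which returns the statistics that boxplot() draws. Note that R places the box edges at Tukey's hinges, which can differ slightly from the quartiles reported by quantile(). The variable name bs is only illustrative.

```r
# statistics drawn by boxplot(): lower whisker, lower hinge, median,
# upper hinge, upper whisker
bs <- boxplot.stats(iris$Sepal.Length)
bs$stats

# points lying beyond the whiskers (none for this variable)
bs$out
```

Because bs$out is empty here, the two whisker ends coincide with the minimum and maximum of Sepal.Length.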

We can also use box plots to explore the distribution of a continuous variable across strata. I say strata because the grouping variable should be categorical. For example, you may want to see the distribution of age among individuals with and without high blood pressure. In the example below, I show sepal length across the different species.

# load data
data(iris)

# names of variables
names(iris)
# "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# build the box plot
boxplot(Sepal.Length ~ Species, data=iris,
     main="Box Plot",
     ylab="Sepal Length")

This will generate the following box plots:

If you look at the bottom of the third box plot, you will find an outlier. If you find an outlier in your dataset, I suggest removing it. How to remove outliers properly is a topic for another post; for now, you can check your dataset and remove the observation manually. There are also functions that remove outliers automatically.
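As a rough sketch of how the flagged observation could be located without reading it off the plot: when called with plot = FALSE, boxplot() returns the points it would draw as outliers together with the group each belongs to. The names bp, flagged, is_outlier and iris_clean are illustrative, and the rule applied is simply the default of boxplot() (points more than 1.5 times the box height beyond the box), which may or may not suit your data.

```r
# compute the box plot statistics without drawing the plot
bp <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# values drawn as outliers and the species each one belongs to
flagged <- data.frame(Species = bp$names[bp$group],
                      Sepal.Length = bp$out)
flagged

# rows of iris that match a flagged species/value pair
is_outlier <- paste(iris$Species, iris$Sepal.Length) %in%
  paste(flagged$Species, flagged$Sepal.Length)

# copy of the data without the flagged rows (iris itself is unchanged)
iris_clean <- iris[!is_outlier, ]
nrow(iris_clean)
```

Working on a copy keeps the original data intact, so you can compare results with and without the flagged observation.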

Feel free to post comments if you have any questions.
