Make A Box Plot with Single Column Data Using Ggplot2 Tutorial

November 7, 2016

(This article was first published on R – Saturn Science, and kindly contributed to R-bloggers)

Last week I had my class practice making a box plot using the data on page 66 of The Practice of Statistics, 4th Edition (TPS 4ed) textbook.

I’m still going over the details of making a box plot from a single vector, that is, one variable of data. Many of the problems in our textbook so far give this kind of data. To use ggplot, your data must be in a data frame, so for this exercise I’ll make some small adjustments and put the data into one.

My class is already familiar with matrices and matrix multiplication from their math class, but now they needed to learn about a different data format, the data frame. A data frame is a list of vectors of equal length, one per column, and each column can hold a different type of data.
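As a small illustration of that definition, here is a sketch of a data frame whose columns hold different types. The column names and values are made up for illustration and are not from the textbook.

```r
# A data frame is a list of equal-length vectors, one per column.
# Hypothetical example data: names and values are made up.
students = data.frame(
  name   = c("Ann", "Ben", "Cal"),   # text column (stored as a factor in R before 4.0)
  score  = c(91.5, 78, 84),          # numeric column
  passed = c(TRUE, TRUE, FALSE)      # logical column
)

str(students)             # shows the type of each column
is.list(students)         # a data frame is a list underneath
sapply(students, length)  # every column has the same length
```

Unlike a matrix, where every entry must be the same type, each column here keeps its own type.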

Our goal in the computer lab was to create a box plot from the data in the textbook using ggplot. They quickly found out that ggplot will not produce a plot from a single bare vector, since ggplot requires both an x and y variable for a box plot.

The class had to search for a way to change a single vector into a data frame so we could use ggplot. It only took a few minutes to find a solution on Stack Overflow.
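The conversion itself is short. The sketch below is one way to do it, not necessarily the exact answer the class found; the column name value and the variable names are my own choices. It also shows why naming the column is worth the extra typing.

```r
male.values = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11)  # data from page 66

# Naming the column gives ggplot something short to refer to in aes().
male.df = data.frame(value = male.values)
names(male.df)   # "value"

# Without a name, R builds one from the expression itself.
unnamed = data.frame(c(1, 2, 3))
names(unnamed)   # "c.1..2..3."
```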

An answer on Stack Overflow helped get them going. Before using ggplot, I had them use R’s base graphics just so we could see the difference. R’s base graphics will also plot the single vector directly.

Here is the data from page 66 and the box plot in base graphics. You can see it’s pretty basic.

male = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11)
boxplot(male)

 

Now we plot the same data in ggplot. To use ggplot, the data must first be in a data frame. I load ggplot and dplyr using the library function. I may use dplyr later so I’ll load it now.

Code for male data

library(dplyr)   # not used in this example; loaded in case I need it later
library(ggplot2)

male = data.frame(value = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11)) # data from page 66; the column is named "value"
ggplot(data = male, aes(x = "", y = value)) + # x = "" gives the single box an empty label
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 150)) # I zoom the y axis so the plot looks better; the points at 213 and 214 fall outside this view.

Look at the five number summary

Here we can take a quick look at the summary statistics.

summary(male)

##      value      
##  Min.   :  0.0  
##  1st Qu.:  8.5  
##  Median : 28.0  
##  Mean   : 62.4  
##  3rd Qu.: 80.5  
##  Max.   :214.0
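
The quartiles in the summary also explain which points a box plot draws separately. Both base boxplot and geom_boxplot by default flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below checks this by hand on the male data; the variable names are mine.

```r
male.values = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11)  # data from page 66

q1 = quantile(male.values, 0.25)   # first quartile
q3 = quantile(male.values, 0.75)   # third quartile
iqr = q3 - q1                      # interquartile range

upper.fence = q3 + 1.5 * iqr
lower.fence = q1 - 1.5 * iqr

# Values beyond the fences are drawn as separate points on the box plot.
male.values[male.values > upper.fence | male.values < lower.fence]

# Base R's helper reports the same points in $out.
boxplot.stats(male.values)$out
```

For this data the upper fence is 188.5, so 213 and 214 are the two values flagged.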

I now put the female data into a data frame and bring both male and female together into another data frame so I can plot both using ggplot. I found a neat method on Stack Overflow showing how to do this.

# Here is the code to plot male and female data using ggplot

a = data.frame(group = "male", value = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11))
b = data.frame(group = "female", value = c(112,203,102,54,379,305,179,24,127,65,41,27,298,6,130,0))

plot.data = rbind(a, b) # this function will bind or join the rows. See data at bottom.

ggplot(plot.data, aes(x=group, y=value, fill=group)) +  # This is the plot function
  geom_boxplot()      # This is the geom for box plot in ggplot.

The final result

Above, you can see both the male and female box plots together with different colors. Ggplot does most of the work, so it takes only a few lines of code. My students enjoy plotting the data from the textbook and learning how to manipulate the code to produce cool plots.
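
For comparison with the base graphics plot earlier, base R can also draw the two groups side by side using a formula. This is a sketch rather than part of the original lesson; it rebuilds plot.data so it can run on its own, and the two colors are arbitrary choices.

```r
a = data.frame(group = "male", value = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11))
b = data.frame(group = "female", value = c(112,203,102,54,379,305,179,24,127,65,41,27,298,6,130,0))
plot.data = rbind(a, b)

# value ~ group means "plot value, split by group"
bp = boxplot(value ~ group, data = plot.data, col = c("salmon", "skyblue"))

bp$names   # the group labels, in the order they were drawn
bp$n       # number of observations in each group
```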

They are also learning to troubleshoot their code, as I can only help with the basics. We are finding that Stack Overflow is a great resource.

The data in a data frame format

I have my students show their data, especially now that it’s in a data frame with two groups. Here is what the data looks like in the data frame. Notice how both male and female are in the column “group” and the values are in the column “value”.

plot.data

##     group value
## 1    male   127
## 2    male    44
## 3    male    28
## 4    male    83
## 5    male     0
## 6    male     6
## 7    male    78
## 8    male     6
## 9    male     5
## 10   male   213
## 11   male    73
## 12   male    20
## 13   male   214
## 14   male    28
## 15   male    11
## 16 female   112
## 17 female   203
## 18 female   102
## 19 female    54
## 20 female   379
## 21 female   305
## 22 female   179
## 23 female    24
## 24 female   127
## 25 female    65
## 26 female    41
## 27 female    27
## 28 female   298
## 29 female     6
## 30 female   130
## 31 female     0
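
With the data in this long format, per-group summaries are one line each. Here is a minimal sketch using base R's tapply; dplyr's group_by and summarise would be another route, but I have kept this to base R so it runs without extra packages.

```r
a = data.frame(group = "male", value = c(127,44,28,83,0,6,78,6,5,213,73,20,214,28,11))
b = data.frame(group = "female", value = c(112,203,102,54,379,305,179,24,127,65,41,27,298,6,130,0))
plot.data = rbind(a, b)

# Apply a function to value separately within each group.
medians = tapply(plot.data$value, plot.data$group, median)
medians

counts = tapply(plot.data$value, plot.data$group, length)
counts
```

The male median of 28 matches the summary above, and the female median works out to 107.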

Our next unit is on probability. I haven’t yet decided on an R lesson for probability. Maybe we’ll just continue practicing with more plots in ggplot.
