Show, don’t tell! Share data insights in stunning color and display with ggplot2, a wonderful R package for visualizing data.
ggplot2: Grammar of Graphics
The end goal of data analysis should be clear: beautiful data visualizations. We are visual beings, after all, and a picture often tells us far more than raw numbers! Among the many visualization tools, one in particular stands out: ggplot2, a free, open-source, and easy-to-use package that has become a favorite among many R programmers. This article explains the basics of ggplot2 and guides you in creating beautiful data visualizations.
Scenario: Game of Thrones—U.S. Viewers per Episode
A few weeks ago, the last episode of Game of Thrones, Season 7, aired. And I must say, it was nothing short of spectacular! Throughout the whole episode, my eyes were wide open: fire, ice, mysteries, intrigues, dragons, and unexpected twists. “This is probably the most watched episode yet!” I thought to myself.
How many viewers saw the episode? Has this season had the greatest number of viewers so far? Is the number of viewers on the rise from season to season? I decided to search for some data and create graphics in order to answer these questions. Surfing the net, I came across a table listing the number of viewers for each episode of each season. This will serve as an excellent example for practicing with ggplot2. Let’s make some plots!
What Is ggplot2?
Developed in 2005, ggplot2 is one of the most famous R packages for statistical graphics. It’s very popular because it relies on an iterative mode of work, its syntax is easy to learn, and the package is admirably maintained and documented. “Iterative” means that each component (part) of the graph can be built separately with its own specific syntax (the so-called “grammar” of graphics). For example, plot type and aesthetics are two separate components, and they are each defined by separate syntaxes. At any time we can save a plot as an object and then upgrade its components when necessary.
In the next section, we’ll move away from the theory and work through an example that illustrates some basic graph components and how they can be created with ggplot2 syntax. We’ll cover the following components:
- data
- aesthetics
- geometrical shapes
- scales, axes and legends
- themes
In order for us even to attempt visualization, we must first ensure that we have well-prepared data. In our example, we’ve prepared a table (what’s known as a “data frame” in R) with the number of viewers for each episode of the Game of Thrones series. It has 70 total observations, each representing one episode of a season.
For Season 7, the data in the table look like this:
Season 7: Number of U.S. viewers per episode
Data from the net were saved in an Excel file, so we’ll import them into R and store them in a data frame object we’ll name GOT. Here is the code:
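A minimal sketch of this import step might look as follows. It assumes the readxl package and a hypothetical file name, GOT.xlsx (the real file name isn't given here). If that file is missing, the snippet falls back to a tiny data frame with made-up values and the same three columns, so it still runs on its own. The Season labels are also an assumption.

```r
# Sketch only: "GOT.xlsx" is a hypothetical file name.
path <- "GOT.xlsx"

if (file.exists(path)) {
  # read_excel() returns a tibble; as.data.frame() converts it to a plain data frame
  GOT <- as.data.frame(readxl::read_excel(path))
} else {
  # Fallback toy data: made-up values, NOT real viewership figures
  GOT <- data.frame(
    Season = rep(c("Season 1", "Season 2"), each = 3),
    Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
    Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
  )
}

# Inspect the column names and types before plotting
str(GOT)
```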
Great! The first component has been defined. Now that we have our data, our next step is to define the aesthetics.
Which data are to be presented, and how will we present them on the graph? What will we put on our axes, and how will the data be grouped? The answers to these questions will define our second component. We must define the variables to be represented along our horizontal and vertical axes and additional variables (if necessary) for color, shape, size, and group.
In our example, we will put Episode on the horizontal axis and Number_of_viewers on the vertical axis. Because we want to distinguish episodes by season, we’ll group and color our data according to a third variable named Season.
Aesthetics are defined through the aes() function. This is the code for our example:
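Here is a sketch of what that call could look like. The column names come from the description above; the small GOT data frame is a toy stand-in with made-up values, included only so the snippet runs by itself.

```r
library(ggplot2)

# Toy stand-in for the GOT data frame: made-up values, NOT real figures
GOT <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
  Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
)

# Data plus aesthetics: x and y axes, with colour and grouping by Season.
# No geom has been added, so no data will be drawn yet.
p <- ggplot(GOT, aes(x = Episode, y = Number_of_viewers,
                     colour = Season, group = Season))
print(p)
```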
Using this syntax, we have defined our first and second components. However, executing the R code above gives us an empty canvas: there is no visualized data on the graph.
Empty plot—geometric shapes are not defined.
Why is the graph empty? Because we have not yet defined which geometric shapes will represent the data. We must specify what kind of plot we want to create.
Our next step is to define the geometric shapes that will be used to represent our data. For example, with geom_point(), we will make a scatter plot; with geom_line(), we will create a line graph; and with geom_bar(), we will present a bar chart. (There are many different geometrical shapes, and I’m just mentioning the ones I use the most.)
Let’s go back to our example. Because a line graph is useful for displaying data that change continuously over time, we’ll use that type of graph to show changes in the number of viewers from season to season.
We’ll use geom_line() to make a line graph in our example:
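As a sketch, adding the line layer could look like this. The GOT data frame below is again a toy stand-in with made-up values, so the snippet is runnable without the Excel file.

```r
library(ggplot2)

# Toy stand-in for the GOT data frame: made-up values, NOT real figures
GOT <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
  Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
)

# group = Season tells geom_line() to draw one line per season;
# without it, a discrete x axis would give each episode its own group.
p <- ggplot(GOT, aes(x = Episode, y = Number_of_viewers,
                     colour = Season, group = Season)) +
  geom_line()
print(p)
```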
We now have our initial visualization:
First version of the graph
Wouldn’t it be nice to add points to the graph? This would make the information easier to read, since the points would emphasize the specific number of viewers for each episode. We can do this with ease, since ggplot2 allows you to use different shapes on a single plot.
Let’s add points with geom_point():
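A sketch of the two layers combined, once more on a toy stand-in data frame with made-up values:

```r
library(ggplot2)

# Toy stand-in for the GOT data frame: made-up values, NOT real figures
GOT <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
  Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
)

# Layers are stacked with "+": the lines are drawn first, then the points on top.
# Both layers inherit the aesthetics defined in ggplot().
p <- ggplot(GOT, aes(x = Episode, y = Number_of_viewers,
                     colour = Season, group = Season)) +
  geom_line() +
  geom_point()
print(p)
```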
Now the plot looks a little bit prettier:
Number of viewers of each episode
Having defined the base components of our graph, let’s now turn to the task of making the graph more readable.
Scales, Axes and Legends
If we take a look at the graph above without reading the article, it is not clear what the graph is about. It is very important that the visualization is simple, readable, and communicates a clear message. Adding a title, changing or removing the title of the legend, and properly setting axes values and labels will make our plot much more readable. With these components, we are specifying how our aesthetics (x and y axes, etc.) should be presented on the graph.
Let’s make some changes:
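The changes could be sketched roughly like this. The axis titles and the 2 to 12 range follow the description in this section, while the plot title text, the use of ggtitle(), and the label-stripping function are my own assumptions. The data frame is a toy stand-in with made-up values.

```r
library(ggplot2)

# Toy stand-in for the GOT data frame: made-up values, NOT real figures
GOT <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
  Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
)

p <- ggplot(GOT, aes(x = Episode, y = Number_of_viewers,
                     colour = Season, group = Season)) +
  geom_line() +
  geom_point() +
  # Discrete x axis: new title, and labels with the "Ep" prefix stripped
  scale_x_discrete(name = "Episode number",
                   labels = function(x) sub("^Ep", "", x)) +
  # Continuous y axis: new title and whole-number breaks from 2 to 12
  scale_y_continuous(name = "Number of U.S. viewers in millions",
                     breaks = seq(2, 12, by = 1), limits = c(2, 12)) +
  # name = NULL drops the legend title
  scale_colour_discrete(name = NULL) +
  # Hypothetical title text, chosen only for this sketch
  ggtitle("Game of Thrones: U.S. viewers per episode")
print(p)
```

Passing a function to labels keeps the sketch independent of how many episodes the data contain; a fixed vector such as 1:10 would work too, as long as its length matches the number of episodes on the axis.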
The result is the following graph:
We accomplished the following:
- scale_x_discrete() was used to change the x-axis title (“Episode number” instead of “Episode”) and its labels (“1, 2, …, 10” instead of “Ep1, …, Ep10”). Because the variable Episode is discrete (it has a finite set of values), we use the discrete version of the scale here.
- Because the variable on the y-axis is continuous, the title and values presented on that axis are changed with scale_y_continuous(). The title is changed from “Number_of_viewers” to “Number of U.S. viewers in millions”, and the values on the y-axis are now the whole numbers 2 through 12 (in increments of one).
- The legend title is removed with scale_colour_discrete(). We used scale_colour_discrete() here because our legend is tied to the discrete Season variable, which we used for coloring and grouping in our aes() call.
- A title was added to the plot.
Wouldn’t you like to change the background color? Or emphasize the title? This is possible with themes. Themes are used to control details such as fonts, tick marks, panel strips, and the backgrounds of the entire canvas. Let’s bold our title and change the background color of our plot.
The code is as follows:
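A sketch of such a theme() call, applied to a simplified version of the plot. The exact gray shade ("grey50") and the choice to draw the border through the colour argument of element_rect() are assumptions on my part; the data frame is a toy stand-in with made-up values, and the title text is hypothetical.

```r
library(ggplot2)

# Toy stand-in for the GOT data frame: made-up values, NOT real figures
GOT <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep3"), times = 2),
  Number_of_viewers = c(2.2, 2.4, 2.6, 3.8, 3.9, 4.1)
)

p <- ggplot(GOT, aes(x = Episode, y = Number_of_viewers,
                     colour = Season, group = Season)) +
  geom_line() +
  geom_point() +
  ggtitle("Game of Thrones: U.S. viewers per episode") +  # hypothetical title
  theme(
    # Bold plot title
    plot.title = element_text(face = "bold"),
    # White panel background; colour sets the outline drawn around the panel
    panel.background = element_rect(fill = "white", colour = "grey50")
  )
print(p)
```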
Here’s what we did:
- With the plot.title = element_text(face = "bold") argument, we made our title bold.
- The background of our panel was changed to white, and the border was set to gray. This was accomplished with the panel.background parameter, plus an additional setting for the border.
Our final plot looks like this:
We’ve completed our visualization! Now we can easily draw some conclusions related to the viewership of Game of Thrones. For example, we can clearly see from the graph that Season 7 had the most viewers. With each new season, viewership grows, and the last episode of Season 7 had more than 12 million viewers! Isn’t it great to read so much information from just one picture?
Of course, this is just scratching the surface of what we can do with ggplot2. This package gives you fine-grained control over each graph component, making it far easier to visualize data in a way that suits your needs.
With that said, only one question remains: Are you ready to invest just a little time to create some beautiful visualizations? If so, ggplot2 is the right tool for you. To learn more about using ggplot2, check out our free Data Visualization 101 course. With our hands-on demos and exercises, you’ll master all the cool ways to plot and visualize data in R.