[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers].

Histograms with R and ggplot2

Be honest. How uninspiring are your data visualizations? Expert designers make graph design look effortless, but in reality, it couldn’t be further from the truth. Luckily, the R programming language provides countless ways to make your visualizations eye-catching.


This article will show you how to make stunning histograms with R’s ggplot2 library. We’ll start with a brief introduction to the theory behind histograms, just in case you’re rusty on the subject. You’ll then see how to create and tweak ggplot histograms, taking them to new heights.

What is a Histogram?

A histogram is a way to graphically represent the distribution of your data using bars of different heights. A single bar (bin) represents a range of values, and the height of the bar represents how many data points fall into the range. You can change the number of bins easily.

The easiest way to understand them is through visualization. The image below shows a histogram of 10,000 numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1):

Image 1 – Histogram of a standard normal distribution

Although at first glance the histogram doesn’t look like much, it actually tells you a lot. When data is distributed normally (bell curve), you can draw the following conclusions:

• About 68.27% of the data points are located within 1 standard deviation of the mean (34.13% in either direction).
• About 95.45% of the data points are located within 2 standard deviations of the mean (47.72% in either direction).
• About 99.73% of the data points are located within 3 standard deviations of the mean (49.87% in either direction).
• Anything outside the -3 and +3 standard deviation range is rare (roughly 0.27% of the data) and is often treated as an outlier.
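These shares can be checked against the normal cumulative distribution function. Here is a minimal sketch using base R's pnorm(); the variable names are ours, and the exact values may differ from rounded figures in the last digit:

```r
# Share of a standard normal distribution that lies within
# k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
shares <- pnorm(k) - pnorm(-k)

# Express the shares as percentages rounded to two decimals
percentages <- round(100 * shares, 2)
names(percentages) <- paste0("+/-", k, " SD")
print(percentages)
```

Halving each value gives the share on one side of the mean.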

In reality, you’re rarely dealing with a perfectly normal distribution. It’s usually skewed in either direction or has multiple peaks. Keep this in mind when drawing conclusions from the shape of a histogram alone.

Let’s see how you can use R and ggplot to visualize histograms.

We’ll use the Gapminder dataset throughout the article to visualize histograms. It’s a relatively small dataset showing life expectancy, population, and GDP per capita in countries between 1952 and 2007. We’ll use only a subset that shows countries in Europe and discard everything else.

Here’s the code you need to import libraries, load, and filter the dataset:
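The snippet below is a minimal sketch of that step, not necessarily the original code. It assumes the data comes from the gapminder package on CRAN and that dplyr is used for filtering; only the name gm_eu is taken from the article:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: the dataset is loaded from the gapminder package

# Keep only the European countries and discard everything else
gm_eu <- gapminder %>%
  filter(continent == "Europe")

# Peek at the first few rows
head(gm_eu)
```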

Here’s what the first few rows of gm_eu look like:

Image 2 – European countries of the Gapminder dataset

We’ll visualize the lifeExp column with histograms, as it provides enough continuous data to play around with.

Let’s make the most basic ggplot histogram first. You can use the geom_histogram() function to do so. Provided you’ve passed in the dataset and the default aesthetics:
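A minimal sketch of such a default histogram follows. It rebuilds gm_eu with base R's subset() so the snippet runs on its own, and it assumes the gapminder package as the data source:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# Map lifeExp to the x-axis and let geom_histogram() do the binning
p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram()
p
```

ggplot2 prints a message about the default of 30 bins when this runs; that is expected.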

Image 3 – Default histogram

Well, you won’t see anything like that on a website or in a magazine, so we’d better get our hands dirty with some tweaking.

Let’s start by changing the number of bins (bars). The default value is 30, and it works in most cases. If you want your histograms to look boxier, use fewer bins. On the other hand, go big if you want your histograms to look like density plots. Here’s what a histogram with 10 bins looks like:
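The bin count is controlled by the bins argument of geom_histogram(). A minimal sketch, again assuming the gapminder package as the data source:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# Use 10 bins instead of the default 30
p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(bins = 10)
p
```

If you would rather fix the width of each bar than the number of bars, geom_histogram() also accepts a binwidth argument.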

Image 4 – Histogram with 10 bins

Let’s stick with the default number of bins for the rest of the article, as it looks somewhat better.

The coloring is painful to look at. There’s nothing wrong with gray, but it looks too boring. Here’s how to give your ggplot histogram some Appsilon flair: a blue fill color with black borders.
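In geom_histogram(), fill sets the bar color and color sets the outline. The sketch below assumes the gapminder package as the data source, and the blue hex code is only an illustrative choice:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# fill = inside of the bars, color = bar outline
# "#0099F8" is an illustrative blue, not a value confirmed by the article
p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8")
p
```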

Image 5 – Tweaking the fill and outline color

Much better, provided you like the blue color. Let’s dive deeper into styling and annotations next.

How to Style and Annotate ggplot Histograms

Styling

You can bring more life to your ggplot histogram. For example, we sometimes like to add a vertical line representing the mean, and two surrounding lines representing the range between -1 and +1 standard deviations from the mean. It’s a good idea to style the lines differently, just so your histogram isn’t confusing.

The following code snippet draws a black line at the mean, and dashed black lines at -1 and +1 standard deviation marks:
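A minimal sketch of this idea uses geom_vline(). It assumes the gapminder package as the data source; the helper names m and s and the fill color are ours:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# Mean and standard deviation of life expectancy
m <- mean(gm_eu$lifeExp)
s <- sd(gm_eu$lifeExp)

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # Solid line at the mean
  geom_vline(xintercept = m, color = "#000000") +
  # Dashed lines one standard deviation below and above the mean
  geom_vline(xintercept = c(m - s, m + s), color = "#000000", linetype = "dashed")
p
```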

Image 6 – Adding vertical lines to histograms

Are you up for a challenge? Try to recreate our histogram from Image 1. Hint: use geom_segment() instead of geom_vline().

Every so often you want to make your ggplot histogram richer by combining it with a density plot. It shows more or less the same information, just in a smoother format. Here’s how you can add a density plot overlay to your histogram:
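One way to sketch the overlay is to put the histogram on the density scale, so both layers share the same y-axis. This assumes ggplot2 3.3.0 or later for after_stat() and the gapminder package as the data source; the colors are illustrative:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  # Plot densities instead of counts so the bars match the density curve
  geom_histogram(aes(y = after_stat(density)), color = "#000000", fill = "#0099F8") +
  # Semi-transparent density curve drawn on top of the bars
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
p
```

Without the after_stat(density) mapping the histogram would show raw counts, and the density curve would be squashed flat along the bottom of the chart.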

Image 7 – Adding density plots to histograms

It’s a somewhat richer data representation than the histogram alone. For example, if you were to embed the above chart in a dashboard, you could let the user toggle the overlay for maximum customizability.

Do you want to build dashboards professionally? Here’s how to start a career as an R Shiny Developer.

Annotations

Finally, let’s see how you can add annotations to your ggplot histogram. Maybe you find vertical lines too intrusive, and you just want a plain textual representation of specific values.

First things first, you’ll need to create a data.frame for annotations. It should contain X and Y values, and also the labels that will be displayed:
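A minimal sketch of such a data frame follows, assuming you want to label the minimum, mean, and maximum life expectancy. The y positions are hand-picked placeholders that you would tune against your own chart, and the gapminder package is assumed as the data source:

```r
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# One row per annotation: where to draw it (x, y) and what to show (label)
annotations <- data.frame(
  x = round(c(min(gm_eu$lifeExp), mean(gm_eu$lifeExp), max(gm_eu$lifeExp)), 2),
  y = c(4, 54, 20),  # hypothetical heights, chosen by eye
  label = c("Min:", "Mean:", "Max:")
)
annotations
```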

You can now include these in a geom_text() layer. Hint: make the annotations bold, so they’re easier to spot:
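The sketch below passes the annotation data frame to geom_text() through its data argument. It recreates the data frame so the snippet runs on its own; the y positions, text size, and colors are illustrative assumptions:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

# Annotation positions and labels (y values are hypothetical, chosen by eye)
annotations <- data.frame(
  x = round(c(min(gm_eu$lifeExp), mean(gm_eu$lifeExp), max(gm_eu$lifeExp)), 2),
  y = c(4, 54, 20),
  label = c("Min:", "Mean:", "Max:")
)

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # The text layer uses its own data; fontface = "bold" makes labels stand out
  geom_text(
    data = annotations,
    aes(x = x, y = y, label = paste(label, x)),
    size = 5, fontface = "bold"
  )
p
```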

Image 8 – Adding annotations to histograms

The trick with annotations is making sure there’s some gap between them, so the text doesn’t overlap.

Let’s also see how you can remove this grayish background color. The easiest approach is to add a more minimalistic theme to the chart. theme_classic() is one of our top picks:
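Switching themes takes one extra layer. A minimal sketch, assuming the gapminder package as the data source and an illustrative fill color:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # Replaces the default gray panel with a white background and plain axis lines
  theme_classic()
p
```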

Image 9 – Changing the theme

The only thing missing from our ggplot histogram is the title and axis labels. The users don’t know what they’re looking at without them.

Add Text, Titles, Subtitles, Captions, and Axis Labels to ggplot Histograms

Titles and axis labels are mandatory for production-ready charts. Subtitles or captions are optional, but we’ll show you how to add them as well. The magic happens in the labs() layer. You can use it to specify the values for title, subtitle, caption, X-axis, and Y-axis:
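A minimal sketch of the labs() layer follows. All label texts here are placeholders of our own, and the gapminder package is assumed as the data source:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # Placeholder texts; replace them with wording that fits your chart
  labs(
    title = "Life expectancy in European countries",
    subtitle = "Observations between 1952 and 2007",
    caption = "Source: Gapminder dataset",
    x = "Life expectancy",
    y = "Count"
  ) +
  theme_classic()
p
```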

Image 10 – Adding title, subtitle, caption, and axis labels

It’s a good start, but the newly added elements don’t stand out. You can change the font, color, and size, among other things, in the theme() layer. Just make sure to add a complete theme such as theme_classic() before your theme() layer. A complete theme added afterward would override your custom styles:
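The sketch below shows the ordering: theme_classic() first, then theme() with element_text() settings. The label texts, sizes, and colors are illustrative assumptions, as is the gapminder package as the data source:

```r
library(ggplot2)
library(gapminder)

# European subset of gapminder (the article's gm_eu)
gm_eu <- subset(gapminder, continent == "Europe")

p <- ggplot(gm_eu, aes(x = lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  labs(
    title = "Life expectancy in European countries",
    subtitle = "Observations between 1952 and 2007",
    caption = "Source: Gapminder dataset",
    x = "Life expectancy",
    y = "Count"
  ) +
  # Complete theme first ...
  theme_classic() +
  # ... then the custom tweaks, so they are not overridden
  theme(
    plot.title = element_text(color = "#0099F8", size = 16, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic")
  )
p
```

Swapping the last two layers would reset the title, subtitle, and caption to the theme_classic() defaults.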

Image 11 – Styling title, subtitle, and caption

It’s starting to shape up now. And it also matches the color palette of our ggplot histogram. We’ve covered everything needed to get you started visualizing your data distributions with histograms, so we’ll call it a day here. But there’s so much more you can do with your visualizations. Check out some of our Shiny demos to see where advanced level R programming can take your data visualizations.

Did you know there’s another way to visualize data distributions? Read our complete guide to boxplots.

Conclusion

Today you’ve learned what histograms are, why they are important for visualizing the distribution of continuous data, and how to make them appealing with R and the ggplot2 library. It’s enough to set you on the right track, and now it’s up to you to apply this knowledge to your datasets. We’re sure you can manage it.

At Appsilon, we’ve used histograms and the ggplot2 package in developing enterprise R Shiny dashboards for Fortune 500 companies. If you have experience with R and R Shiny, we might have a position ready for you.
