How to Create a Data Visualization from the New York Times in R

[This article was first published on R – nandeshwar.info, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Undoubtedly, the New York Times publishes the best data visualizations and infographics that are data intensive, yet are elegant. The elegance comes from carefully studying the data, identifying the key patterns and simplifying the graphics to show these patterns or trends.

Here’s what Amanda Cox, editor of The Upshot, said in an interview “we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute.”

we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute.

Amanda Cox

Two phrases worth repeating from her quote: “clarity” and “not just about making things cute.” As analysts, we often celebrate “interesting” things, which really are shiny objects with hardly any actionable information. Interesting is not actionable. NYT’s data visualizations may not be “cute,” but they certainly make information accessible, increasing the chances of action. Paraphrasing Wayne Gretzky: You waste 100% of your analysis or data visualizations that the readers don’t take action on.

Interesting is not actionable. You waste 100% of your analysis or data visualizations that the readers don’t take action on.

Although the interactive data visualizations on the New York Times website are crazily better than the static charts, they require more work and knowledge of JavaScript frameworks.

For this article, we will produce a static chart.

I saw this graphic in a sobering article on suicide rates in the US. The article says — and the graphs clearly show — that the suicide rates have gone up in every age group, except for older adults.

nyt-best-data-visualization-plot

From 1999 to 2014, the 30-year period studied in the article, the overall suicide rate went up by 24%.

24%???

That’s shocking! More than 42,000 people committed suicide in 2014, compared with 29,199 in 1999. In 2014, death by suicide was the ninth leading cause of deaths in the US.

You don’t need to study all the numbers to see that the rates have gone up: data visualizations make this observation unforgettable.

How do we go about creating such a infographic? That’s what I wanted to find out.

I thought this was a type of a parallel coordinate plot, but if you look at it again, it is actually a line graph. Something ggplot can easily handle.

Step 1: Load Libraries

I loaded all my favorite libraries, with much gratitude to Hadley Wickham:

  • ggplot2 for plotting, of course
  • reshape2 for transforming the data frame
  • ggthemes to get beautiful themes for ggplot without writing all the customization
  • directlables to insert annotations and legends
  • stringr for some simple string manipulation
library(ggplot2)
library(reshape2)
library(ggthemes)
library(directlabels)
library(stringr)

Step 2: Data Manipulation

Yes, I could fetch the data from CDC’s site. Since I just wanted to build a proof-of-concept, I eyeballed the numbers on the plot for the overall suicide rates.

suicides <- data.frame(age_group = c('15-24', '25-44', '45-64', '65-74', '75+'), t1999 = c(15,21,20,25,42), t2014 = c(18,23,30,26,39))

I read the data this way because it is easier to write them in one line than concatenate all the rows. Plus, I get to use some reshape magic.

suicides <- melt(suicides, id.vars = "age_group", variable.name = "year", value.name = "deaths")

With melt, we can make the wide data narrow i.e. bring columns as rows. Here, I created rows for each year and age group.

Now that we have the year field with values of t1999 and t2014, we will convert them to numbers. I had to add some text before the variable name because R doesn't allow column names that start with a number.

suicides$year <- as.numeric(substr(suicides$year, 2,5))

Step 3: Selecting a Theme

Now for some fun. The package ggthemes provides many themes that make generating nice plots very easy. This package saves a lot of time because you don't have to a) find the right parameter to adjust and b) find the "good" values for those parameters. For example, you may want to change the way axis labels look. The parameter for that axis.text in the theme function. But, then you have to rely on element_text to actually change the property. These are very nice features, really. It just takes a lot of time to find and add the values that look good.

So, for this chart, I applied all the themes to see what the basic graph looked like. You can use this command to find out all the themes.

ls("package:ggthemes")[str_detect(ls("package:ggthemes"), "theme_")]

## [1] "theme_base" "theme_calc" ## [3] "theme_economist" "theme_economist_white" ## [5] "theme_excel" "theme_few" ## [7] "theme_fivethirtyeight" "theme_foundation" ## [9] "theme_gdocs" "theme_hc" ## [11] "theme_igray" "theme_map" ## [13] "theme_pander" "theme_par" ## [15] "theme_solarized" "theme_solarized_2" ## [17] "theme_solid" "theme_stata" ## [19] "theme_tufte" "theme_wsj"

Ironically, theme that got me the closest result was the Wall Street Journal theme. Weird! I know.

Step 4: Building the NYT Like Data Visualization

Create the basic plot

It is really a line plot, so we create a plot with year on the x-axis and deaths on the y-axis. We also add colors using the age group variable. All these are parameters for the aes function. We want to make the lines little bit thicker and make the colors translucent. We do so with this command:

g <- ggplot(data = suicides, aes(x = year, y = deaths, color = age_group)) + geom_line(size = 1.2, alpha = 0.7)

The resulting graph looks like this:

nyt-best-data-visualization-plot

and your reaction might be:

Well, hold on. We are making this chart better one step at a time.

Let's add the "circles" at the end:

g <- g + geom_point()

With circles at the end:
nyt-best-data-visualization-plot

Change the Line Colors

You can manually assign colors to each of the age group value with this command.

g <- g + scale_color_manual(values = c("45-64" = "black", '15-24' = "darkturquoise", '25-44' = "darkturquoise",'65-74' = "darkturquoise", '75+' = "darkturquoise"))

Now with changed line colors:
nyt-best-data-visualization-plot
I played with a few colors before I decided on darkturquoise, which I knew, of course, because that's how cool I am ?

To find all the named colors in R, you can use the colors() function. You can also use a website like http://imagecolorpicker.com/ to find the colors in an image.

head(colors())

## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1" ## [5] "antiquewhite2" "antiquewhite3"

Modify the Axes

We will use the scale_*_continuous function to add breaks, set axis limits, and trim the extra space around axis labels.

g <- g + scale_x_continuous(breaks = c(1999, 2014), limits = c(1999, 2014.5), expand = c(0, 1)) + scale_y_continuous(breaks = c(20, 30, 40))

After changing the axes properties:
nyt-best-data-visualization-plot

Apply the WSJ Theme

This is straightforward. We just add the function theme_wsj from the ggthemes package to our plot.

g <- g + theme_wsj() 

Creating a chart like this:

nyt-best-data-visualization-plot

Add Text Annotations

We will use the annotate function from ggplot to add custom text labels at the desired locations. I chose some positions that looked good. You may want to change them.

Here is how it looks:

g <- g + annotate("text", x = 2000.5, y = 38, label = "suicides per\n 100,000 people")
g <- g + annotate("text", x = 2014, y = 43, label = "Age\nGroup", size = 3, fontface = "bold", hjust = .2)

nyt-best-data-visualization-plot

Remove the Legend and Clean up the Background

g <- g + theme(legend.position = "none", plot.background = element_rect(fill = "white"), panel.background = element_rect(fill = "white"))
  

After cleaning up:
nyt-best-data-visualization-plot

Add Line Legends

We will also use the geom_dl function from the directlabels package to add text annotations at the end of the lines. This is really a best practice: rather than placing legends on the chart and make the reader match all the legend colors with the line values, placing the labels next to the lines makes it very easy for the reader to access the information.

 g <- g + geom_dl(aes(label = age_group, x = year + 0.2), method = "last.qp", cex = 0.5)

After the labels:
nyt-best-data-visualization-plot

Make the Axis Ticks Visible

Like the NYT chart, we will make the axis ticks thicker as well as longer.

NYT like axis ticks:

 g <- g + theme(axis.ticks.x = element_line(size = 1), axis.ticks.length=unit(0.2,"cm"))

And that's it. We are done. Here's how our final chart looks like:

nyt-best-data-visualization-plot

print(g)

Save your version of the NYT Data Visualization

You can use the convenient wrapper ggsave function to save this plot to a png.

ggsave(filename = "NYT_Suicide_Rates.png", plot = g, width = 4, height = 4)

It took a few steps, but we got pretty close to the original NYT data visualization. A great thing about this approach: this is reproducible and repeatable, something that I mentioned while comparing Tableau and R. When you are working in teams, it becomes critical that others know how you completed a task. It is equally important even when you are working by yourself as our memory as not as strong as we claim it to be.

There you have it: a NYT data visualization using R with many thanks to Hadley Wickham and other R contributors. The ggplot library is very extensive and you can achieve many things, but the biggest roadblock is finding the parameters for the right functions or aesthetics. If you are stuck, stackoverflow is a great option and as is the ggplot documentation.

I hope this gives you many ideas and I'd love to know what you think about this approach.

Here's the complete R script: library(ggplot2) library(reshape2) library(ggthemes) library(directlabels) library(stringr) suicides <- data.frame(age_group = c('15-24', '25-44', '45-64', '65-74', '75+'),t1999 = c(15,21,20,25,42),t2014 = c(18,23,30,26,39)) suicides <- melt(suicides, id.vars = "age_group", variable.name = "year", value.name = "deaths") suicides$year <- as.numeric(substr(suicides$year, 2,5)) ls("package:ggthemes")[str_detect(ls("package:ggthemes"), "theme_")] g <- ggplot(data = suicides, aes(x = year, y = deaths, color = age_group)) + geom_line(size = 1.2, alpha = 0.7) + geom_point() g <- g + scale_color_manual(values = c("45-64" = "black", '15-24' = "darkturquoise", '25-44' = "darkturquoise",'65-74' = "darkturquoise", '75+' = "darkturquoise")) g <- g + scale_x_continuous(breaks = c(1999, 2014), limits = c(1999, 2014.5), expand = c(0, 1)) + theme_wsj() g <- g + scale_y_continuous(breaks = c(20, 30, 40)) g <- g + annotate("text", x = 2000.5, y = 38, label = "suicides per\n 100,000 people") g <- g + annotate("text", x = 2014, y = 43, label = "Age\nGroup", size = 3, fontface = "bold", hjust = .2) g <- g + theme(legend.position = "none", plot.background = element_rect(fill = "white"), panel.background = element_rect(fill = "white")) g <- g + geom_dl(aes(label = age_group, x = year + 0.2), method = "last.qp", cex = 0.5) g <- g + theme(axis.ticks.x = element_line(size = 1), axis.ticks.length=unit(0.2,"cm")) print(g) ggsave(filename = "NYT_Suicide_Rates.png", plot = g, width = 4, height = 4)

What do you think?

The post How to Create a Data Visualization from the New York Times in R appeared first on nandeshwar.info.

To leave a comment for the author, please follow the link and comment on their blog: R – nandeshwar.info.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)