NYT and WaPo Data Visualizations on Carbon Emissions Recreated in R

[This article was first published on R – nandeshwar.info, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After President Trump declared that the U.S. was pulling out of the Paris agreement on climate change, the New York Times and Washington Post published a couple of stories. In these stories, they included a few charts. They caught my eye and I wanted to see whether I could tell similar stories and draw some more conclusions using the data and R. In this post, similar to my previous posts, you will see my efforts to create the NYT and Washington Post data visualizations recreated in R.

Reasons these graphics are good

  • Color: both the NYT and WaPo chose simple color schemes. They didn’t use the colors to “prettify”, but actually to distinguish countries. The NYT also followed the same color scheme throughout the article.
  • Chart Type: all the graphics have the right chart types. Area charts, which really are extensions of line charts, are good choices to show growth over time. I personally prefer line charts, because they gave similar information without cluttering the plot. Horizontal-bar charts are also good choices for showing a measure across categories. Turning the axis helps read the category labels with ease. Although the circle charts, a.k.a bubble charts, look good, they take a lot of space to inform the readers.

Experts Advice on Colors and Graphics

Noah-Iliinsky-steps-for-data-visualization-graphics

stephen-few-rules-advice-color

dona-wong-data-visualization-color-graphics

NYT Area Graph Visualization on CO2 Emissions NYT Stacked Area Graph Visualization on CO2 Emissions NYT Bar Graph Visualization on Per Capita Emissions

NYT Graphics

Let’s create the NYT graphics first.

Load favorite libraries

Let’s start with loading all of our favorite libraries.

library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(ggmap)
library(readr)
library(maps)
library(tidyr)
library(rvest)

Read Data

Get the data from the source. Here, I am using the carbon emissions data by each nation. The missing values are denoted by a period and the first four lines have comments. I specified those as additional arguments for the read_csv function.

emissions_by_countries 

This data looks like this:

glimpse(emissions_by_countries)

## Observations: 17,232
## Variables: 10
## $ country                  <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHAN...
## $ year                     <int> 1949, 1950, 1951, 1952, 1953, 1954, 1...
## $ total_emissions          <int> 4, 23, 25, 25, 29, 29, 42, 50, 80, 90...
## $ gas_fuel_emissions       <int> 4, 6, 7, 9, 10, 12, 17, 17, 21, 25, 3...
## $ liquid_fuel_emissions    <int> 0, 18, 18, 17, 18, 18, 25, 33, 59, 65...
## $ solid_fuel_emissions     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gas_flaring_emissions    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 6...
## $ cement_prod_emissions    <int> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ per_capita_emission_rate <dbl> NA, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0...
## $ bunker_fuel_emissions    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Next, let’s do some clean-up and add some calculations. As per the data file, to get the CO2 emissions, we need to multiply the carbon emissions by a constant of 3.667, because one ton of carbon equals 3.667 tons of carbon dioxide gas.

emissions_by_countries 

Now the data looks like this:

glimpse(emissions_by_countries)

## Observations: 17,232
## Variables: 12
## $ country                      <chr> "Afghanistan", "Afghanistan", "Af...
## $ year                         <int> 1949, 1950, 1951, 1952, 1953, 195...
## $ total_emissions              <int> 4, 23, 25, 25, 29, 29, 42, 50, 80...
## $ gas_fuel_emissions           <int> 4, 6, 7, 9, 10, 12, 17, 17, 21, 2...
## $ liquid_fuel_emissions        <int> 0, 18, 18, 17, 18, 18, 25, 33, 59...
## $ solid_fuel_emissions         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ gas_flaring_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, ...
## $ cement_prod_emissions        <int> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ per_capita_emission_rate     <dbl> NA, 0.00, 0.00, 0.00, 0.00, 0.00,...
## $ bunker_fuel_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ co2_total_emissions          <dbl> 14.668, 84.341, 91.675, 91.675, 1...
## $ co2_per_capita_emission_rate <dbl> NA, 0.00000, 0.00000, 0.00000, 0....

Get EU Countries

These charts use a group for the European Union. So, let’s get those countries from the EU’s site.

eu_countries %
  html_node(xpath = '//*[@id="year-entry2"]/table') %>% 
  html_table() %>% 
  `names%
  gather(value = countries) %>%
  select(-key) %>%
  mutate(EU = 'EU')

Line by line explanation of what’s happening in this code:

  • download the whole page using read_html
  • using Chrome’s Inspect and Copy Xpath select the table we need
  • convert the html to a data frame
  • since both columns has the same heading, give them some dummy names. dplyr doesn’t like duplicate column names
  • fold both the columns into one using gather function from the package tidyr
  • remove the extra column
  • add a field to denote the EU

This data looks like this:

glimpse(eu_countries)

## Observations: 28
## Variables: 2
## $ countries <chr> "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus...
## $ EU        <chr> "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU"...

Let’s make sure that all the EU countries exist in the carbon emissions data set.

filter(eu_countries, !(countries %in% emissions_by_countries$country)) %>% select(countries)

##   countries
## 1    France
## 2     Italy

Since France and Italy has some additional text, let’s change those:

emissions_by_countries 

Let’s now merge the EU data frame with the emissions data frame and add the EU column:

emissions_by_countries %
  mutate(EU = ifelse(is.na(EU), 'Non-EU', EU))

This data frame looks like this:

glimpse(emissions_by_countries)

## Observations: 17,232
## Variables: 13
## $ country                      <chr> "Afghanistan", "Afghanistan", "Af...
## $ year                         <int> 1949, 1950, 1951, 1952, 1953, 195...
## $ total_emissions              <int> 4, 23, 25, 25, 29, 29, 42, 50, 80...
## $ gas_fuel_emissions           <int> 4, 6, 7, 9, 10, 12, 17, 17, 21, 2...
## $ liquid_fuel_emissions        <int> 0, 18, 18, 17, 18, 18, 25, 33, 59...
## $ solid_fuel_emissions         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ gas_flaring_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, ...
## $ cement_prod_emissions        <int> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ per_capita_emission_rate     <dbl> NA, 0.00, 0.00, 0.00, 0.00, 0.00,...
## $ bunker_fuel_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ co2_total_emissions          <dbl> 14.668, 84.341, 91.675, 91.675, 1...
## $ co2_per_capita_emission_rate <dbl> NA, 0.00000, 0.00000, 0.00000, 0....
## $ EU                           <chr> "Non-EU", "Non-EU", "Non-EU", "No...

Let’s also add the economy type column:

emissions_by_countries 

This data now looks like this:

glimpse(emissions_by_countries)

## Observations: 17,232
## Variables: 15
## $ country                      <chr> "Afghanistan", "Afghanistan", "Af...
## $ year                         <int> 1949, 1950, 1951, 1952, 1953, 195...
## $ total_emissions              <int> 4, 23, 25, 25, 29, 29, 42, 50, 80...
## $ gas_fuel_emissions           <int> 4, 6, 7, 9, 10, 12, 17, 17, 21, 2...
## $ liquid_fuel_emissions        <int> 0, 18, 18, 17, 18, 18, 25, 33, 59...
## $ solid_fuel_emissions         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ gas_flaring_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, ...
## $ cement_prod_emissions        <int> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ per_capita_emission_rate     <dbl> NA, 0.00, 0.00, 0.00, 0.00, 0.00,...
## $ bunker_fuel_emissions        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ co2_total_emissions          <dbl> 14.668, 84.341, 91.675, 91.675, 1...
## $ co2_per_capita_emission_rate <dbl> NA, 0.00000, 0.00000, 0.00000, 0....
## $ EU                           <chr> "Non-EU", "Non-EU", "Non-EU", "No...
## $ economy_type                 <chr> "Other countries", "Other countri...
## $ chart_groups                 <chr> "Rest of world", "Rest of world",...

Let’s start plotting

Now that we have our data together, let the fun begin.

First, let’s define a formatter to convert the emissions to billions of units.

formatter_billions 

Next, let’s total the emissions by the group types as used in the NYT graphic:

by_chart_groups % 
  summarize(total_emissions = sum(co2_total_emissions*10^3, na.rm = TRUE)) %>% 
  ungroup() %>%
  mutate(chart_groups = factor(chart_groups, levels = c('United States', 'European Union', 'Other developed', 'China', 'India', 'Rest of world')))
glimpse(by_chart_groups)

## Observations: 1,132
## Variables: 4
## $ year            <int> 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758...
## $ chart_groups    <fctr> European Union, European Union, European Unio...
## $ economy_type    <chr> "Developed economies", "Developed economies", ...
## $ total_emissions <dbl> 9358184, 9361851, 9361851, 9365518, 9369185, 1...

Since we need the U.S. data along with every other group, we need to combine the U.S. data with each of the already grouped data.

us_only % 
  mutate(chart_groups = rep(c('China', 'India', 'European Union', 'Other developed', 'Rest of world', 'United States'), nrow(us_only)),
         economy_type = NA) %>% 
  mutate(chart_groups = factor(chart_groups, levels = c('United States', 'European Union', 'Other developed', 'China', 'India', 'Rest of world')))

Now, the data frame us_repeated has every group’s data, including the U.S., and also has US’s data combined with it.

glimpse(us_repeated)

## Observations: 1,290
## Variables: 4
## $ year            <int> 1800, 1800, 1800, 1800, 1800, 1800, 1801, 1801...
## $ chart_groups    <fctr> China, India, European Union, Other developed...
## $ economy_type    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_emissions <dbl> 253023, 253023, 253023, 253023, 253023, 253023...

Next, let’s prepare the annotations on the graph:

annotations_txt 
glimpse(annotations_txt)

## Observations: 6
## Variables: 4
## $ chart_groups <fctr> United States, European Union, Other developed, ...
## $ x            <dbl> 1850, 1850, 1850, 1850, 1850, 1850
## $ y            <dbl> 8e+09, 8e+09, 8e+09, 8e+09, 8e+09, 8e+09
## $ txt          <fctr> billion metric tons of CO2, 28 countries includi...

Build the color palette. I used imagecolorpicker to get the colors used in the NYT graphic.

econ_type_colors 

Now, let’s build the area graphs:

g 

This command is straightforward: we are using the year column for the x-axis and the total_emissions column for the y-axis. We are then creating facets or panels a.k.a https://en.wikipedia.org/wiki/Small_multiple for the different chart groups.

We get this:

You see all of the US data plotted neatly in its gray beauty! And, your reaction may be like:

Now, let’s make you feel better and make some changes to the axes:

g 

We are limiting the x-axis to two values: 1850 and 2014. We are formatting the y-axis to show billions as a unit and to have the grid lines at 0, 4, and 8 billion.

Next, let’s remove the background behind the panel headings, the plot and also remove the axis ticks.

g 

Now, we add dotted grid lines for the y-axis:

g 

The plot looks like this:

Let’s add some space between the panels and improve the readability of the text by increasing the font sizes.

g 

Now, let’s plot another layer of data for each of the groups. Since we are plotting this data now, the U.S. data will go in the back, just like the NYT graphic.

g 

The plot looks like this:

Just a few final tweaks to move the legend and add the annotations:

g 

The final plot looks like this:

Really nice! Don’t you think?

Critique of this data visualization

Although this graph looks pretty, it doesn’t lead to any critical questioning of the issue. Yes, the U.S. emits a lot. Yes, China grew fast and needs a lot of energy. Yet, the graph fails to create an impact. Perhaps, the comparison of the recent years using a line chart or a double-axis chart will show the sudden rise of China. Or, I hate to say it, but an infographic-y think would help in this case. Something like, one person in the US causes more carbon emissions than 10 people in India (just making this up).

just as important it is to “maximize information to pixel ratio”, you should pay special attention to maximize the information to time ratio and get to actionable insights.

Stacked Area Graph

Unlike the faceted plot, this graph shows all the countries stacked on top of each other.

This graph also shows Russia, so we need to make further manipulations to remove Russia from the Rest of the World category and create its own separate category.

by_chart_groups_rus %
  group_by(year, chart_groups, economy_type) %>% 
  summarize(total_emissions = sum(co2_total_emissions*10^3, na.rm = TRUE)) %>% 
  ungroup() %>%
  bind_rows(., filter(emissions_by_countries, country == 'Russia') %>%
              group_by(year, economy_type) %>% 
              summarize(total_emissions = sum(co2_total_emissions*10^3, na.rm = TRUE)) %>% 
              ungroup() %>%
              mutate(chart_groups = 'Russia')) %>% 
  mutate(chart_groups = factor(chart_groups, levels = c('Rest of world', 'India', 'China', 'Russia', 'Other developed', 'European Union', 'United States'))) %>%
  arrange(year, chart_groups)

This data frame looks like this:

glimpse(by_chart_groups_rus)

## Observations: 1,287
## Variables: 4
## $ year            <int> 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758...
## $ chart_groups    <fctr> European Union, European Union, European Unio...
## $ economy_type    <chr> "Developed economies", "Developed economies", ...
## $ total_emissions <dbl> 9358184, 9361851, 9361851, 9365518, 9369185, 1...

Let’s create this graph now:

g 

Similar to the graph above, we create an area graph, but instead of creating facets, we use the categories to fill with a color. We also add some separation between the plots using the white color. Also, notice the position = "right" argument in the scale_y_continuous function; this will move the y-axis to the right.

This graph looks like this:

Now, let’s change the color of the axis labels and add some more space around the plot:

g 

Last step is adding the country names to the areas. I created a separate data frame with the country names and positions. For the most part, I used the cumulative total of a country to decide the location, but I had to manually add a few locations:

ann_text % 
  arrange(desc(chart_groups)) %>% 
  mutate(cumtot = cumsum(total_emissions)) %>%
  select(chart_groups, y = cumtot) %>%
  mutate(x = 2000) %>%
  mutate(x = ifelse(chart_groups == 'India', 2010, x),
         y = ifelse(chart_groups == 'India', 23.5*10^9, 
                    ifelse(chart_groups == 'China', 16*10^9, 
                           ifelse(chart_groups == 'Rest of world', 30*10^9, y))))
g 

The final plot looks like this:

Not too bad!

Critique of this data visualization

What’s worse than an area chart: a stacked area chart. Although stacked area charts (or bar charts) let us see the cumulative totals and proportions of each category, the readers have a tough time comparing the proportions and telling the differences. Faceted or paneled line charts by country could be a good choice. We can also consider a slopegraph.

carbon emissions slopegraph using R

Get the top 10 by per capita

Next graph in the NYT’s chart is the bar graph of per capita emissions for a select few countries. I selected the top 10 emitters:

top_per_capity_country % top_n(n = 10, wt = co2_per_capita_emission_rate)

Then, I plotted them as a bar graph

g 

This looks very different from NYT’s chart. These are not the top 10 in its chart.

NYT Washington Post Data Visualization on Carbon Emissions R

Get the top as shown in NYT

I guess the NYT wanted to show these specific countries, so I selected them manually here:

nyt_top_10 

And plotted it again:

g 

This is how the NYT bar chart looks like:

Add axis back and grid lines

I would argue that having an axis along with almost invisible grid lines make this chart look better. Also, I would argue that listing the data point and plotting the chart is redundant. But, in this case, we will just go with it.

g 

This is how the modified NYT bar chart looks like:

I think it adds more clarity to the data visualization. What do you think?

Seeing data differently

What if we were to see all the countries as line charts instead of area charts for a select few, do you think it will tell a different story? I wanted to find out.

First, I got only 2014 data for each country:

last_year_emission % select(year, country, co2_total_emissions) %>% mutate(totem = co2_total_emissions*10^3)

Then I plotted this data along with the labels:

g 

Here’s how all the lines shape up:

Although you can’t see all the individual country lines at the bottom, you can see the clear difference in the biggest and smaller emitters.

Focusing on the past 30 years

Do the trends differ if we look at only the past 30 years?

g  1984), aes(x = year, y = co2_total_emissions*10^3, group = country, color = economy_type, label = country)) + geom_line(size = 1) + geom_text(data = last_year_emission, aes(x = year, y = totem, label = country), nudge_x = 2, hjust = 0, check_overlap = TRUE, inherit.aes = FALSE) + scale_x_continuous(expand = c(0.2, 0))
g 

The past 30 years show the sudden increase in China’s emissions compared to the emissions of the rest of the countries:

Washington Post Graphics

The Washington Post used similar data and created its own graphs:

The WaPo graphic makes a point by separating the countries by their participation in the Paris agreement. The U.S. doesn’t have a good company in Nicaragua and Syria.

last_year_emission % 
  mutate(totem = co2_total_emissions*10^3,
         paris_agreement = ifelse(country %in% c('United States Of America', 'Nicaragua', 'Syrian Arab Republic'), 'NOT PART OF AGREEMENT', 'PART OF AGREEMENT'))

At first, I tried the packed bubble graph using the D3 import in the library bubbles

library(bubbles)
with(last_year_emission, bubbles(value = totem, label = country, color = econ_type_colors))

nyt-wapo-data-visualization-carbon-emissions data visualization created in R

That’s doesn’t look too good. Let’s try ggplot. The WaPo graphic has the filled circles of countries neatly laid out. And, if you notice, there are no axes. We have two options: either hard-code every location of a country, or, randomly divide the countries.

I don’t want to hard-code every single location of the country, so I tried to create some groupings of five countries each. I gave group numbers to all the countries ordered descending by the total emissions. Then, I created 20 bins using quartiles to put each country in a category.

last_year_emission % 
  arrange(desc(paris_agreement), desc(totem)) %>% 
  mutate(group_num = rep(1:5, 44),
         y = cut(totem, breaks = quantile(totem, probs = seq(0, 1, 0.05)))) %>%
  filter(!(is.na(y)))
g 

Here how it looks:

Arrggh. That’s ugly. Yoda be like:

The binning didn’t work out.

Let’s try log transformation of the total emissions

g 

Log transformation looks like this:


Better, but still ugly.

Let’s try a clustering way. Let’s create a few clusters first:

emission_clusters 

Then use these cluster values to distribute the countries on the x-axis.

g 

It looks like this:

Still not good. Mr. Bean doesn’t approve:

Then I found a library called packcircles, which creates Tableau-like packed circles/bubbles.

library(packcircles)

Need to get the colors column in the last year emission data frame:

last_year_emission % arrange(desc(paris_agreement), desc(totem))

Create the circle layout:

last_year_emission_cir_layout 

Which returns the x,y coordinates:

glimpse(last_year_emission_cir_layout)

## Observations: 217
## Variables: 3
## $ x      <dbl> -57236.545, 26692.650, 8478.961, 7154.090, 41817.531, 6...
## $ y      <dbl> 0.000, 0.000, -46555.283, 42031.453, 39000.068, 18710.6...
## $ radius <dbl> 57236.545, 26692.650, 23298.678, 19658.169, 15137.570, ...

Then, compute the vertices of these circles:

last_year_emission_cir_layout_vertices 
glimpse(last_year_emission_cir_layout_vertices)

## Observations: 11,067
## Variables: 3
## $ x  <dbl> 0.0000, -451.3273, -1798.1913, -4019.3513, -7079.7783, -109...
## $ y  <dbl> 0.000, 7173.641, 14234.150, 21070.177, 27573.916, 33642.797...
## $ id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

Then, the fun part, plot it:

g 

Which looks like this:

Much better, no?

I wish there was some great learning here rather than knowing the already-known big emitters. Can you think of any other ways to explore this data set and find something that we already didn’t know? Let me know in the comments.

Complete Script

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE, fig.path='figures/nyt-wapo-data-visualization-carbon-emissions-')
library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(ggmap)
library(readr)
library(maps)
library(tidyr)
library(rvest)
emissions_by_countries % 
  `names%
  gather(value = countries) %>%
  select(-key) %>%
  mutate(EU = 'EU')
glimpse(eu_countries)
filter(eu_countries, !(countries %in% emissions_by_countries$country)) %>% select(countries)
emissions_by_countries % 
  mutate(chart_groups = factor(chart_groups, levels = c('United States', 'European Union', 'Other developed', 'China', 'India', 'Rest of world')))
glimpse(us_repeated)
annotations_txt 

The post NYT and WaPo Data Visualizations on Carbon Emissions Recreated in R appeared first on nandeshwar.info.

To leave a comment for the author, please follow the link and comment on their blog: R – nandeshwar.info.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)