Map the Life Expectancy in the United States with Data from Wikipedia

[This article was first published on DataScience+, and kindly contributed to R-bloggers.]

Recently, I became interested in scraping data from web pages, such as Wikipedia, and visualizing it with R. As in my previous post, I use the rvest package to get the data from the web page and the ggplot2 package to visualize it.

In this post, I will map the life expectancy of White and African-American people in the US.

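Load the required packages. Note that map_data() also needs the maps package to be installed, and coord_map() needs the mapproj package.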

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")

# take the second table on the page and convert it to a data frame
le = le %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = TRUE)
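
The index 2 above depends on the page layout, which can change over time. As a minimal sketch, assuming the table of interest is the one that has a State column, the table could be picked by its header instead of its position. The HTML below is a made-up toy page (only the Hawaii row is taken from the data shown in this post), and minimal_html() and html_elements() assume rvest 1.0 or later.

```r
library(rvest)

# toy page with two tables; only the second one has a "State" column
page <- minimal_html('
  <table>
    <tr><th>Note</th></tr>
    <tr><td>some unrelated table</td></tr>
  </table>
  <table>
    <tr><th>Rank</th><th>State</th><th>All</th></tr>
    <tr><td>1</td><td>Hawaii</td><td>81.3</td></tr>
  </table>')

# convert every table on the page to a data frame
tables <- html_table(html_elements(page, "table"))

# keep the first table whose header contains a "State" column
has_state <- vapply(tables, function(tb) "State" %in% names(tb), logical(1))
le_demo <- tables[[which(has_state)[1]]]
```

On the real Wikipedia page the header cells are longer, so the matching rule would have to be adapted to whatever the page currently shows.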

Now I have to clean the data. The comments below explain what each step does.

## CLEAN THE DATA ####
# check the structure of dataset
str(le)
'data.frame':	54 obs. of  417 variables:
 $ X1  : chr  "" "Rank\nState\nLife Expectancy, All\n(in years)\nLife Expectancy, African American\n(in years)\nLife Expectancy, Asian American\n"| __truncated__ "Rank" "1" ...
 $ X2  : chr  NA "Rank" "State" "Hawaii" ...
 $ X3  : chr  NA "State" "Life Expectancy, All\n(in years)" "81.3" ...
 $ X4  : chr  NA "Life Expectancy, All\n(in years)" "Life Expectancy, African American\n(in years)" "-" ...
 $ X5  : chr  NA "Life Expectancy, African American\n(in years)" "Life Expectancy, Asian American\n(in years)" "82.0" ...
 $ X6  : chr  NA "Life Expectancy, Asian American\n(in years)" "Life Expectancy, Latino\n(in years)" "76.8" ...
 $ X7  : chr  NA "Life Expectancy, Latino\n(in years)" "Life Expectancy, Native American\n(in years)" "-" ...
.....
.....

# select only the columns with data
le = le[c(1:8)]

# take the column names from the 3rd row
names(le) = le[3,]

# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]

# rename the 4th and 5th columns
names(le)[c(4,5)] = c("le_black", "le_white")

# convert the variables to numeric ("-" entries become NA, with a coercion warning)
le = le %>% 
  mutate(
    le_black = as.numeric(le_black), 
    le_white = as.numeric(le_white))

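# check the structure of the dataset
# (the output below was captured after le_diff and region, created later in this post, had been added)
str(le)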
'data.frame':	51 obs. of  7 variables:
 $ Rank                            : chr  "1" "2" "3" "4" ...
 $ State                           : chr  "Hawaii" "Minnesota" "Connecticut" "California" ...
 $ Life Expectancy, All
(in years): chr  "81.3" "81.1" "80.8" "80.8" ...
 $ le_black                        : num  NA 79.7 77.8 75.1 78.8 77.4 NA NA 75.5 NA ...
 $ le_white                        : num  80.4 81.2 81 79.8 80.4 80.5 80.4 80.1 80.3 80.1 ...
 $ le_diff                         : num  NA 1.5 3.2 4.7 1.6 ...
 $ region                          : chr  "hawaii" "minnesota" "connecticut" "california" ...

Since life expectancy differs between White and African-American people, I will calculate the difference and map it as well.

le = le %>% mutate(le_diff = (le_white - le_black))
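
To make the calculation concrete, here is a small sketch that repeats it in base R on the first four states from the str() output above. The data frame name demo is only for illustration; rounding to one decimal is my addition, to avoid floating-point noise in the result.

```r
# first four states and their values, copied from the str() output in this post
demo <- data.frame(
  State    = c("Hawaii", "Minnesota", "Connecticut", "California"),
  le_black = c(NA, 79.7, 77.8, 75.1),
  le_white = c(80.4, 81.2, 81.0, 79.8)
)

# difference in years; a missing le_black gives a missing difference
demo$le_diff <- round(demo$le_white - demo$le_black, 1)

# le_diff is NA, 1.5, 3.2, 4.7
demo$le_diff
```

A positive value means that life expectancy is higher for White people in that state, and states without African-American data stay NA.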

I will load the map data and merge the two datasets together.

## LOAD THE MAP DATA ####
states = map_data("state")
str(states)
'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...

# create a lowercase state name to match the region column of the map data
le$region = tolower(le$State)

# merge the datasets
states = merge(states, le, by="region", all.x=TRUE)
str(states)
'data.frame':	15537 obs. of  12 variables:
 $ region                          : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ long                            : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat                             : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group                           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order                           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ subregion                       : chr  NA NA NA NA ...
 $ Rank                            : chr  "49" "49" "49" "49" ...
 $ State                           : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Life Expectancy, All
(in years): chr  "75.4" "75.4" "75.4" "75.4" ...
 $ le_black                        : num  72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 ...
 $ le_white                        : num  76 76 76 76 76 76 76 76 76 76 ...
 $ le_diff                         : num  3.1 3.1 3.1 3.1 3.1 ...
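
Two things are worth checking after a merge like this one: which states did not find a match, and whether the polygon points are still in drawing order. The sketch below uses two tiny made-up data frames in place of states and le (only the Alabama and Hawaii le_white values come from the output above; the order values are invented), so it shows the idea rather than the real result.

```r
# toy stand-in for the map data: two regions, points deliberately out of key order
map_demo <- data.frame(
  region = c("wyoming", "wyoming", "alabama", "alabama"),
  order  = c(1L, 2L, 3L, 4L)
)

# toy stand-in for the life expectancy table
le_demo <- data.frame(
  region   = c("alabama", "hawaii"),
  le_white = c(76.0, 80.4)
)

# states in the table that the map does not contain (dropped by all.x = TRUE)
missing_from_map <- setdiff(le_demo$region, map_demo$region)

# map regions without data (they get NA and are drawn in the na.value color)
missing_from_table <- setdiff(map_demo$region, le_demo$region)

merged <- merge(map_demo, le_demo, by = "region", all.x = TRUE)

# merge() sorts by the key, so restore the original point order for geom_polygon()
merged <- merged[order(merged$order), ]
```

With the real data, the same setdiff() calls on le$region and states$region would list any names that differ between Wikipedia and the map data.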

Now it's time to make the plot. First I will plot the life expectancy of African Americans in the US. For a few states we don't have data, so I will color them grey.

## MAKE THE PLOT ####

# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()

Here is the plot:
[Figure: Le_african_american, map of life expectancy in African Americans by state]

The code below maps the life expectancy of White people in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="Gray", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in White") +
  coord_map()

Here is the plot:
[Figure: Le_white, map of life expectancy in White people by state]

Finally, I will map the differences between White and African-American people in the US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
[Figure: Le_differences, map of the difference in life expectancy between White and African Americans by state]
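
The three plots above differ only in the fill variable and the title, so the shared code could be wrapped in a small helper. This is just a sketch: plot_le() is a hypothetical name of my own, the .data pronoun assumes a reasonably recent ggplot2, and the toy data frame is a single made-up square used only to show that the function builds a plot.

```r
library(ggplot2)
library(scales)

# hypothetical helper: draw one choropleth for the column named in fill_col
plot_le <- function(data, fill_col, title) {
  ggplot(data, aes(x = long, y = lat, group = group, fill = .data[[fill_col]])) +
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                        guide = "colorbar", na.value = "#eeeeee",
                        breaks = pretty_breaks(n = 5)) +
    labs(title = title) +
    coord_map()
}

# toy polygon (one square) standing in for the merged states data
toy <- data.frame(
  long = c(-100, -99, -99, -100),
  lat = c(40, 40, 41, 41),
  group = 1,
  le_white = 80.4
)

p <- plot_le(toy, "le_white", "Life expectancy in White")
```

With the merged data from this post, the calls would look like plot_le(states, "le_black", "Life expectancy in African American"), and likewise for le_white and le_diff.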

In my previous post I got a comment asking for a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. Install the plotly package, store the ggplot code above in an object, like map_plot <- ggplot(states, ... , and then call ggplotly(map_plot) to plot it.
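
As a minimal sketch of that idea, assuming the plotly package is installed: the toy data frame below is a single made-up square standing in for the merged states data (3.1 is Alabama's le_diff from the output above), and coord_map() is left out only to keep the example small.

```r
library(ggplot2)
library(plotly)

# toy polygon standing in for the merged states data
toy <- data.frame(
  long = c(-88, -85, -85, -88),
  lat = c(31, 31, 35, 35),
  group = 1,
  le_diff = 3.1
)

# store the ggplot in an object instead of printing it
map_plot <- ggplot(toy, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white")

# convert the static plot to an interactive one with hover pop-ups
interactive_map <- ggplotly(map_plot)

# print(interactive_map) opens it in the viewer or browser
```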

That's all! Leave a comment below if you have any questions.
