Map the Life Expectancy in the United States with Data from Wikipedia

August 5, 2016

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Recently, I became interested in scraping data from web pages, such as Wikipedia, and visualizing it with R. As in my previous post, I use the rvest package to get the data from the web page and the ggplot2 package to visualize it.

In this post, I will map the life expectancy of White and African American people in the US.

Load the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)
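
Two packages are used later without being loaded explicitly: map_data("state") needs the maps package to be installed, and coord_map() needs mapproj. A minimal sketch for checking this up front (the variable names needed and not_installed are my own, just for illustration):

```r
# packages loaded in this post, plus maps (used by map_data("state"))
# and mapproj (used by coord_map())
needed = c("rvest", "ggplot2", "dplyr", "scales", "maps", "mapproj")

# requireNamespace() returns FALSE for packages that are not installed
is_installed = vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed = needed[!is_installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}
```

Anything listed in the message can then be installed with install.packages().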

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")

le = le %>%
  html_nodes("table") %>%
  .[[2]]%>%
  html_table(fill=T)
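
Picking the table by position (.[[2]]) depends on the page layout at the time of writing, and Wikipedia pages change. As a hedged alternative, the sketch below selects the first table whose text mentions "Life Expectancy". The HTML is a made-up toy page, not the real article, and le_demo is a name I use only for this example:

```r
library(rvest)

# toy page standing in for the Wikipedia article (invented HTML)
page = read_html('
<html><body>
  <table><tr><th>Note</th></tr><tr><td>navigation box</td></tr></table>
  <table>
    <tr><th>Rank</th><th>State</th><th>Life Expectancy, All</th></tr>
    <tr><td>1</td><td>Hawaii</td><td>81.3</td></tr>
    <tr><td>2</td><td>Minnesota</td><td>81.1</td></tr>
  </table>
</body></html>')

tables = html_nodes(page, "table")

# TRUE for every table whose text contains the column header we need
has_le = grepl("Life Expectancy", html_text(tables), fixed = TRUE)

# parse only the first matching table
le_demo = html_table(tables[[which(has_le)[1]]])
le_demo
```

On the real page the same idea would apply to the object returned by read_html(), but the search string may need adjusting if the column headers change.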

Now I have to clean the data. Below I explain what each piece of code does.

## CLEAN THE DATA ####
# check the structure of dataset
str(le)
'data.frame':	54 obs. of  417 variables:
 $ X1  : chr  "" "Rank\nState\nLife Expectancy, All\n(in years)\nLife Expectancy, African American\n(in years)\nLife Expectancy, Asian American\n"| __truncated__ "Rank" "1" ...
 $ X2  : chr  NA "Rank" "State" "Hawaii" ...
 $ X3  : chr  NA "State" "Life Expectancy, All\n(in years)" "81.3" ...
 $ X4  : chr  NA "Life Expectancy, All\n(in years)" "Life Expectancy, African American\n(in years)" "-" ...
 $ X5  : chr  NA "Life Expectancy, African American\n(in years)" "Life Expectancy, Asian American\n(in years)" "82.0" ...
 $ X6  : chr  NA "Life Expectancy, Asian American\n(in years)" "Life Expectancy, Latino\n(in years)" "76.8" ...
 $ X7  : chr  NA "Life Expectancy, Latino\n(in years)" "Life Expectancy, Native American\n(in years)" "-" ...
.....
.....

# select only columns with data
le = le[c(1:8)]

# get the names from 3rd row and add to columns
names(le) = le[3,]

# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]

# rename the names of 4th and 5th column
names(le)[c(4,5)] = c("le_black", "le_white")

# convert the variables to numeric ("-" entries become NA, with a warning)
le = le %>% 
  mutate(
    le_black = as.numeric(le_black), 
    le_white = as.numeric(le_white))
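
The table marks missing values with "-", so as.numeric() turns them into NA and prints a coercion warning. If you prefer to make that step explicit, here is a small sketch on a toy vector (the values are copied from the output shown in this post; raw and clean are illustration names only):

```r
# "-" marks a missing value in the scraped table
raw = c("-", "79.7", "77.8", "75.1")

# replace the placeholder with NA first, then convert; no warning is raised
clean = as.numeric(ifelse(raw == "-", NA_character_, raw))
clean
```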

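# check the structure of the dataset again
# (the output below was evidently captured after the later steps, so it already lists le_diff and region)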
str(le)
'data.frame':	51 obs. of  7 variables:
 $ Rank                            : chr  "1" "2" "3" "4" ...
 $ State                           : chr  "Hawaii" "Minnesota" "Connecticut" "California" ...
 $ Life Expectancy, All
(in years): chr  "81.3" "81.1" "80.8" "80.8" ...
 $ le_black                        : num  NA 79.7 77.8 75.1 78.8 77.4 NA NA 75.5 NA ...
 $ le_white                        : num  80.4 81.2 81 79.8 80.4 80.5 80.4 80.1 80.3 80.1 ...
 $ le_diff                         : num  NA 1.5 3.2 4.7 1.6 ...
 $ region                          : chr  "hawaii" "minnesota" "connecticut" "california" ...

Since life expectancy differs between White and African American people, I will calculate the difference for each state and map it.

le = le %>% mutate(le_diff = (le_white - le_black))
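
To see what this computes, here is a small sketch on four states, with values copied from the str() output in this post. It also sorts by the gap; le_demo is just an illustration name. Note that a state with a missing le_black gets a missing le_diff as well:

```r
library(dplyr)

# values copied from the str() output shown in this post
le_demo = data.frame(
  State    = c("Hawaii", "Minnesota", "Connecticut", "California"),
  le_black = c(NA, 79.7, 77.8, 75.1),
  le_white = c(80.4, 81.2, 81.0, 79.8)
)

le_demo = le_demo %>%
  mutate(le_diff = le_white - le_black) %>%  # NA in, NA out
  arrange(desc(le_diff))                     # largest gap first, NA rows last
le_demo
```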

I will load the map data and merge the two datasets together.

## LOAD THE MAP DATA ####
states = map_data("state")
str(states)
'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...

# create a new variable name for state
le$region = tolower(le$State)

# merge the datasets
states = merge(states, le, by="region", all.x=T)
str(states)
'data.frame':	15537 obs. of  12 variables:
 $ region                          : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ long                            : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat                             : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group                           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order                           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ subregion                       : chr  NA NA NA NA ...
 $ Rank                            : chr  "49" "49" "49" "49" ...
 $ State                           : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Life Expectancy, All
(in years): chr  "75.4" "75.4" "75.4" "75.4" ...
 $ le_black                        : num  72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 ...
 $ le_white                        : num  76 76 76 76 76 76 76 76 76 76 ...
 $ le_diff                         : num  3.1 3.1 3.1 3.1 3.1 ...
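
One thing to watch for: merge() sorts its result by the key column by default, so the rows are not guaranteed to stay in the drawing order that geom_polygon() relies on. The sketch below uses toy data frames (shape and vals are invented stand-ins for the map data and the scraped table) to show how to restore the order and how to list regions that found no match:

```r
# toy stand-ins: `shape` for map_data("state"), `vals` for the scraped table
shape = data.frame(
  region = c("b", "b", "a", "a", "c"),
  order  = 1:5,
  long   = c(0, 1, 2, 3, 4),
  lat    = c(0, 1, 0, 1, 0)
)
vals = data.frame(region = c("a", "b"), le_diff = c(3.1, 1.5))

merged = merge(shape, vals, by = "region", all.x = TRUE)
merged$order   # may no longer be 1, 2, 3, 4, 5

# put the polygon points back into drawing order
merged = merged[order(merged$order), ]

# regions in the map data with no row in the scraped table
unmatched = setdiff(unique(shape$region), vals$region)
unmatched
```

With the real data, the equivalent step would be states = states[order(states$order), ] right after the merge.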

Now it's time to make the plot. First I will plot the life expectancy of African Americans in the US. For a few states we don't have data, so I will color them grey.

## MAKE THE PLOT ####

# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()

Here is the plot:
[Figure: Life expectancy in African American]

The code below is for White people in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="Gray", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in White") +
  coord_map()

Here is the plot:
[Figure: Life expectancy in White]

Finally, I will map the differences between White and African American people in the US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
[Figure: Differences in Life Expectancy between White and African Americans by States in US]
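
The three maps share almost all of their code, so the repetition could be folded into one function. This is only a sketch: plot_le() is a hypothetical helper of my own, it assumes a ggplot2 version that supports the .data pronoun inside aes(), and coord_map() still needs the mapproj package when the plot is drawn. The toy data frame stands in for the merged states data:

```r
library(ggplot2)
library(scales)

# plot_le() is a hypothetical helper, not part of any package
plot_le = function(data, fill_col, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_col]])) +   # column chosen by name
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                        guide = "colorbar", na.value = "#eeeeee",
                        breaks = pretty_breaks(n = 5)) +
    labs(title = title) +
    coord_map()
}

# toy square standing in for the merged `states` data frame
toy = data.frame(long = c(-87, -86, -86, -87),
                 lat  = c(30, 30, 31, 31),
                 group = 1,
                 le_diff = 3.1)

p = plot_le(toy, "le_diff", "Toy example")
```

With the real data, the calls would look like plot_le(states, "le_black", "Life expectancy in African American"), and likewise for le_white and le_diff.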

On my previous post I got a comment asking for a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. You have to install the plotly package, store the ggplot code above in an object, like map_plot <- ggplot(states, ... , and then call ggplotly(map_plot) to plot it.
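
A minimal sketch of that idea, assuming the plotly package is installed. The toy data frame stands in for the merged states data, and I leave coord_map() out to keep the example small:

```r
library(ggplot2)
library(plotly)

# toy square standing in for the merged `states` data frame
toy = data.frame(long = c(-87, -86, -86, -87),
                 lat  = c(30, 30, 31, 31),
                 group = 1,
                 le_diff = 3.1)

# store the ggplot in an object instead of printing it
map_plot = ggplot(toy, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white")

# convert the stored plot to an interactive plotly widget
interactive_map = ggplotly(map_plot)
interactive_map
```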

That's all! Leave a comment below if you have any questions.
