Plotting US Metro Area GDP with ggplot

March 23, 2018

(This article was first published on r-bloggers | SHARP SIGHT, and kindly contributed to R-bloggers)



It’s clear that there are some economic shifts happening in the world, if not the US itself.

In light of this, I decided to do some simple investigation into the economic performance of US cities.

This is, by the way, one of the main reasons to master data science. Once you know a few critical skills, you will be able to very rapidly get some basic information about (almost) any topic.

In a case such as this (when you’re just personally interested), you can just scrape some data and plot it.

But if you’re working in a business, you will need to be able to generate these types of insights quickly. A large part of your job will be gathering data and quickly plotting it in ways that generate insight …

Plotting GDP data for top US cities

In the following code, we’ll scrape some data about US cities and plot a line chart using ggplot2.

There’s actually quite a bit more that we could do with this data, so feel free to create your own plots and leave the code in the comments below.

#==============
# LOAD PACKAGES
#==============
library(tidyverse)
library(stringr)
library(forcats)
library(rvest)
library(ggthemes)


#============
# SCRAPE DATA
#============
df.metro_gdp <- read_html('https://en.wikipedia.org/wiki/List_of_U.S._metropolitan_areas_by_GDP') %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table() %>% 
  as_tibble()


#=======================
# REMOVE 'Rank' VARIABLE
#=======================
df.metro_gdp <- df.metro_gdp %>% 
  select(-Rank)


#================
# RENAME VARIABLE
#================
df.metro_gdp <- df.metro_gdp %>% rename(metro_area = `Metropolitan area`)


# inspect
df.metro_gdp


# REMOVE 'MSA' FROM metro_area
df.metro_gdp <- df.metro_gdp %>% mutate(metro_area = str_replace(metro_area, ' MSA', ''))


# COERCE 'metro_area' TO FACTOR
df.metro_gdp <- df.metro_gdp %>% mutate(metro_area = metro_area %>% as_factor())


#========================================================
# CREATE NEW VARIABLE: 
# - the original 'metro_area' variable is rather long
#   because it's a full 'metropolitan statistical area'
# - we can abbreviate these as the plain city name
# - we'll call the new variable 'metro_area_brief'
#========================================================

# get unique values
df.metro_gdp %>% 
  select(metro_area) %>% 
  unique()


#---------------------------------------------------------
# RECODE VALUES
# here we will create the new variable 'metro_area_brief'
#---------------------------------------------------------
df.metro_gdp <- df.metro_gdp %>% 
  mutate(metro_area_brief = recode(metro_area,'New York–Northern New Jersey–Long Island, NY–NJ–PA' = 'New York'
         ,'Los Angeles–Long Beach–Santa Ana, CA' = 'Los Angeles'
         ,'Chicago–Joliet–Naperville, IL–IN–WI' = 'Chicago'
         ,'Dallas–Fort Worth–Arlington, TX' = 'Dallas'
         ,'Washington–Arlington–Alexandria, DC–VA–MD–WV' = 'Washington DC'
         ,'Houston–Sugar Land–Baytown, TX' = 'Houston'
         ,'San Francisco–Oakland–Fremont, CA' = 'San Francisco'
         ,'Philadelphia–Camden–Wilmington, PA–NJ–DE–MD' = 'Philadelphia'
         ,'Boston–Cambridge–Quincy, MA–NH' = 'Boston'
         ,'Atlanta–Sandy Springs–Marietta, GA' = 'Atlanta'
         ))



# INSPECT VALUES
df.metro_gdp %>% glimpse()
df.metro_gdp %>% select(metro_area_brief)


# CHECK TABLE OF CROSS-VALUES
df.metro_gdp %>% 
  distinct(metro_area, metro_area_brief)


#======================
# RESHAPE: WIDE TO LONG
#======================
df.metro_gdp <- df.metro_gdp %>% gather(key = year, value = gdp_nominal, -metro_area, -metro_area_brief)
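A side note on the reshape step: gather() still works, but tidyr has since superseded it in favor of pivot_longer(). The following is a minimal sketch of the same wide-to-long reshape on a toy table. The metro names and values are made up for illustration, and the year column names in the scraped table may differ.

```r
library(tibble)
library(tidyr)

# Toy wide table: one row per metro, one column per year.
# Names and values are hypothetical, not taken from the scraped data.
toy_wide <- tibble(
  metro_area_brief = c("Metro A", "Metro B"),
  `2016` = c(100, 50),
  `2017` = c(110, 55)
)

# pivot_longer() stacks every column except metro_area_brief
# into a 'year' column and a 'gdp_nominal' column
toy_long <- pivot_longer(
  toy_wide,
  cols = -metro_area_brief,
  names_to = "year",
  values_to = "gdp_nominal"
)

toy_long
```

The one visible difference is row order: pivot_longer() keeps the rows for each metro together, while gather() stacks the data one year column at a time. For plotting, that difference does not matter.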


#========================
# COERCE 'year' TO FACTOR
#========================
df.metro_gdp <- df.metro_gdp %>% mutate(year = year %>% as.factor())


#===========================================
# WRANGLE AND COERCE 'gdp_nominal' TO DOUBLE
#===========================================
df.metro_gdp <- mutate(df.metro_gdp, gdp_nominal = str_remove_all(gdp_nominal, ",") %>% as.double())
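One thing to watch for in this step: if a scraped cell contains anything besides digits and commas (a footnote marker, for example), as.double() returns NA for it with a warning. Whether the Wikipedia table contains such cells is an assumption here, not something checked. As a rough sketch, parse_number() from readr is a more forgiving alternative, shown on a made-up vector:

```r
library(readr)

# Hypothetical raw values: one clean, one with a trailing footnote marker
gdp_raw <- c("1,234,567", "98,765[a]")

# parse_number() drops the grouping commas and ignores non-numeric
# text before or after the first number it finds
gdp_num <- parse_number(gdp_raw)

gdp_num
```

Either way, it is worth running something like sum(is.na(df.metro_gdp$gdp_nominal)) after the conversion to confirm nothing was silently lost.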


#===========
# BASIC PLOT
#===========
ggplot(df.metro_gdp, aes(x = year, y = gdp_nominal, group = metro_area_brief)) +
  geom_line(aes(color = metro_area_brief))



#==========
# FORMATTED
#==========


df.metro_gdp %>% 
  mutate(highlight_flag = metro_area_brief == 'New York') %>%
  ggplot(aes(x = year, y = gdp_nominal, group = metro_area_brief)) +
    geom_line(aes(color = highlight_flag, alpha = highlight_flag), size = 1.5) +
    scale_color_manual(values = c('grey', 'red')) +
    scale_alpha_manual(values = c(.7, 1)) +
    labs(title = 'New York is the best performing US city by metro GDP'
         ,subtitle = str_c("Consistently, New York has a much higher GDP than other metro areas."
                           ,"\n77% higher than next highest metro in 2017.")
         ,y = "Nominal GDP\n(metro area, millions of dollars)"
         ,x = 'Year') +
    theme(legend.position = 'none'
          ,text = element_text(color = '#3A3A3A'
                               ,family = 'sans')
          ,plot.title = element_text(margin = margin(b = 10)
                                     ,face = 'bold'
                                     ,size = 20)
          ,axis.title = element_text()
          ,plot.subtitle = element_text(size = 12)
          ) +
     scale_y_continuous(labels = scales::comma_format())


And here is the finalized chart:
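Note that the "77% higher" figure in the subtitle is typed in by hand. If you want to derive that kind of number from the data instead, here is a minimal sketch on a toy table with the same column names as the long-format df.metro_gdp. The metros and values are invented, and it assumes the latest year is labeled "2017".

```r
library(dplyr)
library(tibble)

# Toy long-format data; values are made up for illustration
toy_gdp <- tibble(
  metro_area_brief = rep(c("Metro A", "Metro B", "Metro C"), times = 2),
  year = rep(c("2016", "2017"), each = 3),
  gdp_nominal = c(700, 480, 300, 750, 500, 320)
)

# Keep the two largest metros in 2017, then compare them
gap_2017 <- toy_gdp %>%
  filter(year == "2017") %>%
  arrange(desc(gdp_nominal)) %>%
  slice(1:2) %>%
  summarise(
    top_metro  = first(metro_area_brief),
    pct_higher = (first(gdp_nominal) / last(gdp_nominal) - 1) * 100
  )

gap_2017
```

In this toy data the top metro is 50% larger than the runner-up (750 vs. 500). Pointing the same pipeline at df.metro_gdp should give the figure for the real table, which you could then paste into the subtitle with str_c().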


