
Scrape HTML Table using rvest

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

In this tutorial, we’ll scrape an HTML table from Wikipedia with rvest and process the data to find insights in it (or, more modestly, to build a data visualization).

YouTube – https://youtu.be/KCUj7JQKOJA

Why?

As a Data Scientist or Data Analyst, you will often find that the data you need is not readily available, so it’s handy to know skills like web scraping to collect your own data. While web scraping is a vast area, this tutorial focuses on one particular aspect of it: scraping (extracting) tables from web pages.
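Before touching a live site, it can help to see what "extracting a table" means on a page small enough to read in full. The sketch below is not part of the original tutorial: the HTML and its film names and numbers are made up, and it assumes rvest 1.0.0 or later, where minimal_html() is available.

```r
library(rvest)

# A tiny hand-written page standing in for a real website (made-up data)
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Gross</th></tr>
    <tr><td>Film A</td><td>$100</td></tr>
    <tr><td>Film B</td><td>$250</td></tr>
  </table>')

# html_table() on a whole document returns a list: one data frame per <table>
toy_tables <- html_table(page)

# Pick the first (and here only) table out of the list
toy <- toy_tables[[1]]
toy
```

The Gross column comes back as text because of the dollar sign, which is why scraped currency columns usually need a parsing step before they can be sorted or summed.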

Code

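library(tidyverse)
library(rvest)    # read_html() and html_table(); not attached by library(tidyverse)
library(janitor)  # clean_names()

# Download and parse the Wikipedia page
content <- read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_States_and_Canada")

# Extract every <table> on the page as a list of data frames
tables <- content %>% html_table(fill = TRUE)

# Keep the first table and drop its first row
first_table <- tables[[1]]
first_table <- first_table[-1, ]

# Convert the column names to snake_case
first_table <- first_table %>% clean_names()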

# Top 20 films by lifetime_gross
first_table %>% 
  mutate(lifetime_gross = parse_number(lifetime_gross)) %>%  # currency text to numeric
  arrange(desc(lifetime_gross)) %>% 
  head(20) %>% 
  mutate(title = fct_reorder(title, lifetime_gross)) %>%     # order the bars by gross
  ggplot() +
  geom_col(aes(y = title, x = lifetime_gross), fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")



# Same plot, using the lifetime_gross_2 column instead
first_table %>% 
  mutate(lifetime_gross_2 = parse_number(lifetime_gross_2)) %>%  # currency text to numeric
  arrange(desc(lifetime_gross_2)) %>% 
  head(20) %>% 
  mutate(title = fct_reorder(title, lifetime_gross_2)) %>%       # order the bars by gross
  ggplot() +
  geom_col(aes(y = title, x = lifetime_gross_2), fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")
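A column named lifetime_gross_2 is never typed by hand; a name like that is what clean_names() produces when two columns share the same header text. Whether that is exactly what happens with the Wikipedia table is an assumption on our part, so the sketch below uses a small made-up data frame (toy and its values are hypothetical) to show the renaming behaviour in isolation.

```r
library(janitor)

# Hypothetical stand-in for a scraped table whose header repeats a name
toy <- data.frame(x1 = c("Film A", "Film B"),
                  x2 = c("$100", "$250"),
                  x3 = c("$90", "$300"))
names(toy) <- c("Title", "Lifetime gross", "Lifetime gross")

# clean_names() converts to snake_case and numbers the duplicates
cleaned <- clean_names(toy)
names(cleaned)
# "title" "lifetime_gross" "lifetime_gross_2"
```

After scraping any table, it is worth running names() on the cleaned data frame to confirm which columns you actually got before plotting them.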



# Second table: total adjusted gross per year
second_table <- tables[[2]] %>% 
  clean_names()

second_table %>% 
  mutate(adjusted_gross = parse_number(adjusted_gross)) %>%  # currency text to numeric
  group_by(year) %>% 
  summarise(total_adjusted_gross = sum(adjusted_gross)) %>% 
  arrange(desc(total_adjusted_gross)) %>% 
  ggplot() +
  geom_line(aes(x = year, y = total_adjusted_gross, group = 1))
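Every plot above starts with a parse_number() step, because scraped money columns arrive as text rather than numbers. Here is a small standalone sketch of what that step does; the input strings are made up, and we are assuming the Wikipedia cells look broadly like them (a currency symbol plus comma-separated digits).

```r
library(readr)

# Made-up examples of how a gross figure might appear in a scraped cell
gross_text <- c("$858,373,000", "$1,234")

# parse_number() drops the currency symbol and grouping commas
gross_num <- parse_number(gross_text)
gross_num
```

Once the column is numeric, arrange(), sum() and the plot axes behave as expected; on the raw text they would sort alphabetically or fail outright.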
