Scrape HTML Table using rvest
[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers].
In this tutorial, we’ll see how to scrape an HTML table from Wikipedia and process the data to find insights in it (or, more modestly, to build a data visualization).
YouTube: https://youtu.be/KCUj7JQKOJA
Why?
As a Data Scientist or Data Analyst, you will often find that the data you need is not readily available, so it’s handy to know skills like web scraping to collect your own data. While web scraping is a vast area, this tutorial focuses on one particular aspect of it: scraping (extracting) tables from web pages.
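To make "extracting a table from a web page" concrete before touching a live site, here is a minimal offline sketch. It assumes the rvest package is installed; the inline HTML, the film names, and the variable names `page` and `demo_table` are made up for illustration and are not part of the original tutorial.

```r
library(rvest)

# A tiny hand-written page standing in for a real web page (hypothetical data).
page <- read_html('
  <html><body>
    <table>
      <tr><th>Title</th><th>Lifetime gross</th></tr>
      <tr><td>Film A</td><td>$1,000</td></tr>
      <tr><td>Film B</td><td>$2,500</td></tr>
    </table>
  </body></html>
')

# Called on a whole document, html_table() returns a list
# with one data frame per <table> element found on the page.
tables <- html_table(page)

# Pick the first (and here, only) table.
demo_table <- tables[[1]]
print(demo_table)
```

The same two calls, `read_html()` followed by `html_table()`, should work when `read_html()` is given a URL instead of an HTML string.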
Code
library(tidyverse)
library(rvest)    # read_html() and html_table(); not attached by library(tidyverse)
library(janitor)  # clean_names()

# Download and parse the Wikipedia page
content <- read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_States_and_Canada")

# Extract every <table> on the page as a list of data frames
tables <- content %>% html_table(fill = TRUE)

# First table: drop the first row and standardise the column names
first_table <- tables[[1]]
first_table <- first_table[-1, ]
first_table <- first_table %>% clean_names()

# Top 20 films by the lifetime_gross column
first_table %>%
  mutate(lifetime_gross = parse_number(lifetime_gross)) %>%
  arrange(desc(lifetime_gross)) %>%
  head(20) %>%
  mutate(title = fct_reorder(title, lifetime_gross)) %>%
  ggplot() +
  geom_bar(aes(y = title, x = lifetime_gross), stat = "identity", fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia ")

# Same plot, using the lifetime_gross_2 column
first_table %>%
  mutate(lifetime_gross_2 = parse_number(lifetime_gross_2)) %>%
  arrange(desc(lifetime_gross_2)) %>%
  head(20) %>%
  mutate(title = fct_reorder(title, lifetime_gross_2)) %>%
  ggplot() +
  geom_bar(aes(y = title, x = lifetime_gross_2), stat = "identity", fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia ")

# Second table: total adjusted gross per year, drawn as a line chart
second_table <- tables[[2]] %>% clean_names()

second_table %>%
  mutate(adjusted_gross = parse_number(adjusted_gross)) %>%
  group_by(year) %>%
  summarise(total_adjusted_gross = sum(adjusted_gross)) %>%
  arrange(desc(total_adjusted_gross)) %>%
  ggplot() +
  geom_line(aes(x = year, y = total_adjusted_gross, group = 1))
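Two of the cleaning steps in the code above, clean_names() and parse_number(), can be tried in isolation without scraping anything. The sketch below assumes the readr and janitor packages are installed; the header names and dollar amounts are hypothetical stand-ins, not values taken from the Wikipedia page. It uses make_clean_names(), the vector-level helper in janitor that applies the same name cleaning to a character vector.

```r
library(readr)    # parse_number()
library(janitor)  # make_clean_names()

# Hypothetical raw column headers, as they might come back from html_table().
raw_names <- c("Rank", "Title", "Lifetime gross", "Lifetime gross", "Year")

# Lower-case the names, replace spaces with underscores,
# and make duplicates unique by appending a numeric suffix.
clean <- make_clean_names(raw_names)
print(clean)

# Hypothetical currency strings, as they might appear in a scraped table.
raw_gross <- c("$1,000", "$2,500,000", "$936,662,225")

# Strip the currency symbol and thousands separators, returning numbers.
gross <- parse_number(raw_gross)
print(gross)
```

The numeric suffix added to repeated headers is one plausible reason a column such as lifetime_gross_2 shows up after cleaning: a second column sharing the same header text would be renamed that way.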