Scraping a website with 5 lines of R code

Posted on January 24, 2018 by David Smith in R bloggers

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]


In what is rapidly becoming a series (cool things you can do with R in a tweet), Julia Silge demonstrates scraping the list of members of the US House of Representatives from Wikipedia in just 5 R statements:

library(rvest)
library(tidyverse)

h <- read_html("https://t.co/gloY1eErBn")

reps <- h %>%
  html_node("#mw-content-text > div > table:nth-child(18)") %>%
  html_table()

reps <- reps[,c(1:2,4:9)] %>%
  as_tibble()
— Julia Silge (@juliasilge) January 12, 2018

Since Twitter munges the URL in the third line when you cut-and-paste, here's a plain-text version of Julia's code:

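library(rvest)      # read_html(), html_node(), html_table()
library(tidyverse)  # %>% and as_tibble()

# Download and parse the Wikipedia page
h <- read_html("https://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives")

# Select the members table with a CSS selector and convert it to a data frame
reps <- h %>%
  html_node("#mw-content-text > div > table:nth-child(18)") %>%
  html_table()

# Keep columns 1-2 and 4-9 (dropping column 3) and convert to a tibble
reps <- reps[,c(1:2,4:9)] %>% as_tibble()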

And sure enough, here's what the reps object looks like in the RStudio viewer:

As Julia notes, it's not perfect, but you're still 95% of the way to gathering data from a page intended for human rather than computer consumption. Impressive!
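
The table:nth-child(18) selector depends on how the Wikipedia page happened to be laid out, so it may not match the same table if the page changes. To try the same read_html / html_node / html_table / as_tibble steps without depending on a live page, here is a minimal sketch that runs on a small inline HTML string. The HTML, the "content" id, and the rows are all made up for illustration and are not taken from Wikipedia.

```r
library(rvest)   # read_html(), html_node(), html_table(), and the %>% pipe
library(tibble)  # as_tibble()

# Made-up HTML standing in for a real page (assumption: not real data).
# The table is the 2nd child element of the div, after the paragraph.
html <- '
<div id="content">
  <p>Some introductory text.</p>
  <table>
    <tr><th>District</th><th>Member</th><th>Photo</th><th>Party</th></tr>
    <tr><td>Example 1</td><td>A. Person</td><td>(photo)</td><td>X</td></tr>
    <tr><td>Example 2</td><td>B. Person</td><td>(photo)</td><td>Y</td></tr>
  </table>
</div>'

# Parse the literal HTML string instead of downloading a URL
page <- read_html(html)

# Pick the table by its position among the div's children, then convert it
toy <- page %>%
  html_node("#content > table:nth-child(2)") %>%
  html_table()

# Drop the unwanted third column, as in the original code, and make a tibble
toy <- toy[, c(1:2, 4)] %>% as_tibble()

print(toy)
```

Changing nth-child(2) to a position that does not hold a table makes html_node() return a missing node, which is a quick way to see why a position-based selector is fragile on a page that other people edit.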
