Scraping Google News with ‘rvest’

August 21, 2018

[This article was first published on R on ALLAN V. C. QUADROS, and kindly contributed to R-bloggers.]

This is an example of how to scrape Google News with the awesome rvest package.

This post is a solution for a question from our WhatsApp group, blackbeltR. A user came up with this problem and I decided to help him. It was a cool challenge, so why not?

Most of the basic ideas come from his own code. I just kept it and added a few things in order to get the code working.

First off, you should take a look at the Google News website (https://news.google.com/), shown in a screenshot in the original post.


You may notice, on the right side of the screenshot, that we are using the Google Chrome dev-tools. We use them to identify the HTML nodes we need. You can open the tool by hitting the F12 key. The nodes are then passed, as CSS selectors, to the rvest functions.
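To make the idea of passing nodes to rvest concrete, here is a minimal sketch on a made-up HTML fragment rather than on the live site. The tag structure, the class name `source` and the outlet names are assumptions for illustration only; the real Google News markup is different and changes over time.

```r
library(rvest)

# A made-up page fragment that loosely imitates a news listing.
# read_html() accepts a literal HTML string as well as a URL.
page <- read_html('
  <main>
    <article><a class="source">Example Times</a><time>2 hours ago</time></article>
    <article><a class="source">Sample Post</a><time>today</time></article>
  </main>')

# The selector describes the path to the nodes we want, from outer to inner.
sources <- page %>% html_nodes("main article a.source") %>% html_text()
times   <- page %>% html_nodes("article time") %>% html_text()

sources
times
```

The selectors read from left to right, exactly as you would walk down the node tree in the dev-tools panel.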

Basically, the idea is to extract the communication vehicle, i.e. the news outlet (vehicle), the time elapsed since the news was published (time), and the main headline (headline).

The code and comments are presented below:

# loading the packages:
library(dplyr) # for pipes and for building the final data frame
library(rvest) # webscraping
library(stringr) # to deal with strings and to clean up our data
# extracting the whole website
google <- read_html("https://news.google.com/")
# extracting the com vehicles
# we pass the nodes in html_nodes and extract the text from the last one 
# we use stringr to drop empty strings and the icon labels, matched exactly
# (the negate argument needs stringr 1.4.0 or later)
vehicle_all <- google %>% 
  html_nodes("div div div main c-wiz div div div article div div div") %>% 
  html_text() %>%
  str_subset("^(more_vert|share|bookmark_border)?$", negate = TRUE)

vehicle_all[1:10] # take a look at the first ten
##  [1] "The New York Times"  "Vox.com"             "Wall Street Journal"
##  [4] "The New York Times"  "Opinion"             "Opinion"            
##  [7] "The Washington Post" "Opinion"             "Opinion"            
## [10] "CNN"
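A note on filtering out the icon labels (more_vert, share, bookmark_border): a pattern such as `[^share]` is a negated character class, so it keeps every string that has at least one character outside the set s, h, a, r, e. It is not a "does not contain share" test. The sketch below uses base R and a made-up vector (the entry "hear" is a hypothetical outlet name) to compare it with an exact match.

```r
# Made-up scraped labels; "hear" stands in for a legitimate outlet name.
labels <- c("The New York Times", "more_vert", "share",
            "bookmark_border", "hear", "")

# Negated character class: TRUE if any character is NOT one of s, h, a, r, e.
kept_by_class <- labels[grepl("[^share]", labels)]

# Exact match against the unwanted labels (and the empty string).
unwanted <- c("more_vert", "share", "bookmark_border", "")
kept_by_exact <- labels[!(labels %in% unwanted)]

kept_by_class
kept_by_exact
```

The character class drops "share" but also drops "hear", and it lets the other two icon labels through, while the exact match removes only the unwanted entries.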
# extracting the time elapsed
time_all <- google %>% html_nodes("div article div div time") %>% html_text()

time_all[1:10] # take a look at the first ten
##  [1] "2 hours ago"  "today"        "4 hours ago"  "yesterday"   
##  [5] "yesterday"    "2 hours ago"  "today"        "one hour ago"
##  [9] "one hour ago" "today"
# extracting the headlines
# html_text() takes no selector: it returns all the text inside each article node
# and we use stringr for cleaning
headline_all <- google %>% html_nodes("article") %>% html_text() %>%
  str_split("(?<=[a-z0-9!?\\.])(?=[A-Z])")
  # str_split("(?<=[a-z0-9áéíóú!?\\.])(?=[A-Z])") # for Google News in Portuguese

headline_all <- sapply(headline_all, function(x) x[1]) # extract only the first elements


headline_all[1:10] # take a look at the first ten
##  [1] "As Government Reopens, the New Congress Tries to Begin Again"                              
##  [2] "Government shutdown 2.0: Trump is willing to do it, Mick Mulvaney says"                    
##  [3] "Federal Employees Head Back to Work With Payday Still Uncertain"                           
##  [4] "Opinion | The Real Wall Isn't at the Border"                                               
##  [5] "Analysis | Why the shutdown ended — and what to watch for now"                             
##  [6] "Kamala Harris officially launches 2020 presidential campaign"                              
##  [7] "Extramarital affair with Kamala Harris? Former San Francisco mayor, 84, admits it happened"
##  [8] "Kamala Harris hits Trump, promises progressive change in presidential campaign kick-off"   
##  [9] "Kamala Harris emerges as a 2020 front-runner, but is that a good thing?"                   
## [10] "Will Kamala Harris have the support of black women? Don't assume that"

In this last case we used a regular expression (regex) to clean up the data, separating the actual headline from the complementary text that follows it in the same node. The split is not perfect. When a headline ends with uppercase letters, such as “NSA” (the National Security Agency), and the next phrase begins with an uppercase letter, such as the article “A”, the two stay collapsed (“…with NSAA agent said…”), because the pattern only splits after a lowercase letter, a digit or punctuation. We have to think of a better way to split these cases, but the current result is quite satisfactory for now.

The expression (?<=...) is called a “lookbehind”, while (?=...) is called a “lookahead”. These “lookaround” expressions let us match a position that is preceded or followed by a given pattern, without consuming any characters. In our case, the idea is to split a string at the point where a lowercase letter, digit, exclamation point, question mark or period is collapsed with an uppercase letter, i.e. at a position that is preceded (?<=) by one of [a-z0-9!?\\.] and followed (?=) by one of [A-Z].
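The behaviour of the lookaround pattern can be checked on a few made-up strings. This is only a sketch: it uses base R's `sub()` with `perl = TRUE` instead of stringr so that it runs without extra packages, and it keeps only the text before the first split point, which is what taking the first element of the split amounts to. The headlines and outlet names are invented.

```r
# Position preceded by lowercase/digit/!/?/. and followed by an uppercase letter.
pattern <- "(?<=[a-z0-9!?.])(?=[A-Z])"

raw <- c(
  "Senate passes budget billThe Example Times2 hours ago",
  "Did the vote fail?Example Post",
  "Report cites NSAA spokesman declined to comment"  # uppercase meets uppercase
)

# Delete everything from the first split point onwards.
first_piece <- sub(paste0(pattern, ".*$"), "", raw, perl = TRUE)

first_piece
```

The first two strings are cut right after the headline. The third one is left untouched, because "NSA" followed by "A" never satisfies the lookbehind, which is the limitation described above.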

Before we finish, we have to clean up our data. It is common to collect garbage in the process, such as data related to “fact checking”, which is a section on the right side of the page. As a result, the three vectors we have created may have different lengths. Therefore, we take the length of the shortest one as the base and just delete the entries beyond this position in the other two vectors.

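# finding the length of the smallest vector
# (named n_min so that it does not mask the base function min)
n_min <- min(lengths(list(vehicle_all, time_all, headline_all)))

# cutting (seq_len() also behaves correctly if n_min is 0)
vehicle_all <- vehicle_all[seq_len(n_min)]
time_all <- time_all[seq_len(n_min)]
headline_all <- headline_all[seq_len(n_min)]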
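The cutting step can be wrapped in a small helper. The function name `truncate_to_shortest` and the toy vectors are hypothetical, used only to show the mechanics. Note that cutting by position only lines the columns up if the extra entries sit at the end of the longer vectors, which is an assumption you should check against the page.

```r
# Hypothetical helper: cut every vector to the length of the shortest one.
truncate_to_shortest <- function(...) {
  vectors <- list(...)
  n <- min(lengths(vectors))
  # seq_len(0) is empty, whereas 1:0 would be c(1, 0) and keep one element
  lapply(vectors, function(v) v[seq_len(n)])
}

trimmed <- truncate_to_shortest(
  vehicle  = c("A", "B", "C"),
  time     = c("today", "yesterday"),
  headline = c("h1", "h2", "h3", "h4")
)

lengths(trimmed)
```

All three elements of `trimmed` now have the length of the shortest input, so they can be combined into a data frame.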

And we have our final data frame:

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df_news <- tibble(vehicle_all, time_all, headline_all)

df_news
## # A tibble: 160 x 3
##    vehicle_all      time_all   headline_all                               
##    <chr>            <chr>      <chr>                                      
##  1 The New York Ti… 2 hours a… As Government Reopens, the New Congress Tr…
##  2 Vox.com          today      Government shutdown 2.0: Trump is willing …
##  3 Wall Street Jou… 4 hours a… Federal Employees Head Back to Work With P…
##  4 The New York Ti… yesterday  Opinion | The Real Wall Isn't at the Border
##  5 Opinion          yesterday  Analysis | Why the shutdown ended — and wh…
##  6 Opinion          2 hours a… Kamala Harris officially launches 2020 pre…
##  7 The Washington … today      Extramarital affair with Kamala Harris? Fo…
##  8 Opinion          one hour … Kamala Harris hits Trump, promises progres…
##  9 Opinion          one hour … Kamala Harris emerges as a 2020 front-runn…
## 10 CNN              today      Will Kamala Harris have the support of bla…
## # ... with 150 more rows
