This is an example of how to scrape Google News with the awesome rvest package.
This post is a solution to a question from our WhatsApp group, blackbeltR. A user came up with this problem, and I decided to help. It was a cool challenge, so why not?
A great deal of the basic ideas comes from his own code. I kept it and added a few things to get the code working.
First off, you should take a look at the Google News website HERE, which I reproduce below:
You may notice, on the right side of the page, that we are using the Google Chrome DevTools, which we use to identify the HTML nodes we need. You can open this tool by hitting the F12 key. The HTML nodes are then passed as arguments to the html_nodes() function.
Basically, the idea is to extract the communication vehicle (vehicle), the time elapsed since the news was published (time), and the main headline (headline).
The code and comments are presented below:
# loading the packages:
library(dplyr)   # for pipes and the data_frame function
library(rvest)   # web scraping
library(stringr) # to deal with strings and to clean up our data
# extracting the whole website
google <- read_html("https://news.google.com/")
# extracting the com vehicles
# we pass the nodes to html_nodes() and extract the text
# we use stringr to drop the icon labels, which are not real content
vehicle_all <- google %>%
  html_nodes("div div div main c-wiz div div div article div div div") %>%
  html_text() %>%
  str_subset("^(more_vert|share|bookmark_border)$", negate = TRUE)
vehicle_all[1:10] # take a look at the first ten
##  [1] "The New York Times"  "Vox.com"             "Wall Street Journal"
##  [4] "The New York Times"  "Opinion"             "Opinion"
##  [7] "The Washington Post" "Opinion"             "Opinion"
## [10] "CNN"
# extracting the time elapsed
time_all <- google %>%
  html_nodes("div article div div time") %>%
  html_text()
time_all[1:10] # take a look at the first ten
##  [1] "2 hours ago"  "today"        "4 hours ago"  "yesterday"
##  [5] "yesterday"    "2 hours ago"  "today"        "one hour ago"
##  [9] "one hour ago" "today"
# extracting the headlines
# and using stringr for cleaning
headline_all <- google %>%
  html_nodes("article") %>%
  html_text() %>%
  str_split("(?<=[a-z0-9!?\\.])(?=[A-Z])")
# str_split("(?<=[a-z0-9áéíóú!?\\.])(?=[A-Z])") # for Google News in Portuguese
headline_all <- sapply(headline_all, function(x) x[1]) # extract only the first element of each split
headline_all[1:10] # take a look at the first ten
##  [1] "As Government Reopens, the New Congress Tries to Begin Again"
##  [2] "Government shutdown 2.0: Trump is willing to do it, Mick Mulvaney says"
##  [3] "Federal Employees Head Back to Work With Payday Still Uncertain"
##  [4] "Opinion | The Real Wall Isn't at the Border"
##  [5] "Analysis | Why the shutdown ended — and what to watch for now"
##  [6] "Kamala Harris officially launches 2020 presidential campaign"
##  [7] "Extramarital affair with Kamala Harris? Former San Francisco mayor, 84, admits it happened"
##  [8] "Kamala Harris hits Trump, promises progressive change in presidential campaign kick-off"
##  [9] "Kamala Harris emerges as a 2020 front-runner, but is that a good thing?"
## [10] "Will Kamala Harris have the support of black women? Don't assume that"
In this last case we used a regular expression (regex) to clean up the data, separating the actual headline from the complementary phrases collapsed onto it. One case remains tricky: a phrase ending in uppercase letters, such as "NSA" (the National Security Agency), collapsed with a following phrase that begins with an uppercase letter, such as the article "A" ("…with NSAA agent said…"). We still have to think of a better way to split these cases, but the current result is quite satisfactory for now.
?<= is called a "lookbehind", while ?= is called a "lookahead". These "lookaround" expressions let us match a position whose surrounding text fits a pattern, without consuming any characters. In our case, the idea is to split the string at any point where a lowercase letter, number, exclamation point, period, or question mark is collapsed with an uppercase letter: the lookbehind (?<=[a-z0-9!?\\.]) asserts that one of those characters comes just before the split point, and the lookahead (?=[A-Z]) asserts that an uppercase letter comes just after it.
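To see the lookaround split in isolation, here is a minimal sketch using base R's strsplit() with perl = TRUE (so it runs without stringr); the sample strings are made up for illustration:

# A collapsed pair of headlines: the period runs straight into an
# uppercase "S", so the lookaround pattern splits exactly at that boundary.
collapsed <- "The shutdown is over.Senators React To The Deal"
strsplit(collapsed, "(?<=[a-z0-9!?\\.])(?=[A-Z])", perl = TRUE)[[1]]
# [1] "The shutdown is over."      "Senators React To The Deal"

# The known failure case: an all-caps acronym followed by another
# capitalized word gives an uppercase-uppercase boundary, which the
# lookbehind [a-z0-9!?\.] cannot match, so no split happens.
sticky <- "He met with NSAAgents confirmed the report"
strsplit(sticky, "(?<=[a-z0-9!?\\.])(?=[A-Z])", perl = TRUE)[[1]]
# [1] "He met with NSAAgents confirmed the report"

Because the lookarounds consume nothing, the characters on both sides of the split point survive intact in the resulting pieces.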
Before we finish, we have to clean up our data. It is common to pick up garbage in the process, such as data related to "fact checking", a section on the right side of the page. As a result, the three vectors we have created may end up with different lengths. Therefore, we use the shortest of them as the base and simply drop the extra entries from the other two.
# finding the smallest vector
min_length <- min(sapply(list(vehicle_all, time_all, headline_all), length))

# cutting
vehicle_all  <- vehicle_all[1:min_length]
time_all     <- time_all[1:min_length]
headline_all <- headline_all[1:min_length]
And we have our final data frame:
df_news <- data_frame(vehicle_all, time_all, headline_all)
df_news
## # A tibble: 160 x 3
##    vehicle_all      time_all   headline_all
##    <chr>            <chr>      <chr>
##  1 The New York Ti… 2 hours a… As Government Reopens, the New Congress Tr…
##  2 Vox.com          today      Government shutdown 2.0: Trump is willing …
##  3 Wall Street Jou… 4 hours a… Federal Employees Head Back to Work With P…
##  4 The New York Ti… yesterday  Opinion | The Real Wall Isn't at the Border
##  5 Opinion          yesterday  Analysis | Why the shutdown ended — and wh…
##  6 Opinion          2 hours a… Kamala Harris officially launches 2020 pre…
##  7 The Washington … today      Extramarital affair with Kamala Harris? Fo…
##  8 Opinion          one hour … Kamala Harris hits Trump, promises progres…
##  9 Opinion          one hour … Kamala Harris emerges as a 2020 front-runn…
## 10 CNN              today      Will Kamala Harris have the support of bla…
## # ... with 150 more rows