Exploring News Coverage With newsflash

February 1, 2017
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I was enthused to see a mention of this on the GDELT blog since I’ve been working on an R package dubbed newsflash to work with the API that the form front-ends.

Given the current climate, I feel compelled to note that I’m neither a Clinton supporter/defender/advocate nor a 🍊 supporter/defender/advocate) in any way, shape or form. I’m only using the example for replication and I’m very glad the article author stayed (pretty much) non-partisan apart from some color commentary about the predictability of network coverage of certain topics.

For now, the newsflash package is configured to grab raw count data, not the percent summaries since folks using R to grab this data probably want to do their own work with it. I used the following to try to replicate the author’s findings:

library(newsflash)
library(ggalt) # github version
library(hrbrmisc) # github only
library(tidyverse)
starts <- seq(as.Date("2015-01-01"), (as.Date("2017-01-26")-30), "30 days")
ends <- as.character(starts + 29)
ends[length(ends)] <- ""

pb <- progress_estimated(length(starts))
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y)
})

clinton_timeline <- map_df(emails, "timeline")

sum(clinton_timeline$value)
## [1] 34778

count(clinton_timeline, station, wt=value, sort=TRUE) %>%
  mutate(pct=n/sum(n), pct_lab=sprintf("%s (%s)", scales::comma(n), scales::percent(pct)),
         station=factor(station, levels=rev(station))) -> timeline_df

timeline_df

## # A tibble: 7 × 4
##             station     n         pct        pct_lab
##              <fctr> <int>       <dbl>          <chr>
## 1          FOX News 14807 0.425757663 14,807 (42.6%)
## 2      FOX Business  7607 0.218730232  7,607 (21.9%)
## 3               CNN  5434 0.156248203  5,434 (15.6%)
## 4             MSNBC  4413 0.126890563  4,413 (12.7%)
## 5 Aljazeera America  1234 0.035482201   1,234 (3.5%)
## 6         Bloomberg   980 0.028178734     980 (2.8%)
## 7              CNBC   303 0.008712404     303 (0.9%)

NOTE: I had to break up the queries since the bulk one across the two dates bump up against the API limits and may be providing helper functions for that before CRAN release.

While my package matches the total from the news article and sample query: 34,778 results my percentages are different since it’s the percentages across the raw counts for the included stations. “Percent of Sentences” (result “n” divided by the number of all sentences for each station in the time frame) — which the author used — seems to have some utility so I’ll probably add that as a query parameter or add a new function.

Tidy news text

The package also is designed to work with the tidytext package (it’s on CRAN) and provides a top_text() function which can return a tidytext-ready tibble or a plain character vector for use in other text processing packages. If you were curious as to whether this API has good data behind it, we can take a naive peek with the help of tidytext:

library(tidytext)

tops <- map_df(emails, top_text)
anti_join(tops, stop_words) %>% 
  filter(!(word %in% c("clinton", "hillary", "server", "emails", "mail", "email",
                       "mails", "secretary", "clinton's", "secretary"))) %>% 
  count(word, sort=TRUE) %>% 
  print(n=20)

## # A tibble: 26,861 × 2
##             word     n
##            <chr> <int>
## 1        private 12683
## 2     department  9262
## 3            fbi  7250
## 4       campaign  6790
## 5     classified  6337
## 6          trump  6228
## 7    information  6147
## 8  investigation  5111
## 9         people  5029
## 10          time  4739
## 11      personal  4514
## 12     president  4448
## 13        donald  4011
## 14    foundation  3972
## 15          news  3918
## 16     questions  3043
## 17           top  2862
## 18    government  2799
## 19          bill  2698
## 20      reporter  2684

I’d say the API is doing just fine.

Fin

The package also has some other bits from the API in it and if this has piqued your interest, please leave all package feature requests or problems as a github issue.

Many thanks to the Internet Archive / GDELT for making this API possible. Data like this would be amazing in any time, but is almost invaluable now.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)