Teasing Out Top Daily Topics with GDELT’s Television Explorer

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

Earlier this year, the GDELT Project released their Television Explorer, which enables API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables”, which summarise the “top” topics and phrases across news stations every fifteen minutes. You should read their (long-ish) intro, as there are many caveats to the data source. I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.
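To make the 15-minute sequencing and the GMT conversion concrete, here is a minimal base R sketch of the idea. It is not the internals of top_trending_range(); the explicit "America/New_York" zone is an assumption, chosen only so the example is reproducible.

```r
# The same window used in this post, pinned to an explicit zone
# (the zone is an assumption made for reproducibility).
from <- as.POSIXct("2017-09-07 00:00:00", tz = "America/New_York")
to   <- as.POSIXct("2017-09-09 12:00:00", tz = "America/New_York")

# One timestamp per 15-minute slot, both endpoints included.
slots <- seq(from, to, by = "15 min")
length(slots)
## [1] 241

# The same instants displayed in GMT; only the display zone changes.
slots_gmt <- slots
attr(slots_gmt, "tzone") <- "GMT"
format(slots_gmt[1], "%Y-%m-%d %H:%M:%S", usetz = TRUE)
## [1] "2017-09-07 04:00:00 GMT"
```

A 60-hour window therefore covers 241 slots if both ends are included, so a result with fewer rows would be consistent with some files being unavailable.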

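library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes) # theme_ipsum_rc(), used for the chart
library(tidyverse)  # dplyr, tidyr and ggplot2, used below

from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")

# Fetch every 15-minute "top trending" table in the window
trends <- top_trending_range(from, to)

glimpse(trends)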

## Observations: 233
## Variables: 5
## $ ts                       <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics  <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics  <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics       <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...
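Because files are sometimes missing, it can help to check which 15-minute slots came back. The sketch below uses an invented `observed` vector rather than real GDELT results, and compares it against the full grid of expected timestamps.

```r
# Expected grid: eight consecutive 15-minute slots in GMT.
expected <- seq(as.POSIXct("2017-09-07 00:00:00", tz = "GMT"),
                by = "15 min", length.out = 8)

# Hypothetical retrieval result with two slots missing (invented data).
observed <- expected[-c(3, 6)]

# Compare the underlying numeric instants so time zone attributes
# cannot affect the match.
missing_slots <- expected[!(as.numeric(expected) %in% as.numeric(observed))]
format(missing_slots, "%H:%M")
## [1] "00:30" "01:15"
```

The same comparison could be run against the `ts` column of a real result, provided the expected grid is built over the same window.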

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

dplyr::glimpse(trends$station_top_topics[[1]])

## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics  <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...

Each individual data frame has the top topics of each tracked station.

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not a 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

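# Flatten the nested list-columns, then count topic occurrences per day
select(trends, ts, station_top_topics) %>% 
  unnest(station_top_topics) %>%   # one row per timestamp and station
  unnest(Topics) %>%               # one row per timestamp, station and topic
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>% 
  group_by(day) %>% 
  top_n(25, n) %>%                 # ties at the cutoff can return more than 25 rows
  arrange(day, desc(n)) %>%        # sort first so any surplus is trimmed from the low end
  slice(1:25) %>% 
  mutate(rnk = 25:1) -> top_25_trends   # rnk 25 = most mentioned, so it plots at the top

glimpse(top_25_trends)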

## Observations: 75
## Variables: 4
## $ day   <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n     <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk   <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...
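The double unnest is easier to follow on data small enough to check by hand. The sketch below builds a hypothetical two-snapshot stand-in for `station_top_topics` (stations and topics are invented, and it keeps the top 2 rather than the top 25) and runs the same kind of reshaping and counting.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for two 15-minute snapshots (invented values).
toy <- tibble(
  ts = as.POSIXct(c("2017-09-07 00:00:00", "2017-09-07 00:15:00"), tz = "GMT"),
  station_top_topics = list(
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "florida"), c("irma", "daca"))),
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "harvey"), c("florida", "irma")))
  )
)

# First unnest: one row per snapshot and station.
# Second unnest: one row per snapshot, station and topic.
toy_long <- toy %>%
  unnest(station_top_topics) %>%
  unnest(Topics) %>%
  rename(station = Station, topic = Topics)

# Count topics per day, keep the two most frequent, rank them 2..1.
toy_top <- toy_long %>%
  mutate(day = as.Date(ts)) %>%
  count(day, topic) %>%
  group_by(day) %>%
  arrange(desc(n), .by_group = TRUE) %>%
  slice(1:2) %>%
  mutate(rnk = 2:1) %>%
  ungroup()

toy_top
```

Here "irma" appears four times and "florida" twice, so they take ranks 2 and 1 respectively, mirroring how the highest count gets the highest `rnk` in the full pipeline.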

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05))

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!
