Retrieve & process TV News chyrons with newsflash

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page where you can interactively work with the API if that’s how you roll.

Since my newsflash? package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed to be a good home for a Third Eye interface to live.

Basic usage

You can read long-form details of the Third Eye service on their site. The TLDR is that they provide two feeds:

  • a “raw” one which has massive duplicates and tons of errors
  • a “clean” one that filters out duplicates, cleans up the text and is much better to work with

You can retrieve either with newsflash::read_chyrons() but the default is to use the clean feed. If you are studying text processing and or NLP/text-cleanup via machine learning, then the raw feed may be very interesting for you. I suspect most data journalists will want to use the clean feed that also powers the IA chyron twitter bots.

Since it’s the Internet Archive, they’re awesome at providing metadata about their data. Heck, even their metadata has metadata about metadata. We can use the fact that they provide a metadata feed to enable listing available chyron archive dates:

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

list_chyrons()
## # A tibble: 61 x 3
##            ts    type     size
##        <date>   <chr>    <dbl>
##  1 2017-09-30 cleaned   539061
##  2 2017-09-30     raw 17927121
##  3 2017-09-29 cleaned   635812
##  4 2017-09-29     raw 19234407
##  5 2017-09-28 cleaned   414067
##  6 2017-09-28     raw 12663606
##  7 2017-09-27 cleaned   613474
##  8 2017-09-27     raw 20442644
##  9 2017-09-26 cleaned   659930
## 10 2017-09-26     raw 19942951
## # ... with 51 more rows

Reading the chyrons in only requires passing in a Date object or a YYYY-mm-dd format date string:

chyrons <- read_chyrons("2017-09-30")


glimpse(chyrons)
## Observations: 2,729
## Variables: 5
## $ ts       <dttm> 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30...
## $ channel  <chr> "BBCNEWS", "CNNW", "FOXNEWSW", "BBCNEWS", "CNNW", "MSNBCW", "BBCNEWS", "CN...
## $ duration <int> 18, 42, 26, 10, 47, 19, 14, 62, 26, 11, 45, 17, 35, 11, 62, 32, 35, 35, 15...
## $ details  <chr> "BBCNEWS_20170929_233000_Race_and_Pace/start/1800", "CNNW_20170929_230000_...
## $ text     <chr> "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT", "TRUMP BRAGS...

You get five columns in a data frame on a successful retrieval:

  • ts (POSIXct) chyron timestamp
  • channel (character) news channel the chyron appeared on
  • duration (integer) see Description
  • details (character) Internet Archive details path
  • text (character) the chyron text

We’ll talk about the details path in a bit. The text is likely what you want, so here’s a sample:

head(chyrons$text, 30)
##  [1] "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT"                                                                                                                                                                                            
##  [2] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nAnderson Cooper"                                                                                                                                      
##  [3] "ALIFORNIA STUDENT SWIPES 'MAGA' HAT"                                                                                                                                                                                                                   
##  [4] "US HEALTH SECRETARY QUITS. Mr Price apologised for use O126 private \\ufb02ights since May\\nUS HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nLAURA BICKER. Washington"                                                     
##  [5] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
##  [6] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
##  [7] "US HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nUS HEALTH SECRETARY QUITS. Government otficials required to take commercial \\ufb02ights\\nUS HEALTH SECRETARY QUITS. Scandal emerged after..."                            
##  [8] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
##  [9] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nTRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                        
## [10] "US HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for use O126..."                       
## [11] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nHHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL. . Ryan Nobles (J\\\\N Washington Correspondent"                                                                                                       
## [12] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I"                                                                                                                                                                                                      
## [13] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT"                                                                                                                                                              
## [14] "US HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for..."                                
## [15] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
## [16] "BIZARRE LIBERAL MELTDOWNS I\\nTUCKER & THE CAT IN THE HAT I. . _ < 'rnnwnn FAD! cnm tint-\\ufb01nk"                                                                                                                                                    
## [17] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [18] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
## [19] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nBIZARRE LIBERAL MELTDOWN I"                                                                                                                                                                         
## [20] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                                                                     
## [21] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                                                                        
## [22] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nSCHOOL LIBRARIAN REJECTS DR. SEUSS. BOOKS GIFTED BY MELANIA TRUMP. . _' tnnx'nkr"                                                                                                                   
## [23] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [24] "YEMEN WAR CRIMES. UN Human Rights Council agrees on investigation\\nINIGO MENDEZ DE VIGO. Spanish Education Minister"                                                                                                                                  
## [25] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\"\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\". . Mavor Carmen Yulin Cruz San Juan,..."
## [26] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS"                                                                                                                                                                                                        
## [27] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
## [28] "TRUMP ASIA TOUR. US President to visit Japan, South Korea and China\\nYEMEN WAR CRIMES. UN Human Rights Council agrees on investigation"                                                                                                               
## [29] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE\\nSAN JUAN MAYOR: \\\"MAD AS HELLII OVER HURRICANE RESPONSE. . Dr. Saniav Gupta (J\\\\N Chief Medical Correspondent"                                                                       
## [30] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE"

Be warned: even the “clean” text is often kinda messy.

For now, there are only four channels, so it’s easy to show a quick example. Since chyrons are supposed to be super-important things you need to know NOW, let’s see how many times Puerto Rico was mentioned on them in the above archive. NOTE: This is a quick example, not a thorough one. I’m searching for some key letter combinations to see just mentions of something looking like “Puerto Rico”. “San Juan” and other text that might be associated with the topic aren’t being considered for this toy example.

mutate(
  chyrons, 
  hour = lubridate::hour(ts),
  text = tolower(text),
  mention = grepl("erto ri", text)
) %>% 
  filter(mention) %>% 
  count(hour, channel) %>% 
  ggplot(aes(hour, n)) +
  geom_segment(aes(xend=hour, yend=0)) +
  scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
                     labels=sprintf("%02d:00", seq(0, 23, 6))) +
  scale_y_continuous(name="# Chyrons", limits=c(0,30)) +
  facet_wrap(~channel, scales="free") +
  labs(title="Chyrons mentioning 'Puerto Rico' per hour per channel",
       subtitle="Chyron date: 2017-09-30",
       caption="Source: Internet Archive Third Eye project & <github.com/hrbrmstr/newsflash>") +
  theme_ipsum_rc(grid="Y")

Details, details, details

Entries in details column look like this:

head(chyrons$details)
## [1] "BBCNEWS_20170929_233000_Race_and_Pace/start/1800"                   
## [2] "CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600"              
## [3] "FOXNEWSW_20170929_230000_The_Story_With_Martha_MacCallum/start/3600"
## [4] "BBCNEWS_20170930_000000_BBC_News/start/60"                          
## [5] "CNNW_20170930_000000_Anderson_Cooper_360/start/60"                  
## [6] "MSNBCW_20170930_000000_All_In_With_Chris_Hayes/start/60"

They are path fragments that can be attached to a URL prefix to see the news clip from that station on that day/time. newsflash::view_clip() does that work for you:

view_clip(chyrons$details[2])

The URL for that is https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660 in the event the iframe load failed or you really like being annoyed with cable news shows.

FIN

Grab the package on GitHub, kick the tyres and don’t hesitate to file issues, questions or jump on board with package development. There’s plenty of room for improvement before it hits CRAN and your ideas are most welcome.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)