R⁶ — Capturing [YouTube] Captions

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

(R⁶ == brief, low-expository posts)

@yoniceedee suggested I look at the Cambridge Analytica “whistleblower” testimony proceedings.

I value the resources @yoniceedee tosses my way (though they often send me down twisted paths like this one 🙂), but I really dislike spending any amount of time on YouTube and can consume text content much faster than even accelerated video playback.

Google auto-generated captions for that video. You can display them by clicking below the video on the right and enabling the transcript, which slowly (well, in my frame of reference) loads into the upper-right. That’s still sub-optimal, since we need to be on the YouTube page to read and scroll. There’s no “export” option, so my initial instinct was to go to Developer Tools, look for the https://www.youtube.com/service_ajax?name=getTranscriptEndpoint URL, “Copy the Response” to the clipboard, save it to a file, and then do some JSON/list wrangling (the transcript JSON URL is in the snippet below):

library(tidyverse)

trscrpt <- jsonlite::fromJSON("https://rud.is/dl/ca-transcript.json")

runs <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$formattedStartOffset$runs
cues <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$cues

# tibble() supersedes the deprecated data_frame()
tibble(
  mark = map_chr(runs, ~.x$text),
  text = map_chr(cues, ~.x$transcriptCueRenderer$cue$runs[[1]]$text)
) %>% 
  separate(mark, c("minute", "second"), sep=":", remove = FALSE, convert = TRUE)
## # A tibble: 3,247 x 4
##    mark  minute second text                                    
##    <chr>  <int>  <int> <chr>                                   
##  1 00:00      0      0 all sort of yeah web of things if it's a
##  2 00:02      0      2 franchise then there's a kind of        
##  3 00:03      0      3 ultimately there's a there's a there's a
##  4 00:05      0      5 coordinator of that franchise or someone
##  5 00:07      0      7 who's a you got a that franchise is well
##  6 00:09      0      9 well when I was there that was Alexander
##  7 00:13      0     13 Nixon Steve banning but that's that's a 
##  8 00:16      0     16 question you should be asking aiq yeah  
##  9 00:18      0     18 yeah and just got to a IQ and the GSR   
## 10 00:24      0     24 state from gts-r that's other Hogan data
## # ... with 3,237 more rows
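
Splitting `mark` on ":" into two columns works for the rows shown above, but it presumably stops fitting if the offsets switch to an "H:MM:SS" form once a video passes the one-hour point (an assumption; I have not checked how the transcript endpoint formats those). A minimal base R sketch that turns either form into total seconds, using a hypothetical helper named `mark_to_seconds()`:

```r
# Hypothetical helper (not part of any package): convert "MM:SS" or
# "H:MM:SS" offset strings into total seconds.
mark_to_seconds <- function(mark) {
  vapply(strsplit(mark, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    # weight the pieces right-to-left: seconds, minutes, hours
    sum(rev(parts) * 60^(seq_along(parts) - 1))
  }, numeric(1))
}

mark_to_seconds(c("00:24", "59:59", "1:02:03"))
## [1]   24 3599 3723
```

A single numeric offset like this is also easier to sort and filter on than separate minute and second columns.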

But then I remembered YouTube has an API for this and threw together a quick script to grab the captions that way as well:

# the API needs these scopes

c(
  "https://www.googleapis.com/auth/youtube.force-ssl",
  "https://www.googleapis.com/auth/youtubepartner"
) -> scope_list

# oauth dance

httr::oauth_app(
  appname = "google",
  key = Sys.getenv("GOOGLE_APP_KEY"),       # OAuth client ID
  secret = Sys.getenv("GOOGLE_APP_SECRET")  # OAuth client secret
) -> captions_app

httr::oauth2.0_token(
  endpoint = httr::oauth_endpoints("google"),
  app = captions_app,
  scope = scope_list,
  cache = TRUE
) -> google_token

# list the available captions for this video
# (captions can be in one or more languages)

httr::GET(
  url = "https://www.googleapis.com/youtube/v3/captions",
  query = list(
    part = "snippet",
    videoId = "f2Sxob3fl0k" # the v=string in the YouTube URL
  ),
  httr::config(token = google_token)
) -> caps_list

# I'm cheating since I know there's only one but you'd want
# to introspect `caps_list` before blindly doing this for 
# other videos.

httr::GET(
  url = sprintf(
    "https://www.googleapis.com/youtube/v3/captions/%s",
    httr::content(caps_list)$items[[1]]$id
  ),
  httr::config(token = google_token)
) -> caps

# strangely enough, the JSON response "feels" better than this
# one, though this is a standard format that's parseable quite well.

cat(rawToChar(httr::content(caps)))
## 0:00:00.000,0:00:03.659
## all sort of yeah web of things if it's a
## 
## 0:00:02.490,0:00:05.819
## franchise then there's a kind of
## 
## 0:00:03.659,0:00:07.589
## ultimately there's a there's a there's a
## 
## 0:00:05.819,0:00:09.660
## coordinator of that franchise or someone
## 
## 0:00:07.589,0:00:13.139
## who's a you got a that franchise is well
## 
## 0:00:09.660,0:00:16.230
## well when I was there that was Alexander
## ...
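
The text above looks like the SubViewer (SBV) layout: a "start,end" timestamp line, one or more lines of caption text, then a blank line. Here is a minimal base R sketch for turning it into a data frame, assuming that layout holds for the whole response. `parse_sbv()` is a hypothetical helper, and the sample string is copied from the output above rather than fetched from the API:

```r
# Sample text copied from the output above; with a live response this
# would be rawToChar(httr::content(caps)) instead.
sbv <- "0:00:00.000,0:00:03.659
all sort of yeah web of things if it's a

0:00:02.490,0:00:05.819
franchise then there's a kind of

0:00:03.659,0:00:07.589
ultimately there's a there's a there's a
"

# Hypothetical helper: parse SBV-style text into one row per cue.
parse_sbv <- function(txt) {
  # normalise Windows line endings, if any
  txt <- gsub("\r", "", txt, fixed = TRUE)
  # cues are separated by one or more blank lines
  blocks <- strsplit(trimws(txt), "\n[ \t]*\n+")[[1]]
  rows <- lapply(blocks, function(block) {
    lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
    # first line of each cue is "start,end"
    times <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
    data.frame(
      start = times[1],
      end = times[2],
      # a cue can span several lines; join them with a space
      text = paste(lines[-1], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

cues_df <- parse_sbv(sbv)
str(cues_df)
```

The timestamps are left as character strings here; converting them to seconds or to a duration class is a separate step that depends on what you want to do with them.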

Neither a reflection on active memory nor a quick DuckDuckGo search (I try not to use Google Search anymore) seemed to point to an existing R resource for this, hence the quick post in the event the snippet is helpful to anyone else.

If you do know of an R package/snippet that does this already, please shoot a note into the comments so others can find it.
