How did Axios rectangle Trump’s PDF schedule? A try with R


Last week, Axios published a very interesting piece reporting on Trump's private schedule thanks to an insider's leak. The headlines were all about Trump spending more than 60% of his time in "executive time", which admittedly was indeed the most important aspect of the story. I, however, also got curious about Axios' work to go from the PDF schedules to the spreadsheet they made public. In this post, I'll have a go at using rOpenSci's Jeroen Ooms' pdftools package and some data-wrangling stubbornness of mine to try and rectangle Trump's PDF schedules.

Available data, available software

Axios published both the spreadsheet they made from the leaked data, and a PDF corresponding to the schedules, retyped in the format in which White House staff received them. I tried finding out how they went from the latter to the former, but this was apparently not reported. Quite a niche topic, I agree. Here's the first page of the untidy PDF below.

pdf_file <- "data/Axios-President-Donald-Trump-Private-Schedules.pdf"
library("magrittr")
pdftools::pdf_render_page(pdf_file, page = 1) %>%
  magick::image_read()

Seeing the PDF format above reminded me of the cool new pdftools feature Jeroen Ooms announced back in December: low level text extraction.

# High level text extraction
pdftools::pdf_text(pdf_file)[1]

## [1] "                       SCHEDULE OF THE PRESIDENT\n                          Wednesday, November 7, 2018\n8:00 AM        EXECUTIVE TIME\n(3 hr)         Location:           Oval Office\n11:00 AM       MEETING WITH THE CHIEF OF STAFF\n(30 min)       Location:           Oval Office\n11:30 AM       EXECUTIVE TIME\n(1 hr)         Location:           Oval Office\n12:30 PM       LUNCH\n(1 hr)         Location:           Private Dining Room\n1:30 PM        EXECUTIVE TIME\n(3 hr, 30 min) Location:           Oval Office\nRON:           The White House\n"

# Low level text extraction
head(pdftools::pdf_data(pdf_file)[[1]])

##   width height   x  y space       text
## 1    60     12 220 72  TRUE   SCHEDULE
## 2    15     12 283 72  TRUE         OF
## 3    23     12 301 72  TRUE        THE
## 4    62     12 327 72 FALSE  PRESIDENT
## 5    54     12 236 99  TRUE Wednesday,
## 6    46     12 293 99  TRUE   November

While the pdftools::pdf_text() function returns "a character vector of length equal to the number of pages in the file", pdftools::pdf_data() returns a list of as many data.frames as there are pages, each one indicating the coordinates and width of each text piece. So powerful! Trump's schedule appeared to be a perfect use case for this functionality!
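As a quick sanity check of that claim (nothing new here, just re-using the pdf_file path from above), both functions should have one element per page:

# Both high and low level extraction return one element per page
length(pdftools::pdf_text(pdf_file)) == pdftools::pdf_info(pdf_file)$pages
length(pdftools::pdf_data(pdf_file)) == pdftools::pdf_info(pdf_file)$pages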

Special tale for Ubuntu users

Edit: Benjamin Louis posted a link to a helpful thread about installing Poppler’s latest version and pdftools on Ubuntu 18.04 in the comments. Thank you!

Sadly, as noted in Jeroen's tech note, the new Poppler version that is necessary for pdftools to support low level text extraction… is not available from Linux package managers yet, and I've just switched to Ubuntu from Windows (an accidental New Year's resolution after Windows disappeared from my laptop). Jeroen wrote "Linux users probably have to wait for the latest version of poppler to become available in their system package manager (or compile from source)." I thought, OK, let me just compile Poppler from source then. I'm not sure I'd recommend trying this at home, so take my experience report with a bit of caution. I downloaded Poppler's latest version, unzipped it, created a build folder inside the poppler folder, and ran something like this from within that build folder (inspired by this gist):

cmake ..
make
sudo make install

It seemed to work but then re-installing pdftools failed. Jeroen thankfully advised me to remove my old Poppler installation with a drastic command that affected all software depending on Poppler (that's why you might not want to reproduce this at home), like GIMP or LibreOffice Writer: sudo apt-get remove libpoppler*. But after I did that, I was able to install pdftools using the newest Poppler version.

General strategy

I started work by carefully looking at the PDF. There were one-page daily schedules like on page 1,

pdftools::pdf_render_page(pdf_file, page = 1) %>%
  magick::image_read()

but also schedules continued on several pages,

pdftools::pdf_render_page(pdf_file, page = 3) %>%
  magick::image_read()

Because of that, I decided to extract the dates of the pages that had one, then assign to each undated page the latest available date, before splitting the PDF low level text data by date and working on a pipeline for each date.
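To make the "latest available date" idea concrete, here is a toy sketch (made-up pages and dates, not taken from the PDF) of the forward fill I had in mind, using zoo::na.locf():

# Toy example: pages 3 and 4 have no date of their own, so they inherit
# the date of the latest dated page before them.
pages <- tibble::tibble(page = 1:4,
                        date = as.Date(c("2018-11-07", "2018-11-08", NA, NA)))
dplyr::mutate(pages, date = zoo::na.locf(date))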

Splitting the low level text data by date

As a first step, I wrote a pipeline adding the page number to each data.frame from pdftools::pdf_data().

pdf_data <- pdftools::pdf_data(pdf_file)

read_page <- function(page, pdf_data){
  df <- pdf_data[[page]]
  if(nrow(df) > 0) {
    df$page <- page
    df
  } else {
    NULL
  }
}

pdf_text <- purrr::map(1:pdftools::pdf_info(pdf_file)$pages,
                       read_page,
                       pdf_data = pdf_data)

# remove empty pages
pdf_text <- purrr::discard(pdf_text, is.null)

head(pdf_text[[1]])

##   width height   x  y space       text page
## 1    60     12 220 72  TRUE   SCHEDULE    1
## 2    15     12 283 72  TRUE         OF    1
## 3    23     12 301 72  TRUE        THE    1
## 4    62     12 327 72 FALSE  PRESIDENT    1
## 5    54     12 236 99  TRUE Wednesday,    1
## 6    46     12 293 99  TRUE   November    1

Then for each of them I tried extracting the date.

get_date <- function(page_data){
  if(page_data$text[[1]] == "SCHEDULE"){
    date <- as.character(
      glue::glue_collapse(page_data$text[6:8],
                          sep = " "))
    date <- lubridate::parse_date_time(date, 
                                       orders = "BdY", 
                                       locale = "en_US.utf8", tz = "EST")
    date <- as.Date(date)
  }else{
    date <- NA
  }
  tibble::tibble(date = date,
                 page = page_data$page[1])
}

dates <- purrr::map_df(pdf_text, get_date)

Finally, I created a gigantic data.frame of all the data.frame's from pdftools::pdf_data(), sorted it by page just to be sure it was sorted by page, assigned to each row the latest non-missing date, and then split the data by date, which was a better unit to work from. In order to not mess up coordinates, within each date I added 700, 1400, etc. to the y-coordinates from the second, third, etc. page.

split_days <- function(pdf_text, dates){
  pdf_text %>% 
  dplyr::bind_rows() %>%
  dplyr::left_join(dates, by = "page") %>%
  dplyr::arrange(page) %>%
  dplyr::mutate(date = zoo::na.locf(date)) %>%
  dplyr::group_by(date) %>%
  dplyr::mutate(y = y + (page - min(page)) * 700) %>%
  dplyr::ungroup() %>%
  split(.$date)
}
  
pdf_text <- split_days(pdf_text, dates)

length(pdf_text)

## [1] 50

pdf_text[[2]]

## # A tibble: 243 x 8
##    width height     x     y space text       page date      
##    <int>  <int> <int> <dbl> <lgl> <chr>     <int> <date>    
##  1    60     12   220    72 TRUE  SCHEDULE      2 2018-11-08
##  2    15     12   283    72 TRUE  OF            2 2018-11-08
##  3    23     12   301    72 TRUE  THE           2 2018-11-08
##  4    62     12   327    72 FALSE PRESIDENT     2 2018-11-08
##  5    44     12   240    99 TRUE  Thursday,     2 2018-11-08
##  6    46     12   287    99 TRUE  November      2 2018-11-08
##  7     8     12   336    99 TRUE  8,            2 2018-11-08
##  8    21     12   347    99 FALSE 2018          2 2018-11-08
##  9    19     12    72   140 TRUE  8:00          2 2018-11-08
## 10    17     12    94   140 FALSE AM            2 2018-11-08
## # … with 233 more rows

tail(pdf_text[[2]])

## # A tibble: 6 x 8
##   width height     x     y space text         page date      
##   <int>  <int> <int> <dbl> <lgl> <chr>       <int> <date>    
## 1    28     12   229  1069 FALSE House           3 2018-11-08
## 2    51     12   288   799 TRUE  Christopher     3 2018-11-08
## 3    31     12   342   799 FALSE Liddell         3 2018-11-08
## 4    30     12   288   813 FALSE Closed          3 2018-11-08
## 5    21     12   288   826 TRUE  Oval            3 2018-11-08
## 6    28     12   312   826 FALSE Office          3 2018-11-08

At that point I took a minute to compare the dates I had with the ones in Axios’ clean data.

clean_data <- readr::read_csv("data/axios_data.csv")
unique(clean_data$date[!clean_data$date %in% dates$date])

##  [1] "2018-11-12" "2018-11-16" "2018-11-20" "2018-11-21" "2018-11-22"
##  [6] "2018-12-06" "2018-12-24" "2018-12-25" "2018-12-26" "2018-12-27"
## [11] "2018-12-28" "2018-12-31" "2019-01-01"

Wat?! I had a look at the PDF, and actually, these dates are not even in the PDF I’m fighting, so no need for me to try and revise my R script.

Digesting each date’s data

This part was the most crucial, and most difficult, one. I wanted to obtain a table with one line per event. My strategy was to first extract times by filtering text data whose x-coordinate was in an interval I had decided on by looking at some pages; clean these times to obtain the time, duration and y-coordinate of the top of the time info; and then assign words from the right of the page to each event thanks to fuzzy joins: words correspond to an event if their y-coordinate is between the y-coordinate of that event's time and the y-coordinate of the next event's time.
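To illustrate the fuzzy join idea on a toy example (the event names and coordinates below are made up, not taken from the PDF), fuzzyjoin::interval_left_join() matches each word to the event whose y-interval contains the word's y-coordinate:

# Each event owns a [start, end] interval of y-coordinates; each word is a
# point (start == end), so the interval join assigns words to events.
events <- tibble::tibble(event = c("EXECUTIVE TIME", "LUNCH"),
                         start = c(140, 260), end = c(259, 340))
words <- tibble::tibble(text = c("Location:", "Oval", "Office", "Private"),
                        start = c(150, 150, 150, 270),
                        end = c(150, 150, 150, 270))
fuzzyjoin::interval_left_join(events, words, by = c("start", "end"))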

On top of that I had to decide what to do with pages with times in several timezones, like below

pdftools::pdf_render_page(pdf_file, page = 13) %>%
  magick::image_read()

I simply removed all times that were in parentheses, assumed a time with no timezone information had the same timezone as the last event with one, and that the first event of the day, if lacking a timezone, was in EST. In real life, if I were really working on such data, I'd probably still make such a general assumption but then review the result page by page, preferably with several pairs of eyes.
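As a toy illustration of the parentheses rule (made-up strings mimicking the PDF layout): an entry is dropped when it is both in parentheses and contains an uppercase "M", i.e. when it is a clock time in a second timezone, while durations like "(30 min)" survive.

# Parenthesized second-timezone times contain "AM"/"PM"; durations have no
# uppercase M, so they are kept for the duration extraction step.
times <- c("9:00 AM EST", "(8:00 AM CST)", "(30 min)", "RON:")
times[!(grepl("\\(", times) & grepl("M", times))]
# keeps "9:00 AM EST", "(30 min)" and "RON:"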

get_times <- function(page_data){
  times <- dplyr::filter(page_data,
                         dplyr::between(x, 72, 117))
  rest <- dplyr::filter(page_data,
                        !dplyr::between(x, 72, 117))
  
  times <- times %>%
    dplyr::group_by(y) %>%
    dplyr::summarize(text = glue::glue_collapse(text, sep = " "))
  
  times <- dplyr::filter(times,
                         !(grepl("\\(", text)&grepl("M", text)))
  clean_times <- dplyr::mutate(times,
                               time = dplyr::if_else(
                                 grepl("M", text)|grepl("RON", text),
                                 text,
                                 ""),
                               duration = dplyr::if_else(
                                 grepl("\\(", dplyr::lead(text)),
                                       dplyr::lead(text),
                                       "")
                               )
  clean_times <- dplyr::filter(clean_times, time != "")
  clean_times <- dplyr::select(clean_times, y, time, duration)
  clean_times <- dplyr::mutate(clean_times,
                               timezone = gsub(".*\\:.*M ", "", time), 
                               timezone = ifelse(timezone == time,
                                                 NA, timezone))
  if(is.na(clean_times$timezone[1])){
    clean_times$timezone[1] <- "EST"
  }
  clean_times <- dplyr::mutate(clean_times,
                               timezone = zoo::na.locf(timezone))
  
  clean_times <- dplyr::mutate(clean_times,
                               timezone = trimws(timezone),
                               timezone = gsub(" ", "",timezone),
                               time = gsub("M.*", "M", time))
  
  clean_times <- dplyr::mutate(clean_times,
                               start = y, 
                               end = ifelse(dplyr::lead(y) != start,
                                            dplyr::lead(y) - 1, start), 
                               end = ifelse(is.na(end),
                                            99999, end))
  if(nrow(clean_times) == 0){
    print(page_data$date[1])
    return(NULL)
  }
  
  rest <- dplyr::mutate(rest,
                        start = y, end = y)
  
  df <- fuzzyjoin::interval_left_join(clean_times,
                                      rest,
                                      by = c("start", "end"))
  df <- dplyr::select(df, time, duration, timezone, x, y.y, text)
  df <- dplyr::rename(df, y = y.y)
  
  df <- df %>%
    dplyr::group_by(time, duration, timezone) %>%
    dplyr::mutate(event = as.character(
      glue::glue_collapse(text[y == min(y)], sep = " "))) %>%
    dplyr::group_by(time, duration, timezone, y, event) %>%
    dplyr::summarize(text = as.character(
      glue::glue_collapse(text, sep = " ")
    )) %>%
    dplyr::group_by(time, duration, timezone, event) %>%
    dplyr::mutate(text = ifelse(toupper(event) != event,
                                as.character(
                                  glue::glue_collapse(text, sep = " ")
                                ), text)) %>%
    dplyr::ungroup()
  
  if(nrow(df) == 0){
    df <- tibble::tibble(event = as.character(
      glue::glue_collapse(page_data$text, sep = " ")
    ))
  }
  
  df$date <- page_data$date[1]
  df
}

all_events <- purrr::map_df(pdf_text,
                            get_times)

head(all_events)

## # A tibble: 6 x 7
##   time    duration    timezone     y event         text          date      
##   <chr>   <chr>       <chr>    <dbl> <chr>         <chr>         <date>    
## 1 1:30 PM (3 hr, 30 … EST        302 EXECUTIVE TI… EXECUTIVE TI… 2018-11-07
## 2 1:30 PM (3 hr, 30 … EST        315 EXECUTIVE TI… Location: Ov… 2018-11-07
## 3 11:00 … (30 min)    EST        180 MEETING WITH… MEETING WITH… 2018-11-07
## 4 11:00 … (30 min)    EST        194 MEETING WITH… Location: Ov… 2018-11-07
## 5 11:30 … (1 hr)      EST        221 EXECUTIVE TI… EXECUTIVE TI… 2018-11-07
## 6 11:30 … (1 hr)      EST        234 EXECUTIVE TI… Location: Ov… 2018-11-07

Cleaning each event

At that point I was fairly happy, but I knew that a code review would have been great in real life, and that in any case more work was needed. I wanted to have a duration in minutes for each event as well as a start and end as times, and I wanted each event's info to be in a single row.

I first made each event a row.

gather_events <- function(all_events){
  all_events <- all_events %>%
    dplyr::group_by(event, text) %>%
    dplyr::mutate(clean_text = trimws(gsub(event[1], "", text))) %>%
    dplyr::ungroup() %>%
    dplyr::select(- text) %>%
    dplyr::rename(text = clean_text)
  
  # events with no text
  no_text_events <- all_events %>%
    dplyr::group_by(time, duration, 
                    timezone, date) %>%
    dplyr::filter(all(!nzchar(text))) %>%
    dplyr::summarise(event = unique(event),
                     text = NA,
                     y = min(y))
  
  
  # events with text
  text_events <- all_events  %>%
    dplyr::group_by(time, duration, 
                    timezone, date) %>%
    dplyr::filter(any(nzchar(text))) %>%
    dplyr::filter(nzchar(text)) %>%
    dplyr::summarize(y = min(y),#y gives us the *certain* order of events in a day
                     event = ifelse(length(text[!grepl("\\:", text)]) > 0,
                                    as.character(
                                 glue::glue_collapse(paste(unique(event, text[!grepl("\\:", text)&text!=event]))), sep = " "),
                       event),
                     text = ifelse(length(text[grepl("\\:", text)]) > 0,
                                   toString(text[grepl("\\:", text)]),
                                   ""))
  all_events <- dplyr::bind_rows(no_text_events, text_events)
  all_events <- dplyr::ungroup(all_events)
  all_events <- dplyr::arrange(all_events, date, y)
  all_events
}
all_events <- gather_events(all_events)

Before binding the events with text and no text, I had checked by hand that no events got lost (using dplyr::n_groups()).

head(all_events)

## # A tibble: 6 x 7
##   time    duration   timezone date       event          text              y
##   <chr>   <chr>      <chr>    <date>     <chr>          <chr>         <dbl>
## 1 8:00 AM (3 hr)     EST      2018-11-07 EXECUTIVE TIME Location: Ov…   140
## 2 11:00 … (30 min)   EST      2018-11-07 MEETING WITH … Location: Ov…   180
## 3 11:30 … (1 hr)     EST      2018-11-07 EXECUTIVE TIME Location: Ov…   221
## 4 12:30 … (1 hr)     EST      2018-11-07 LUNCH          Location: Pr…   261
## 5 1:30 PM (3 hr, 30… EST      2018-11-07 EXECUTIVE TIME Location: Ov…   302
## 6 RON:    ""         EST      2018-11-07 The White Hou… ""              342

head(all_events$text)

## [1] "Location: Oval Office"         "Location: Oval Office"        
## [3] "Location: Oval Office"         "Location: Private Dining Room"
## [5] "Location: Oval Office"         ""

Further work aimed at extracting “Location”, “Note”, etc. into specific columns. This code is especially ugly.

extract_info <- function(all_events){
  all_events %>%
  dplyr::group_by(time, duration, timezone, date, event, y) %>%
  dplyr::mutate(location = ifelse(any(grepl("Location", 
                                                        unlist(strsplit(text, ",")))),
                                  gsub("Location\\: ", "", 
                                       unlist(strsplit(text, ","))[grepl("Location", 
                                                        unlist(strsplit(text, ",")))]),
                                  NA),
                project_officer = ifelse(any(grepl("Project Officer", 
                                                        unlist(strsplit(text, ",")))),
                                  gsub("Project Officer\\: ", "", 
                                       unlist(strsplit(text, ","))[grepl("Project Officer", 
                                                        unlist(strsplit(text, ",")))]),
                                  NA),
                press = ifelse(any(grepl("Press", 
                                                        unlist(strsplit(text, ",")))),
                                  gsub("Press\\: ", "", 
                                       unlist(strsplit(text, ","))[grepl("Press", 
                                                        unlist(strsplit(text, ",")))]),
                                  NA)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(location = trimws(location),
                project_officer = trimws(project_officer),
                press = trimws(press))
}
  
all_events <- extract_info(all_events)

At this point, again, in a real work environment I'd spend time checking the data carefully, but instead I'll just have a rough look at Axios' clean data vs my own cleaned data.

dplyr::count(clean_data, listed_project_officer, sort = TRUE) 

## # A tibble: 39 x 2
##    listed_project_officer     n
##    <chr>                  <int>
##  1 <NA>                     369
##  2 Ambassador John Bolton    33
##  3 Sarah Sanders             25
##  4 Lindsay Reynolds          18
##  5 Shahira Knight            18
##  6 William McGinley          18
##  7 Christopher Liddell       13
##  8 John DeStefano            10
##  9 Mercedes Schlapp           9
## 10 Tim Pataki                 7
## # … with 29 more rows

dplyr::count(all_events, trimws(project_officer), sort = TRUE) 

## # A tibble: 32 x 2
##    `trimws(project_officer)`     n
##    <chr>                     <int>
##  1 <NA>                        467
##  2 Ambassador John Bolton       32
##  3 Sarah Sanders                23
##  4 Shahira Knight               18
##  5 William McGinley             16
##  6 Lindsay Reynolds             15
##  7 Christopher Liddell          13
##  8 John DeStefano               10
##  9 Mercedes Schlapp              9
## 10 Tim Pataki                    7
## # … with 22 more rows

dplyr::count(clean_data, listed_location, sort = TRUE) 

## # A tibble: 36 x 2
##    listed_location             n
##    <chr>                   <int>
##  1 Oval office               310
##  2 <NA>                       75
##  3 Oval Office                48
##  4 Private dining room        34
##  5 Buenos Aires, Argentina    21
##  6 oval office                 9
##  7 Grand foyer                 7
##  8 Roosevelt room              6
##  9 Cabinet Room                5
## 10 East Room                   5
## # … with 26 more rows

dplyr::count(all_events, trimws(location), sort = TRUE) 

## # A tibble: 34 x 2
##    `trimws(location)`            n
##    <chr>                     <int>
##  1 Oval Office                 377
##  2 <NA>                        151
##  3 Private Dining Room          35
##  4 Buenos Aires                 18
##  5 Grand Foyer                   8
##  6 Cabinet Room                  7
##  7 Roosevelt Room                7
##  8 Biloxi                        4
##  9 Diplomatic Reception Room     4
## 10 McAllen                       4
## # … with 24 more rows

I surprisingly have more events in my data than in Axios' clean data, which would require further exploration. Part of that might be due to my data listing "RON" as events, where RON means "Remainder of the Night": not an event, but a way to indicate where the President spent the rest of his time that day. On one day, the whole schedule is actually "RON".

dplyr::filter(all_events, date == lubridate::ymd("2018-11-23"))

## # A tibble: 1 x 10
##   time  duration timezone date       event text      y location
##   <chr> <chr>    <chr>    <date>     <chr> <chr> <dbl> <chr>   
## 1 RON:  ""       EST      2018-11-23 Mar-… ""      180 <NA>    
## # … with 2 more variables: project_officer <chr>, press <chr>

For some reason, in its clean data Axios classified that day as executive time, although the PDF does not indicate anything about it.

dplyr::filter(clean_data, date == lubridate::ymd("2018-11-23"))

## # A tibble: 1 x 11
##    week date       time_start time_end duration listed_title top_category
##   <dbl> <date>     <time>     <time>      <dbl> <chr>        <chr>       
## 1     3 2018-11-23 08:00      17:00           9 Executive t… executive_t…
## # … with 4 more variables: listed_location <chr>,
## #   listed_project_officer <chr>, detail_category <chr>, notes <chr>

I then had a go at cleaning times and durations.

clean_time_info <- function(all_events){
  all_events %>%
  dplyr::mutate(start = lubridate::parse_date_time(paste(date, time),
                                                   orders = "Ymd HM p", 
                                                   locale = "en_US.utf8", tz = "EST")) %>%
  dplyr::mutate(duration = gsub("\\(", "", duration),
                            duration = gsub("\\)", "", duration)) %>%
  dplyr::mutate(hours = ifelse(grepl("hr", duration),
                               as.numeric(trimws(gsub("hr.*$", "", duration))),
                               0),
                minutes = ifelse(grepl("min", duration),
                               as.numeric(trimws(stringr::str_remove(
                                 gsub("^.*\\,", "", duration),
                                 "min"))),
                               0)) %>%
  dplyr::mutate(duration = 60*hours + minutes) %>%
  dplyr::select(- hours, - minutes) %>%
  dplyr::group_by(date) %>%
  dplyr::mutate(end = ifelse(is.na(duration),
                             as.POSIXct(dplyr::lead(start),
                                        origin = "1970-01-01"),
                             as.POSIXct(start + lubridate::minutes(duration),
                                        origin = "1970-01-01")))
}
  
all_events <- clean_time_info(all_events)
head(all_events)

## # A tibble: 6 x 12
## # Groups:   date [1]
##   time  duration timezone date       event text      y location
##   <chr>    <dbl> <chr>    <date>     <chr> <chr> <dbl> <chr>   
## 1 8:00…      180 EST      2018-11-07 EXEC… Loca…   140 Oval Of…
## 2 11:0…       30 EST      2018-11-07 MEET… Loca…   180 Oval Of…
## 3 11:3…       60 EST      2018-11-07 EXEC… Loca…   221 Oval Of…
## 4 12:3…       60 EST      2018-11-07 LUNCH Loca…   261 Private…
## 5 1:30…      210 EST      2018-11-07 EXEC… Loca…   302 Oval Of…
## 6 RON:         0 EST      2018-11-07 The … ""      342 <NA>    
## # … with 4 more variables: project_officer <chr>, press <chr>,
## #   start <dttm>, end <dbl>

I stopped there, not willing to work further on the data. More work on timezones, and on filling missing durations based on context, would be useful; a rough sketch of the latter follows. I'd be delighted to see someone clean the dataset more thoroughly than I did here!
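Here is what that duration filling could look like, as a hedged sketch only: fill_durations() is a hypothetical helper of mine (not something I ran), and it assumes missing durations show up as NA or 0 and that start is already a date-time.

# Hypothetical helper: when an event has no duration, derive it from the gap
# until the next event's start on the same day.
fill_durations <- function(events){
  events %>%
    dplyr::group_by(date) %>%
    dplyr::arrange(start, .by_group = TRUE) %>%
    dplyr::mutate(
      gap = as.numeric(difftime(dplyr::lead(start), start, units = "mins")),
      duration = ifelse(is.na(duration) | duration == 0, gap, duration)) %>%
    dplyr::select(-gap) %>%
    dplyr::ungroup()
}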

Conclusion

In this post I took advantage of the nifty pdftools::pdf_data() function to extract text from Trump's PDF schedule along with its coordinates, in the hope of being able to rectangle the data. Part of that went well, in particular thanks to fuzzyjoin and Tidyverse tooling; part of that went less well, in particular due to the inexact data formatting and less than optimal indication of timezones… and my not wanting to spend my whole week cleaning the data. I have no idea how Axios prepared the dataset, but if I were to clean such data for actual work, after scripting as much as possible I'd ask at least one colleague to review the code, and this person and I would also sample a few if not all events and compare their information in the PDF and the resulting data.

The advantage of scripting at least part of the cleaning is reproducibility: being able to explain what happened to the data, but also being able to apply the code to new data. This morning I read that more schedules had been leaked, so I tried my code on them.

pdf_file2 <- "data/newschedules.pdf"
pdf_data2 <- pdftools::pdf_data(pdf_file2)

pdf_text2 <- purrr::map(1:pdftools::pdf_info(pdf_file2)$pages,
                       read_page,
                       pdf_data = pdf_data2) %>%
  purrr::discard(is.null)

dates2 <- purrr::map_df(pdf_text2, get_date)

all_events2  <- pdf_text2 %>%
  split_days(dates2) %>%
  purrr::map_df(get_times) %>%
  gather_events() %>%
  extract_info() %>%
  clean_time_info()

head(all_events2)

## # A tibble: 6 x 12
## # Groups:   date [1]
##   time  duration timezone date       event text      y location
##   <chr>    <dbl> <chr>    <date>     <chr> <chr> <dbl> <chr>   
## 1 8:00…      180 EST      2019-02-04 EXEC… Loca…   141 Oval Of…
## 2 11:0…       30 EST      2019-02-04 MEET… Loca…   180 Oval Of…
## 3 11:3…       15 EST      2019-02-04 EXEC… Loca…   219 Oval Of…
## 4 11:4…       30 EST      2019-02-04 INTE… Proj…   258 Oval Of…
## 5 12:1…       30 EST      2019-02-04 EXEC… Loca…   310 Oval Of…
## 6 12:4…       60 EST      2019-02-04 LUNC… Proj…   349 Private…
## # … with 4 more variables: project_officer <chr>, press <chr>,
## #   start <dttm>, end <dbl>

I think that the fact that my (imperfect!) pipeline seems to work on the new schedules shows the power of scripting one’s data cleaning… as well as the hard work Axios’ journalists did! If you work on the PDF data, I’d be glad to hear how you handled it, so share links to gists and posts in the comments below! If you prefer playing with the data, check out this cool Shiny app by Garrick Aden-Buie.
