I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.
When I finally jumped on Twitter today, I saw this:
and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.
The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).
There PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:
We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.
We’ll need some packages and the file itself, so let’s get that bit out of the way first:
library(stringi) library(pdftools) library(hrbrthemes) library(ggpomological) library(tidyverse) # grab the PDF text mia_url
Let’s look at those three example pages:
cat(doc[]) ## Defense POW/MIA Accounting Agency ## Personnel Missing - Korea (PMKOR) ## (Reported for ALL Unaccounted For) ## Total Unaccounted: 7,699 ## Name Rank/Rate Branch Year State/Territory ## ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT ## ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA ## ... ## AKERS, HERBERT DALE Cpl UNITED STATES ARMY 1950 INDIANA ## AKERS, JAMES FRANCIS Cpl UNITED STATES MARINE CORPS 1950 VIRGINIA cat(doc[]) ## Name Rank/Rate Branch Year State/Territory ## AKERS, RICHARD ALLEN 1st Lt UNITED STATES ARMY 1951 PENNSYLVANIA ## AKI, CLARENCE HALONA Sgt UNITED STATES ARMY 1950 HAWAII ... ## AMIDON, DONALD PRENTICE PFC UNITED STATES MARINE CORPS 1950 TEXAS ## AMOS, CHARLES GEARL Cpl UNITED STATES ARMY 1951 NORTH CAROLINA cat(doc[[length(doc)]]) ## Name Rank/Rate Branch Year State/Territory ## ZAVALA, FREDDIE Cpl UNITED STATES ARMY 1951 CALIFORNIA ## ZAWACKI, FRANK JOHN Sgt UNITED STATES ARMY 1950 OHIO ## ... ## ZUVER, ROBERT LEONARD Pfc UNITED STATES ARMY 1950 CALIFORNIA ## ZWILLING, LOUIS JOSEPH Cpl UNITED STATES ARMY 1951 ILLINOIS ## This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA). ## Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists. ## Report Prepared: 06/19/2018 11:25
poppler library’s “layout” mode (which
pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:
- there’s a uniform header on each page
- the “layout” mode returned uniform per-page, fixed-width columns
- there’s no “special column tweaks” that some folks use to make PDFs more readable by humans
There are plenty of comments in the code, so I’ll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:
- convert the text to lines
- find the header line
- find the column start/end positions from the header on the page (since they are different for each page)
- reading it in with
- remove headers, preamble and epilogue cruft
- turn it all into one data frame
# we're going to process each page and read_fwf will complain violently # when it hits header/footer rows vs data rows and we rly don't need to # see all those warnings read_fwf_q % flatten_chr() -> lines # want the lines from each page # find the header on the page and get the starting locations for each column keep(lines, stri_detect_regex, "^Name") %>% stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% map(`[`, 1) %>% flatten_int() -> starts # now get the ending locations; cheating and using `NA` for the last column ends
Now the data is both usable and sobering: