Extracting PDF Text with R and Creating Tidy Data

March 12, 2018

In today's digital age, data comes in many forms. Many of the more common file types, like CSV, XLSX, and plain text (TXT), are easy to access and manage. Yet sometimes the data we need is locked away in a less accessible file format, such as a PDF. If you have ever found yourself in this dilemma, fret not: pdftools has you covered.

In this post, you will learn how to use pdftools to extract text from a PDF, use the stringr package to manipulate strings of text, and create a tidy data set. In anticipation of March Madness, and being a University of Cincinnati alumnus along with some of my other Datazar constituents, I have chosen to extract season statistics from the UC men’s basketball team. In the end, I will create a tibble showing season statistics for minutes played, field goal percentage, total points, and average points per game for each player.

The first step is to load the packages that are needed using library(). The stringr package is a member of the tidyverse collection of R packages (more on that here if you are not familiar). The packages in that collection are designed to make data science easy. I highly recommend purchasing R for Data Science by Hadley Wickham and Garrett Grolemund. It is a great book for beginners as well as a pocket reference for more advanced programmers. I use this book almost every day; it goes where I go.
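As a rough sketch, the loading step might look like the block below. The exact set of packages is my inference from the functions named later in this post (pdf_text, read_lines, the stringr functions, ldply, unite, select, and tibbles), so treat the list as an assumption. All of them must already be installed.

```r
# Packages inferred from the functions used in this post (an assumption,
# not necessarily the exact list in the original notebook).
library(pdftools)  # pdf_text()
library(readr)     # read_lines()
library(stringr)   # str_squish(), str_replace_all()
library(plyr)      # ldply(); loaded before dplyr to avoid masking problems
library(tidyr)     # unite()
library(dplyr)     # select()
library(tibble)    # as_tibble()
```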

The next step is to load your PDF into your Datazar project. I am going to call my new object ‘UC_text’ and use the pdf_text() function to read the text of my file, which returns one character string per page. The read_lines() function then splits that text into individual lines.
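Here is a minimal sketch of the reading step. Since the PDF itself cannot be bundled with this post, the example uses a short invented string as a stand-in for one page of pdf_text() output; the file name in the comment and the object name UC_lines are hypothetical.

```r
library(readr)

# With a real file, the text would come from pdftools, for example:
#   UC_text <- pdftools::pdf_text("season_stats.pdf")  # hypothetical file name
# pdf_text() returns one string per page, with lines separated by "\n".

# Invented stand-in for a single page of text (not real UC figures).
UC_text <- "Season Statistics\nPlayer   gp-gs   min\nDoe, John   30-30   900"

# read_lines() treats a string that contains a newline as literal data
# and returns one element per line.
UC_lines <- read_lines(UC_text)
UC_lines
```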

I want to focus on the season statistics of the players, which make up lines 9 through 24 of our new file. Line 9 consists of the column names of our resulting data frame. I am going to call this new object season_stats.
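The subsetting itself is ordinary vector indexing. The sketch below uses placeholder text for the lines, and the name UC_lines for the vector of lines is my own assumption.

```r
# Placeholder stand-in for the lines read from the PDF.
UC_lines <- paste("line", 1:30)

# Keep lines 9 through 24; the first kept line is the header row.
season_stats <- UC_lines[9:24]
length(season_stats)
```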

In the next series of steps, I will use functions in the stringr package to manipulate the lines of text into a desirable form. The first problem to tackle is the whitespace between the different elements in each line of text. The str_squish() function trims each string and collapses every run of repeated whitespace inside it to a single space. I also need to remove the comma between each player’s first and last name. I’ll use str_replace_all() to remove the comma.
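A small sketch of these two cleaning steps follows. The input lines are invented to mimic the layout of a statistics table and are not real UC figures; the name clean_stats is hypothetical.

```r
library(stringr)

# Invented lines standing in for season_stats (header plus one player row).
season_stats <- c(
  "Player        gp-gs    min   avg",
  "Doe, John     30-30    900   30.0"
)

# Collapse runs of whitespace to single spaces and trim the ends.
squished <- str_squish(season_stats)

# Remove every comma so the name parts are separated by spaces only.
clean_stats <- str_replace_all(squished, ",", "")
clean_stats
```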

After the whitespace and the commas have been removed, I can focus on separating each element. I will use strsplit() to split each string into substrings.
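To illustrate, here is base R's strsplit() applied to two invented, already-cleaned lines. Splitting on a single space is my assumption about the separator; it works because the whitespace was squished beforehand.

```r
# Invented cleaned lines (header plus one player row).
clean_stats <- c("Player gp-gs min avg", "Doe John 30-30 900 30.0")

# strsplit() returns a list with one character vector per input string.
all_stats_lines <- strsplit(clean_stats, " ")
str(all_stats_lines)

# The header has one fewer element than the player row, because the
# player's name occupies two elements.
lengths(all_stats_lines)
```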

The structure of our new all_stats_lines object is a list. Let’s focus now on the first element, which will become the column names of our data frame. There are two issues here: (1) three elements are named ‘avg’, and (2) there is only one element named ‘Player’, but each player’s name is split between two columns (I’ll fix that later). For now, I’ll focus on changing the column names. I’ll do that by subsetting the first element and transforming the list into a character vector using unlist(). Once the names are in a character vector, I can easily assign new values to them.

The 5th, 15th, and 23rd elements of var_lines are all named ‘avg’. Based on the preceding elements of the vector (and some basketball know-how), we can infer that these elements represent average minutes played, average rebounds, and average points, respectively. I will rename these elements ‘avg_min’, ‘avg_reb’, and ‘avg_pts’.
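The renaming can be sketched as follows. The header here is filled with placeholder names (col1, col2, ...) because the real column names are not listed in this post; only the three ‘avg’ positions come from the text. The name ‘avg_reb’ for the rebounds column is my assumption, chosen to match "average rebounds".

```r
# Placeholder header with 'avg' at positions 5, 15, and 23.
header <- paste0("col", 1:23)
header[c(5, 15, 23)] <- "avg"
all_stats_lines <- list(header)  # the header is the first list element

# Subset the first element and flatten it to a character vector.
var_lines <- unlist(all_stats_lines[1])
which(var_lines == "avg")

# Give each duplicate a distinct name.
var_lines[c(5, 15, 23)] <- c("avg_min", "avg_reb", "avg_pts")
anyDuplicated(var_lines)
```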

Now that I have finalized my column names, I will focus on the rows of player data.

My next big hurdle is to transform my list of player statistics into a data frame. I will use the ldply() function in the plyr package, which applies a function to each element in a list and combines the results into a data frame.
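Below is a minimal sketch of the list-to-data-frame step with invented rows. It assumes every row has the same number of elements; ldply() is called without a function, so each vector simply becomes one row with default column names V1, V2, and so on. The name player_lines is hypothetical.

```r
library(plyr)

# Invented player rows, already split into elements (not real UC figures).
player_lines <- list(
  c("Doe", "John", "30-30", "900"),
  c("Roe", "Jane", "28-10", "750")
)

# Each equal-length vector becomes one row of the resulting data frame.
stats_df <- ldply(player_lines)
stats_df
```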

Now it is time to circle back to the problem of the player names. Remember, the number of column names does not match the number of columns in the rows of game statistics, because each player’s name is split between two columns (‘V1’ and ‘V2’) in our stats_df object.

To combine the columns with each player’s first and last names, I will use the unite() function from the tidyr package.
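A short sketch of the combining step, using an invented stand-in for stats_df. The new column name ‘Player’ and the space separator are my assumptions; by default unite() joins with an underscore and drops the input columns.

```r
library(tidyr)

# Invented stand-in for stats_df with the name split across V1 and V2.
stats_df <- data.frame(
  V1 = c("Doe", "Roe"),
  V2 = c("John", "Jane"),
  V3 = c("900", "750"),
  stringsAsFactors = FALSE
)

# Paste V1 and V2 into a single column; V1 and V2 are removed.
stats_df <- unite(stats_df, "Player", V1, V2, sep = " ")
stats_df
```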

Now that our columns align, I can assemble the final data frame. The first step is to attach the column names using colnames(). I also want to convert my final data frame into a tibble. There are many reasons that working with tibbles can make your life as a data scientist easy (more on that here). One of them is that tibbles easily handle non-syntactic variable names. To refer to non-syntactic variables, you must surround them in backticks.
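As an illustration, the sketch below attaches names to a tiny invented data frame and converts it to a tibble. Only the `fg%` name is taken from this post; the other names, the values, and the object name final_df are made up for the example.

```r
library(tibble)

# Invented stand-in for stats_df after the name columns were united.
stats_df <- data.frame(
  Player = c("Doe John", "Roe Jane"),
  V3 = c("900", "750"),
  V4 = c("45.5", "51.2"),
  stringsAsFactors = FALSE
)

# Attach the cleaned-up column names, including a non-syntactic one.
var_lines <- c("Player", "min", "fg%")
colnames(stats_df) <- var_lines

# as_tibble() keeps the names as they are.
final_df <- as_tibble(stats_df)

# Non-syntactic names must be wrapped in backticks when referenced.
final_df$`fg%`
```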

I am mostly interested in the statistics related to scoring. So, for my final data set I am going to choose specific variables using select(). Because player number and field goal percentage (`fg%`) contain special characters, I have surrounded them in backticks.
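The selection step might look like this. The tibble below is invented, and apart from `fg%` and ‘avg_pts’ the column names (min, reb, pts) are my guesses at plausible labels rather than the names in the real data.

```r
library(dplyr)
library(tibble)

# Invented stand-in for the full tibble (not real UC figures).
final_df <- tibble(
  Player  = c("Doe John", "Roe Jane"),
  min     = c(900, 750),
  `fg%`   = c(45.5, 51.2),
  reb     = c(150, 210),
  pts     = c(420, 390),
  avg_pts = c(14.0, 13.9)
)

# Keep only the scoring-related columns; `fg%` needs backticks.
scoring <- select(final_df, Player, min, `fg%`, pts, avg_pts)
scoring
```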

I now have a clean, tidy final data set ready for analysis, visualization, or export. I hope you learned something new and useful to add to your data wrangling toolkit.

Link to the notebook (feel free to copy it): https://www.datazar.com/file/f959f24d7-7f58-4723-88d5-00dd1af63348

Like what you saw here? Visit www.datazar.com to create Notebooks for free and show us what you built @datazarhq.

Extracting PDF Text with R and Creating Tidy Data was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
