PDF Scraping in R with tabulizer

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article comes from Jennifer Cooper, a new student in Business Science University. Jennifer is 35% complete with the 101 course – and shows off her progress in this PDF Scraping tutorial. Jennifer has an interest in understanding the plight of wildlife across the world, and uses her new data science skills to perform a useful analysis – scraping PDF tables of a Report on Endangered Species with the tabulizer R package and visualizing alarming trends with ggplot2.

Scraping PDFs and Analyzing Endangered Species

by Jennifer Cooper, Marketing Data Scientist at Skopos Financial

Hey, everybody! Hope everyone has had a great weekend 😀!!

I’ve been “heads down” this weekend working on a special R project. This week I gave myself a challenge to start using R at work and also come up with a project on the side that I could use to help review what I’ve learned so far in Business Science University’s DS4B 101-R course.

In addition to being passionate about data science, I also love animals and am concerned about the plight of wildlife across the world, particularly with climate change. I decided to take a look at data on critically endangered species.

The only information on Endangered Species I could find was in a PDF format, so I spent a lot of time trying to figure out the nuances of tabulizer for scraping PDF. I finally got it done tonight!

Through this process, I discovered I still need a lot more practice, so I’m going to continue seeing what I can do to apply it at work (figure out how to connect to our SQL database this week), carve out more time to practice, and I may write up an article on working with tabulizer and PDFs.

Interested in learning R? Join me in the 101 course.

My Workflow

Here’s a diagram of the workflow I used:

  1. Start with PDF

  2. Use tabulizer to extract tables

  3. Clean up data into “tidy” format using tidyverse (mainly dplyr)

  4. Visualize trends with ggplot2



My Code Workflow for PDF Scraping with tabulizer

Get the PDF

I analyzed the Critically Endangered Species PDF Report.

PDF Scrape and Exploratory Analysis

Step 1 – Load Libraries

Load the following libraries to follow along.

library(rJava)      # Needed for tabulizer
library(tabulizer)  # Handy tool for PDF Scraping
library(tidyverse)  # Core data manipulation and visualization libraries

Note that tabulizer depends on rJava, which may require some setup. Here are a few pointers:

  • Mac Users: If you have issues connecting Java to R, you can try running sudo R CMD javareconf in the Terminal (per this post)

  • Windows Users: This blog article provides a step-by-step process for installing rJava on Windows machines.

Step 2 – Extracting the Tabular Data from PDF

The tabulizer package provides a suite of tools for extracting data from PDFs. The vignette, “Introduction to tabulizer” has a great overview of tabulizer’s features.

We’ll use the extract_tables() function to pull out each of the tables from the Endangered Species Report. This returns a list of data.frames.

# PDF Scrape Tables
endangered_species_scrape <- extract_tables(
    file   = "2019-09-23-tabulizer/endangered_species.pdf", 
    method = "decide", 
    output = "data.frame")

The table I’m interested in is the first one – the Critically Endangered Species. I’ll extract it using the pluck() function and convert it to a tibble() (the tidy data frame format). I see that I’m going to need to do a bit of cleanup.

# Pluck the first table in the list
endangered_species_raw_tbl <- endangered_species_scrape %>% 
    pluck(1) %>% 
    as_tibble()

# Show first 6 rows
endangered_species_raw_tbl %>% head() %>% knitr::kable()
XX.1X.2X.3Critically.Endangered..CR.X.4X.5X.6X.7X.8
YearMammalsBirdsReptilesAmphibians Fishes InsectsMolluscsOther invertebratesNAPlantsFungi & protists
2019203224303575 549 311658263NA3,02714
2018201224287550 486 300633252NA2,87914
2017202222266552 468 273625243NA2,72210
2016204225237546 461 226586211NA2,5068
2015209218180528 446 176576209NA2,3475

Step 3 – Clean Up Column Names

Next, I want to start by cleaning up the names in my data – which are actually in the first row. I’ll use a trick using slice() to grab the first row, and the new pivot_longer() function to transpose and extract the column names that are in row 1. I can then set_names() and remove row 1.

# Get column names from Row 1
col_names <- endangered_species_raw_tbl %>%
    slice(1) %>%
    pivot_longer(cols = everything()) %>%
    mutate(value = ifelse(is.na(value), "Missing", value)) %>%
    pull(value)

# Overwrite names and remove Row 1
endangered_species_renamed_tbl <- endangered_species_raw_tbl %>%
    set_names(col_names) %>%
    slice(-1)

# Show first 6 rows
endangered_species_renamed_tbl %>% head() %>% knitr::kable()
YearMammalsBirdsReptilesAmphibians Fishes InsectsMolluscsOther invertebratesMissingPlantsFungi & protists
2019203224303575 549 311658263NA3,02714
2018201224287550 486 300633252NA2,87914
2017202222266552 468 273625243NA2,72210
2016204225237546 461 226586211NA2,5068
2015209218180528 446 176576209NA2,3475
2014213213174518 443 168576205NA2,1192

Step 4 – Tidy the Data

There are a few issues with the data:

  1. Remove columns with NAs: Column labelled “Missing” is all NA’s – We can just drop this column

  2. Fix columns that were combined: Three of the columns are combined – Amphibians, Fishes, and Insects – We can separate() these into 3 columns

  3. Convert to (Tidy) Long Format for visualization: The data is in “wide” format, which isn’t tidy – We can use pivot_longer() to convert to “long” format with one observation for each row

  4. Fix numeric data stored as character: The numeric data is stored as character and several of the numbers have commas – We’ll remove commas and convert to numeric

  5. Convert Character Year & species to Factor: The year and species columns are character – We can convert to factor for easier adjusting of the order in the ggplot2 visualizations

  6. Percents by year: The visualizations will have a percent (proportion) included so we can see which species have the most endangered – We can add proportions by each year

endangered_species_final_tbl <- endangered_species_renamed_tbl %>%
    
    # 1. Remove columns with NAs
    select_if(~ !all(is.na(.))) %>%
    
    # 2. Fix columns that were combined
    separate(col  = `Amphibians Fishes Insects`, 
             into = c("Amphibians", "Fishes", "Insects"), 
             sep  = " ") %>%
    
    # 3. Convert to (Tidy) Long Format for visualization
    pivot_longer(cols = -Year, names_to = "species", values_to = "number") %>%
    
    # 4. Fix numeric data stored as character
    mutate(number = str_remove_all(number, ",")) %>%
    mutate(number = as.numeric(number)) %>%
    
    # 5. Convert Character Year & species to Factor
    mutate(Year = as_factor(Year)) %>%
    mutate(species = as.factor(species)) %>%
    
    # 6. Percents by year
    group_by(Year) %>%
    mutate(percent = number / sum(number)) %>%
    mutate(label = scales::percent(percent)) %>%
    ungroup()

# Show first 6 rows
endangered_species_final_tbl %>% head() %>% knitr::kable()
Yearspeciesnumberpercentlabel
2019Mammals2030.03313203.3%
2019Birds2240.03655953.7%
2019Reptiles3030.04945324.9%
2019Amphibians5750.09384699.4%
2019Fishes5490.08960349.0%
2019Insects3110.05075895.1%

Step 5 – Visualize the Data

Summary Visualization

I made a summary visualization using stacked bar chart to show the alarming trends of critically endangered species over time.

endangered_species_final_tbl %>%
    mutate(Year = fct_rev(Year)) %>%
    
    ggplot(aes(x = Year, y = number, fill = species)) +
    
    # Geoms
    geom_bar(position = position_stack(), stat = "identity", width = .7) +
    geom_text(aes(label = label), position = position_stack(vjust= 0.5), size = 2) +
    coord_flip() +
    
    # Theme
    labs(
        title = "Critically Endangered Species",
        y = "Number of Species Added to Critically Endangered List", x = "Year"
    ) +
    theme_minimal()

plot of chunk unnamed-chunk-6

I then faceted the species and visualized the trend over time using a smoother (geom_smooth). Again, we see that each of the species exhibit increasing trends.

endangered_species_final_tbl %>%
    mutate(Year = fct_rev(Year)) %>%
    
    # Geom
    ggplot(aes(Year, number, color = species, group = species)) +
    geom_point() +
    geom_smooth(method = "loess") + 
    facet_wrap(~ species, scales = "free_y", ncol = 3) +
    
    # Theme
    expand_limits(y = 0) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(
        title = "Critically Endangered Species",
        subtitle = "Trends Not Improving",
        x = "", y = "Changes in Number of Species in Threatened Category"
    )

plot of chunk unnamed-chunk-7

Parting Thoughts

It was really exciting to see my hard work pay off. It took a bit to get going, but I found that tabulizer made PDF extraction manageable. The most challenging part was getting the data into a format that can be easily visualized (the tidyverse really helped as shown in Step 4!). I was particularly excited to see results of my analysis, and I want to share with others the effects of

If you’d like to join me, I’m currently learning Data Science for Business in Business Science’s 101 course (Data Science Foundations), and I’ve signed up for 201 Advanced Machine Learning and 102 Shiny Web Applications.

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)