(This article was first published on **R – Tales of R**, and kindly contributed to R-bloggers)

# Introduction

There are several ways to mine tables and other content from a pdf using R. After a lot of trial and error, here’s how I managed to extract the global results of a large international exam held every year, the EDAIC.

This is my first use case of *“pdf mining”* with R, and a fairly simple one. More complex and very fine examples can be found elsewhere, using both the pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit **code** and a binary result: “**FAIL / PASS**”. I was particularly interested in seeing how many candidates passed the exam, as an indirect measure of how *“hard”* it can be.

# Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

`install.packages("pdftools")`

```
library(pdftools)
txt <- pdf_text("EDAIC.pdf")
txt[1]
class(txt[1])
```

` [1] "EDAIC Part I 2017 Overall Results\n Candidate N° Result\n 107131 FAIL\n 119233 PASS\n 123744 FAIL\n 127988 FAIL\n 133842 PASS\n 135692 PASS\n 140341 FAIL\n 142595 FAIL\n 151479 PASS\n 151632 PASS\n 152787 PASS\n 157691 PASS\n 158867 PASS\n 160211 PASS\n 161970 FAIL\n 162536 PASS\n 163331 PASS\n 164442 FAIL\n 164835 PASS\n 165734 PASS\n 165900 PASS\n 166469 PASS\n 167241 FAIL\n 167740 PASS\n 168151 FAIL\n 168331 PASS\n 168371 FAIL\n 168711 FAIL\n 169786 PASS\n 170721 FAIL\n 170734 FAIL\n 170754 PASS\n 170980 PASS\n 171894 PASS\n 171911 PASS\n 172047 FAIL\n 172128 PASS\n 172255 FAIL\n 172310 PASS\n 172706 PASS\n 173136 FAIL\n 173229 FAIL\n 174336 PASS\n 174360 PASS\n 175177 FAIL\n 175180 FAIL\n 175184 FAIL\nYour candidate number is indicated on your admission document Page 1 of 52\n"`

` [1] "character"`

These commands return a lengthy *blob* of text. Fortunately, there are some `\n` symbols that signal the line breaks in the original document.

We will use these to split the *blob* into something more approachable, using `tidyversal` methods…

- Split the *blob*.
- Transform the resulting `list` into a `character vector` with `unlist`.
- Trim leading and trailing white spaces with `stringr::str_trim`.

```
library(tidyverse)
library(stringr)
tx2 <- strsplit(txt, "\n") %>% # split at the newline characters
  unlist() %>%
  str_trim(side = "both") # trim leading and trailing white spaces
tx2[1:10]
```

```
[1] "EDAIC Part I 2017 Overall Results"
[2] "Candidate N° Result"
[3] "107131 FAIL"
[4] "119233 PASS"
[5] "123744 FAIL"
[6] "127988 FAIL"
[7] "133842 PASS"
[8] "135692 PASS"
[9] "140341 FAIL"
[10] "142595 FAIL"
```
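To see why `unlist()` is needed here, a minimal sketch in base R on a made-up two-page vector may help. The `pages` object and its contents are invented for illustration, and `trimws()` stands in for `str_trim()`. `strsplit()` returns a list with one element per page, which `unlist()` then flattens into a single character vector of lines.

```r
# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c("Title\n 107131   FAIL\n 119233   PASS\n",
           "Title\n 123744   FAIL\n")

by_page <- strsplit(pages, "\n")   # a list: one character vector per page
length(by_page)                    # one element per page

# Flatten the list into one vector, then trim both ends of every line.
page_lines <- trimws(unlist(by_page))
page_lines
```

Because the pages are flattened in order, the lines keep the order they had in the document.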

- Remove the very first row.
- Transform into a `tibble`.

```
tx3 <- tx2[-1] %>%
  tibble() # data_frame() is deprecated in favour of tibble()
tx3
```

```
# A tibble: 2,579 x 1
.
<chr>
1 Candidate N° Result
2 107131 FAIL
3 119233 PASS
4 123744 FAIL
5 127988 FAIL
6 133842 PASS
7 135692 PASS
8 140341 FAIL
9 142595 FAIL
10 151479 PASS
# ... with 2,569 more rows
```

- Use `tidyr::separate` to split each row into two columns.
- Remove all spaces.

```
# split each row at the first space, then drop any remaining white space
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value))
tx4
```

```
# A tibble: 2,579 x 2
   key       value
 1 Candidate N°Result
 2 107131    FAIL
 3 119233    PASS
 4 123744    FAIL
 5 127988    FAIL
 6 133842    PASS
 7 135692    PASS
 8 140341    FAIL
 9 142595    FAIL
10 151479    PASS
# ... with 2,569 more rows
```
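The header row ends up as `Candidate` / `N°Result` because, with two target columns and `extra = "merge"`, the split happens only at the first space and everything after it is kept together. Here is a rough base-R sketch of the same idea on two invented lines. The run of spaces between the code and the result is an assumption about how the raw text looks, and `sub()` is swapped in for `separate()`.

```r
# Hypothetical trimmed lines; the wide gap between columns is an assumption.
x <- c("107131        FAIL",
       "Your candidate number is indicated on your admission document")

key   <- sub(" .*$", "", x)      # everything before the first space
value <- sub("^[^ ]* ", "", x)   # everything after the first space

# Same clean-up as the mutate() calls: drop all remaining white space.
value <- gsub("\\s+", "", value)

data.frame(key, value)
```

Lines that are not table rows survive this step with nonsense values, which is why a filtering step is still needed afterwards.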

- Remove rows that do not represent table elements.

```
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),]
tx5
```

```
# A tibble: 2,424 x 2
   key    value
 1 107131 FAIL
 2 119233 PASS
 3 123744 FAIL
 4 127988 FAIL
 5 133842 PASS
 6 135692 PASS
 7 140341 FAIL
 8 142595 FAIL
 9 151479 PASS
10 151632 PASS
# ... with 2,414 more rows
```
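The `'^[0-9]'` pattern keeps any row whose key merely starts with a digit, which was enough for this pdf. If a stricter check is wanted, one possible variant (not the method used above) is to require exactly six digits in the key and a `PASS` or `FAIL` value. The `tab` data frame below is invented for illustration.

```r
# Hypothetical key/value pairs, as they might look after the splitting step.
tab <- data.frame(
  key   = c("Candidate", "107131", "119233", "Your", "1a"),
  value = c("NResult", "FAIL", "PASS", "candidatenumber", "PASS"),
  stringsAsFactors = FALSE
)

# Loose filter, as in the post: the key only has to start with a digit.
loose <- tab[grep("^[0-9]", tab$key), ]

# Stricter filter: a six-digit key and a value that is PASS or FAIL.
ok <- grepl("^[0-9]{6}$", tab$key) & tab$value %in% c("PASS", "FAIL")
strict <- tab[ok, ]

nrow(loose)
strict
```

In this toy data the loose filter lets the malformed key `1a` through, while the strict one drops it.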

# Extracting the results

We already have the table! Now it’s time to get to the summary:

```
library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  mutate(percent = paste(round(count / sum(count) * 100, 1), "%")) %>%
  kable()
```

| value | count | percent |
|---|---|---|
| FAIL | 1017 | 42 % |
| PASS | 1407 | 58 % |
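As a quick sanity check, the percentages can be recomputed from the two counts alone. This is only a sketch in base R using the numbers reported in the table; the `counts` vector is typed in by hand rather than derived from the pdf.

```r
# Counts copied from the summary table above.
counts <- c(FAIL = 1017, PASS = 1407)

total   <- sum(counts)                     # should match the 2,424 rows of tx5
percent <- round(100 * counts / total, 1)  # same rounding as in the summary

total
percent
```

The total matches the number of rows left after filtering, so every remaining row was counted as either a pass or a fail.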

From these results we see that the **EDAIC-Part1** exam doesn’t have a particularly high pass rate. It is taken by medical specialists, and its difficulty lies in the *very* broad list of subjects covered, ranging from applied physics and the whole of human physiology to pharmacology, clinical medicine and the latest guidelines.

Despite being a hard test to pass, and despite the exam fee, it’s becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.
