Extract text from pdf in R and word Detection

finnstats

9 months ago

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers].

To extract text from a PDF in R, first install the pdftools package from CRAN.

install.packages("pdftools")

Load the package

library("pdftools")

The PDF file must either be saved in a local directory or downloaded from the web. Here we use a sample document hosted online.

Store the link in the pdf.file variable.

pdf.file <- "https://file-examples-com.github.io/uploads/2017/10/file-sample_150kB.pdf"

Set the working directory

setwd("D:/RStudio/PDFEXTRACT/")

Let’s download the demo pdf file into the local directory

download.file(pdf.file, destfile = "sample.pdf", mode = "wb")

The pdf_text() function returns a character vector whose length equals the number of pages in the file, with one element per page.

Extract text from pdf in R

Now we can extract the text from all pages.

pdf.text <- pdftools::pdf_text("sample.pdf")

To display the text of the second page, use the code below:

cat(pdf.text[[2]])

Only part of the output is shown here:

In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam
est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat
et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis
tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque
scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam
lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
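Each page comes back as a single string with embedded newlines, so it can be handy to split a page into individual lines before further processing. The following is a minimal sketch using base R only; the `page` string is a hypothetical stand-in for one element of pdf.text, not real output from the sample file.

```r
# Hypothetical stand-in for one page of extracted text (e.g. pdf.text[[2]])
page <- "In non mauris justo.\nDuis vehicula mi vel mi pretium.\n\nCras aliquam est ac eros varius."

# Split the page on newline characters; strsplit() returns a list, so take element 1
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Trim surrounding whitespace and drop empty lines
page_lines <- trimws(page_lines)
page_lines <- page_lines[nzchar(page_lines)]

print(page_lines)
```

With real data, replacing `page` with pdf.text[[2]] should give one vector element per line of the second page, assuming the layout of the PDF extracts cleanly.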

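To search for a particular word across these pages, convert the text to lower case so that the match is case-insensitive. pdf_text() already returns a character vector, so the unlist() call is harmless but not strictly required.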

pdf.text <- unlist(pdf.text)
pdf.text <- tolower(pdf.text)

Suppose we want to find the page numbers on which the word “Suspendisse” appears:

library(stringr)
# Flag each page (one row per page) that contains the word
res <- data.frame(Result = str_detect(pdf.text, "suspendisse"))
# Keep only the matching pages; the row names are the page numbers
res <- subset(res, Result)
row.names(res)

Output

[1] "2" "3"

The word “suspendisse” appears on pages 2 and 3.
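The same lookup can also be written without a data frame. The sketch below is an alternative in base R, not the method used above: `find_pages` is a hypothetical helper name, and `pages` is a made-up three-page vector standing in for the output of pdf_text().

```r
# Hypothetical stand-in for pdf_text() output: one string per page
pages <- c(
  "Lorem ipsum dolor sit amet.",
  "Suspendisse potenti. In non mauris justo.",
  "Cras justo mi, suspendisse ut purus."
)

# Hypothetical helper: return the integer page numbers whose text contains `word`.
# Both sides are lower-cased so the match is case-insensitive, and fixed = TRUE
# treats `word` as a literal string rather than a regular expression.
find_pages <- function(pages, word) {
  which(grepl(tolower(word), tolower(pages), fixed = TRUE))
}

hits <- find_pages(pages, "Suspendisse")
print(hits)
```

Unlike row.names(), which returns character strings, which() returns integers, so the result can be used directly to index the page vector, for example pages[hits].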

Conclusion

This article described how to extract text from PDF files in R and how to detect a particular word in the extracted text.
