Announcing pdftools 1.0

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers.]

This week we released version 1.0 of the rOpenSci pdftools package to CRAN. The package provides utilities for extracting text, fonts, attachments and other data from PDF files. It also supports rendering PDF files into bitmap images.

This release has a few internal enhancements and fixes an annoying bug for landscape PDF pages. The version bump to 1.0 signifies that the package has undergone sufficient testing and the API is stable.

Extracting Text

As described in our previous post, the most common use of pdftools is extracting text from (scientific) articles for searching / indexing. But let's try a somewhat more unusual PDF file this time: a poster.

library(pdftools)
url <- "https://www.rstudio.com/wp-content/uploads/2016/02/advancedR.pdf"

# Display author, editor
pdf_info(url)

The pdf_info function returns all kinds of metadata from the PDF file. For example, we can read that this PDF was created on 2016-02-12 by Arianne Colton using Acrobat PDFMaker 11 for PowerPoint.
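As a minimal sketch of working with that metadata, the snippet below builds a small throwaway PDF with base R's pdf() device, so it runs without a network connection, and pulls individual fields out of the list returned by pdf_info. The temporary file and the variable names are illustrative only, and the exact set of fields may differ between pdftools versions.

```r
library(pdftools)

# Create a small two-page PDF to inspect (illustrative; any PDF path or URL works)
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
plot(1:10, main = "First page")
plot(10:1, main = "Second page")
dev.off()

info <- pdf_info(tmp)

# pdf_info returns a named list; individual fields are accessed with $
info$pages    # number of pages
info$created  # creation timestamp
info$keys     # document keys such as Creator and Producer, when present
```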

# extract text vector
text <- pdf_text(url)

# Print text from page 1
cat(text[1])

The pdf_text function extracts text into an R character vector of length equal to the number of pages in the PDF.

Note how the text is spaced to match the position in the PDF page.
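Because pdf_text returns one string per page, a simple page-level search can be sketched with base R alone. The example below is only an illustration under stated assumptions: it writes its own two-page PDF with the pdf() device instead of downloading the poster, and the page contents and the keyword are made up.

```r
library(pdftools)

# Build a two-page PDF with known text on each page (illustrative content)
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
plot.new(); text(0.5, 0.5, "alpha beta")
plot.new(); text(0.5, 0.5, "gamma delta")
dev.off()

pages <- pdf_text(tmp)

# One element per page, so the length matches the page count from pdf_info
length(pages) == pdf_info(tmp)$pages

# Find the pages that mention a keyword
hits <- which(grepl("gamma", pages, fixed = TRUE))
hits
```

The same pattern applies to the character vector extracted from the poster above; only the file and the keyword change.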

Rendering PDF

Recent versions of pdftools allow rendering of PDF pages into bitmap images. The pdf_render_page function returns the bitmap as a raw array with dimensions channels * width * height (in pixels).

library(pdftools)
bitmap <- pdf_render_page(url, page = 1, dpi = 72)
dim(bitmap)
## 4 1100  850
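To see how the dpi argument affects the size of the bitmap, the following sketch renders the same page at two resolutions and compares the dimensions. It assumes a small PDF generated locally with the pdf() device; the file and the chosen dpi values are illustrative.

```r
library(pdftools)

# Generate a one-page PDF; width and height are given in inches
tmp <- tempfile(fileext = ".pdf")
pdf(tmp, width = 4, height = 3)
plot(1:10)
dev.off()

low  <- pdf_render_page(tmp, page = 1, dpi = 72)
high <- pdf_render_page(tmp, page = 1, dpi = 144)

# Dimensions are channels x width x height; width and height grow with dpi
dim(low)
dim(high)
```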

From here we can use, for example, the rOpenSci magick package to read the bitmap and manipulate it or export it to various formats.

library(magick)
poster <- image_read(bitmap)
print(poster)
image_write(poster, "out.png", format = "png")

Or have some fun with the other magick tools 🙂

# Download dancing banana
banana <- image_read("https://jeroenooms.github.io/images/banana.gif")
banana <- image_scale(banana, "300")

# Combine and flatten frames
frames <- lapply(banana, function(frame) {
  image_composite(poster, frame, offset = "+70+30")
})

# Turn frames into animation
animation <- image_animate(image_join(frames))
print(animation)

# Save as gif
image_write(animation, "output.gif")
