epubr 0.6.0 CRAN release

January 10, 2019

(This article was first published on Matt's R Blog, and kindly contributed to R-bloggers)

The epubr R package provides functions supporting the reading and parsing of internal e-book content from EPUB files. It has been updated to v0.6.0 on CRAN. This post highlights new functionality. The key improvements focus on cases where EPUB files have poorly arranged text when loaded into R as a result of their metadata entries and archive file structure.

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier   creator  title language subject date  source   data 
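#>   <chr>   <chr>        <chr>    <chr> <chr>    <chr>   <chr> <chr>    <list>
#> 1 Public~ http://www.~ Bram St~ Drac~ en       Horror~ 1995~ http://~ <tibble [15 x 4]>

x$data[[1]]
#> # A tibble: 15 x 4
#>    section        text                                          nword nchar
#>    <chr>          <chr>                                         <int> <int>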
#>  1 item6          "The Project Gutenberg EBook of Dracula, by ~ 11446 60972
#>  2 item7          "But I am not in heart to describe beauty, f~ 13879 71798
#>  3 item8          "\" 'Lucy, you are an honest-hearted girl, I~ 12474 65522
#>  4 item9          "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day~ 12177 62724
#>  5 item10         "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12806 66678
#>  6 item11         "Once again we went through that ghastly ope~ 12103 62949
#>  7 item12         "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septem~ 12214 62234
#>  8 item13         "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT~ 13990 72903
#>  9 item14         "\"Thus when we find the habitation of this ~ 13356 69779
#> 10 item15         "\"I see,\" I said. \"You want big things th~ 12866 66921
#> 11 item16         "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.~ 11928 61550
#> 12 item17         "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, ~ 13119 68564
#> 13 item18         " \nLater.-Dr. Van Helsing has returned. He ~  8435 43464
#> 14 item19         "End of the Project Gutenberg EBook of Dracu~  2665 18541
#> 15 coverpage-wra~ ""                                                0     0

Restructure parsed content

When reading EPUB files it is ideal to be able to identify meaningful sections to retain via a regular expression pattern, as well as to drop extraneous sections in a similar manner. Pattern matching of this kind is a convenient way to filter rows of the nested text content data frame.

For e-books with poor metadata formatting this is not always possible, or may be possible only after some other pre-processing. epubr provides other functions to assist in restructuring the text table. The Dracula EPUB file included in epubr is a good example to continue with here.

Split and recombine into new sections

This book is internally broken into sections at arbitrary break points, which is why several sections begin in the middle of chapters, as seen above. Other chapters begin in the middle of sections. Use epub_recombine along with a regular expression that can match the true section breaks. This function collapses the full text and then rebuilds the text table using new sections with proper break points. In the process it also recalculates the numbers of words and characters and relabels the sections with chapter notation.
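The collapse-and-split idea can be sketched in base R on a toy example. This is only an illustration of the concept and makes no claim about how epub_recombine is implemented; the `sections` vector, the `PART [0-9]+` marker and the `recombine_sketch` helper are all hypothetical.

```r
# Toy sections broken at arbitrary points (hypothetical data, not from the e-book)
sections <- c("PART 1\nOpening text that runs",
              " past the break.\nPART 2\nSecond part",
              " continues here.")

# Hypothetical helper: collapse all text, then split at the start of each pattern match
recombine_sketch <- function(text, pattern) {
  full <- paste(text, collapse = "")
  starts <- as.integer(gregexpr(pattern, full)[[1]])
  if (starts[1] == -1) return(full)            # no match: nothing to split on
  ends <- c(starts[-1] - 1, nchar(full))       # each piece ends where the next begins
  out <- substring(full, starts, ends)
  # Keep any text that precedes the first match as its own piece
  if (starts[1] > 1) out <- c(substring(full, 1, starts[1] - 1), out)
  out
}

parts <- recombine_sketch(sections, "PART [0-9]+")
length(parts)
#> [1] 2
```

Each rebuilt piece now starts at a true break point, and no text is lost because the pieces paste back together into the original string.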

Fortunately, a reliable pattern exists, which consists of CHAPTER in capital letters followed by a space and some Roman numerals. Recombine the text into a new object.

pat <- "CHAPTER [IVX]+"
x2 <- epub_recombine(x, pat)
x2
#> # A tibble: 1 x 10
#>   rights identifier creator title language subject date  source nchap data 
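#>   <chr>  <chr>      <chr>   <chr> <chr>    <chr>   <chr> <chr>  <int> <list>
#> 1 Publi~ http://ww~ Bram S~ Drac~ en       Horror~ 1995~ http:~    54 <tibble [55 x 4]>

x2$data[[1]]
#> # A tibble: 55 x 4
#>    section text                                                 nword nchar
#>    <chr>   <chr>                                                <int> <int>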
#>  1 prior   "The Project Gutenberg EBook of Dracula, by Bram St~   159  1110
#>  2 ch01    "CHAPTER I\nPage\nJonathan Harker's Journal\n1\n"        7    43
#>  3 ch02    "CHAPTER II\nJonathan Harker's Journal\n14\n"            6    40
#>  4 ch03    "CHAPTER III\nJonathan Harker's Journal\n26\n"           6    41
#>  5 ch04    "CHAPTER IV\nJonathan Harker's Journal\n38\n"            6    40
#>  6 ch05    "CHAPTER V\nLetters-Lucy and Mina\n51\n"                 6    35
#>  7 ch06    "CHAPTER VI\nMina Murray's Journal\n59\n"                6    36
#>  8 ch07    "CHAPTER VII\nCutting from \"The Dailygraph,\" 8 Au~     9    55
#>  9 ch08    "CHAPTER VIII\nMina Murray's Journal\n84\n"              6    38
#> 10 ch09    "CHAPTER IX\nMina Murray's Journal\n98\n"                6    36
#> # ... with 45 more rows

But this is not quite as expected. There should be 27 chapters, not 54. What was not initially apparent was that the same pattern matching each chapter name also appears in the first section where every chapter is listed in the table of contents. The new section breaks were successful in keeping chapter text in single, unique sections, but there are now twice as many as needed. Unintentionally, the first 27 “chapters” represent the table of contents being split on each chapter ID. These should be removed.

An easy way to do this is with epub_sift, which sifts, or filters out, small word- or character-count sections from the nested data frame. It’s a simple sieve and you can control the size of the holes with n. You can choose type = "word" (default) or type = "character". This is a somewhat blunt instrument, but it is useful in a circumstance like this one where it is clear it will work as desired.
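The sieve idea can be sketched in base R on a toy table. This is a rough illustration only, not epub_sift itself: the data are made up, the word count is a plain whitespace split, and whether a section with exactly n words is kept or dropped is an assumption made here.

```r
# Toy nested table (hypothetical data): two table-of-contents fragments
# followed by two sections long enough to be real chapters
sections <- data.frame(
  section = c("ch01", "ch02", "ch03", "ch04"),
  text = c("CHAPTER I\nJonathan Harker's Journal\n1\n",
           "CHAPTER II\nJonathan Harker's Journal\n14\n",
           paste("CHAPTER I", paste(rep("word", 300), collapse = " ")),
           paste("CHAPTER II", paste(rep("word", 250), collapse = " "))),
  stringsAsFactors = FALSE
)

# Rough word count: number of whitespace-separated tokens per section
sections$nword <- lengths(strsplit(trimws(sections$text), "[[:space:]]+"))

# Keep only sections with at least n words (threshold direction is an assumption)
n <- 200
sifted <- sections[sections$nword >= n, ]
sifted$section
#> [1] "ch03" "ch04"
```

The short table-of-contents fragments fall through the sieve, while the surviving rows keep whatever labels they already had.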

library(dplyr)
x2 <- epub_recombine(x, pat) %>% epub_sift(n = 200)
x2
#> # A tibble: 1 x 10
#>   rights identifier creator title language subject date  source nchap data 
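#>   <chr>  <chr>      <chr>   <chr> <chr>    <chr>   <chr> <chr>  <int> <list>
#> 1 Publi~ http://ww~ Bram S~ Drac~ en       Horror~ 1995~ http:~    54 <tibble [27 x 4]>

x2$data[[1]]
#> # A tibble: 27 x 4
#>    section text                                                 nword nchar
#>    <chr>   <chr>                                                <int> <int>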
#>  1 ch28    "CHAPTER IJONATHAN HARKER'S JOURNAL\n(Kept in short~  5694 30602
#>  2 ch29    "CHAPTER IIJONATHAN HARKER'S JOURNAL-continued\n5 M~  5476 28462
#>  3 ch30    "CHAPTER IIIJONATHAN HARKER'S JOURNAL-continued\nWH~  5703 29778
#>  4 ch31    "CHAPTER IVJONATHAN HARKER'S JOURNAL-continued\nI A~  5828 30195
#>  5 ch32    "CHAPTER V\nLetter from Miss Mina Murray to Miss Lu~  3546 18004
#>  6 ch33    "CHAPTER VIMINA MURRAY'S JOURNAL\n24 July. Whitby.-~  5654 29145
#>  7 ch34    "CHAPTER VIICUTTING FROM \"THE DAILYGRAPH,\" 8 AUGU~  5567 29912
#>  8 ch35    "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, 11 o'~  6267 32596
#>  9 ch36    "CHAPTER IX\nLetter, Mina Harker to Lucy Westenra.\~  5910 30129
#> 10 ch37    "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holmw~  5932 30730
#> # ... with 17 more rows

This removes the unwanted rows, but one problem remains: the surviving sections are still labelled ch28 onward and nchap still reports 54, because the removed sections had already offset the chapter indexing. epub_recombine therefore needs to be re-applied after sifting. Another call to epub_recombine can be chained, but it may be more convenient to use the sift argument to epub_recombine, which is applied recursively.

# epub_recombine(x, pat) %>% epub_sift(n = 200) %>% epub_recombine(pat)
x2 <- epub_recombine(x, pat, sift = list(n = 200))
x2
#> # A tibble: 1 x 10
#>   rights identifier creator title language subject date  source nchap data 
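#>   <chr>  <chr>      <chr>   <chr> <chr>    <chr>   <chr> <chr>  <int> <list>
#> 1 Publi~ http://ww~ Bram S~ Drac~ en       Horror~ 1995~ http:~    27 <tibble [27 x 4]>

x2$data[[1]]
#> # A tibble: 27 x 4
#>    section text                                                 nword nchar
#>    <chr>   <chr>                                                <int> <int>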
#>  1 ch01    "CHAPTER IJONATHAN HARKER'S JOURNAL\n(Kept in short~  5694 30602
#>  2 ch02    "CHAPTER IIJONATHAN HARKER'S JOURNAL-continued\n5 M~  5476 28462
#>  3 ch03    "CHAPTER IIIJONATHAN HARKER'S JOURNAL-continued\nWH~  5703 29778
#>  4 ch04    "CHAPTER IVJONATHAN HARKER'S JOURNAL-continued\nI A~  5828 30195
#>  5 ch05    "CHAPTER V\nLetter from Miss Mina Murray to Miss Lu~  3546 18005
#>  6 ch06    "CHAPTER VIMINA MURRAY'S JOURNAL\n24 July. Whitby.-~  5654 29145
#>  7 ch07    "CHAPTER VIICUTTING FROM \"THE DAILYGRAPH,\" 8 AUGU~  5567 29912
#>  8 ch08    "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, 11 o'~  6267 32596
#>  9 ch09    "CHAPTER IX\nLetter, Mina Harker to Lucy Westenra.\~  5910 30129
#> 10 ch10    "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holmw~  5932 30730
#> # ... with 17 more rows
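Why relabelling is needed after sifting can be seen in a tiny base R sketch. The six toy labels and the `keep` vector are hypothetical and only mimic the effect; they do not reproduce what epub_recombine does internally.

```r
# Labels assigned while all six toy sections were still present (hypothetical)
labels <- sprintf("ch%02d", 1:6)

# Pretend the first three sections were sifted out as too short
keep <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

stale <- labels[keep]                          # labels after sifting alone
stale
#> [1] "ch04" "ch05" "ch06"

fresh <- sprintf("ch%02d", seq_len(sum(keep))) # labels after renumbering what is left
fresh
#> [1] "ch01" "ch02" "ch03"
```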

Reorder sections based on pattern in text

Some poorly formatted e-books have their internal sections occur in an arbitrary order. This can be frustrating to work with when doing text analysis on each section and where order matters. Just like recombining into new sections based on a pattern, sections that are out of order can be reordered based on a pattern. This requires a bit more work. In this case the user must provide a function that will map something in the matched pattern to an integer representing the desired row index.

Continue with the Dracula example, but with one difference. Even though the sections were originally broken at arbitrary points, they were in chronological order. To demonstrate the utility of epub_reorder, first randomize the rows so that chronological order can be recovered.

set.seed(1)
x2$data[[1]] <- sample_frac(x2$data[[1]])  # randomize rows for example
x2$data[[1]]
#> # A tibble: 27 x 4
#>    section text                                                 nword nchar
#>    <chr>   <chr>                                                <int> <int>
#>  1 ch08    "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, 11 o'~  6267 32596
#>  2 ch10    "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holmw~  5932 30730
#>  3 ch15    "CHAPTER XVDR. SEWARD'S DIARY-continued.\nFOR a whi~  5803 29705
#>  4 ch22    "CHAPTER XXIIJONATHAN HARKER'S JOURNAL\n3 October.-~  5450 28081
#>  5 ch05    "CHAPTER V\nLetter from Miss Mina Murray to Miss Lu~  3546 18005
#>  6 ch20    "CHAPTER XXJONATHAN HARKER'S JOURNAL\n1 October, ev~  5890 31151
#>  7 ch24    "CHAPTER XXIVDR. SEWARD'S PHONOGRAPH DIARY, SPOKEN ~  6272 32065
#>  8 ch14    "CHAPTER XIVMINA HARKER'S JOURNAL\n23 September.-Jo~  6411 32530
#>  9 ch12    "CHAPTER XIIDR. SEWARD'S DIARY\n18 September.-I dro~  7285 37868
#> 10 ch02    "CHAPTER IIJONATHAN HARKER'S JOURNAL-continued\n5 M~  5476 28462
#> # ... with 17 more rows

It is clear above that sections are now out of order. It is common enough to load poorly formatted EPUB files and yield this type of result. If all you care about is the text in its entirety, this does not matter, but if your analysis involves trends over the course of a book, this is problematic.

For this book, you need a function that will convert an annoying Roman numeral to an integer. You already have the pattern for finding the relevant information in each text section. You only need to tweak it for proper substitution. Here is an example:

f <- function(x, pattern) as.numeric(as.roman(gsub(pattern, "\\1", x)))

This function is passed to epub_reorder. It takes and returns scalars. It must take two arguments: the first is a text string. The second is the regular expression. It must return a single number representing the index of that row. For example, if the pattern matches CHAPTER IV, the function should return a 4.

epub_reorder takes care of the rest. It applies your function to every row in the nested data frame and then reorders the rows based on the full set of indices. Note that it also repeats this for every row (book) in the primary data frame, i.e., for every nested table. This means that the same function will be applied to every book. Therefore, you should only use this in bulk on a collection of e-books if you know the pattern does not change and the function will work correctly in each case.
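The mechanics can be tried in base R on a few toy strings before involving epub_reorder. This is a sketch of the idea rather than the package's code: f is repeated from the definition above so that the block runs on its own, the `txt` vector is made up, and the pattern wraps the Roman numeral in parentheses so that `\\1` can retrieve it.

```r
# Same helper as defined above: keep only the captured Roman numeral, then convert it
f <- function(x, pattern) as.numeric(as.roman(gsub(pattern, "\\1", x)))

# Toy section texts in shuffled order (hypothetical data)
txt <- c("CHAPTER IX Letter, Mina Harker to Lucy Westenra.",
         "CHAPTER II Jonathan Harker's Journal",
         "CHAPTER XIV Mina Harker's Journal")
pat2 <- "^CHAPTER ([IVX]+).*"

# Apply f to one string at a time, as a scalar function
idx <- vapply(txt, f, numeric(1), pattern = pat2, USE.NAMES = FALSE)
idx
#> [1]  9  2 14

# Reorder the rows by the full set of indices
ordered <- txt[order(idx)]
```

After this, `ordered` lists chapter II first, then IX, then XIV.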

The pattern has changed slightly. Parentheses are used to retain the important part of the matched pattern, the Roman numeral. The function f here substitutes the entire string (because now it begins with ^ and ends with .*) with only the part stored in parentheses (in f, this is the \\1 substitution). epub_reorder applies this to all rows in the nested data frame:

x2 <- epub_reorder(x2, f, "^CHAPTER ([IVX]+).*")
x2$data[[1]]
#> # A tibble: 27 x 4
#>    section text                                                 nword nchar
#>    <chr>   <chr>                                                <int> <int>
#>  1 ch01    "CHAPTER IJONATHAN HARKER'S JOURNAL\n(Kept in short~  5694 30602
#>  2 ch02    "CHAPTER IIJONATHAN HARKER'S JOURNAL-continued\n5 M~  5476 28462
#>  3 ch03    "CHAPTER IIIJONATHAN HARKER'S JOURNAL-continued\nWH~  5703 29778
#>  4 ch04    "CHAPTER IVJONATHAN HARKER'S JOURNAL-continued\nI A~  5828 30195
#>  5 ch05    "CHAPTER V\nLetter from Miss Mina Murray to Miss Lu~  3546 18005
#>  6 ch06    "CHAPTER VIMINA MURRAY'S JOURNAL\n24 July. Whitby.-~  5654 29145
#>  7 ch07    "CHAPTER VIICUTTING FROM \"THE DAILYGRAPH,\" 8 AUGU~  5567 29912
#>  8 ch08    "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, 11 o'~  6267 32596
#>  9 ch09    "CHAPTER IX\nLetter, Mina Harker to Lucy Westenra.\~  5910 30129
#> 10 ch10    "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holmw~  5932 30730
#> # ... with 17 more rows

It is important that this is done on a nested data frame that has already been cleaned to the point of not containing extraneous rows that cannot be matched by the desired pattern. If they cannot be matched, then it is unknown where those rows should be placed relative to the others.
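A short base R check shows what goes wrong with a row the pattern cannot match. The `stray` string is a made-up stand-in for a leftover front-matter section, and f is repeated so the block is self-contained.

```r
f <- function(x, pattern) as.numeric(as.roman(gsub(pattern, "\\1", x)))

stray <- "The Project Gutenberg EBook of Dracula"  # does not start with CHAPTER

# gsub() returns a non-matching string unchanged, so the Roman numeral
# conversion fails and no usable row index comes back
idx <- suppressWarnings(f(stray, "^CHAPTER ([IVX]+).*"))
is.na(idx)
#> [1] TRUE
```

With a missing index there is no defined position for that row relative to the others, which is why such rows should be removed first.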

If sections are both out of order and use arbitrary break points, it would be necessary to reorder them before you split and recombine. If you split and recombine first, this would yield new sections that contain text from different parts of the e-book. However, the two are not likely to occur together; in fact it may be impossible for an EPUB file to be structured this way. In developing epubr, no such examples have been encountered. In any event, reordering out-of-order sections essentially requires a human-identifiable pattern near the beginning of each section text string, so it does not make sense to perform this operation unless the sections have meaningful break points.

Other new functions

Word count

The helper function count_words provides word counts for strings, but allows you to control the regular expression patterns used for both splitting the string and conditionally counting the resulting character elements. This is the same function used internally by epub and epub_recombine. It is exported so that it can be used directly.

By default, count_words splits on spaces and newline characters. It counts as a word any element containing at least one alphanumeric character or the ampersand. It ignores everything else as noise, such as extra spaces, empty strings and isolated bits of punctuation.

x <- " This   sentence will be counted to have:\n\n10 (ten) words."
count_words(x)
#> [1] 10
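The default rule described above can be approximated in a few lines of base R. This is a sketch under stated assumptions, not the source of count_words: the helper name `count_words_sketch`, its argument names and the two regular expressions are chosen here for illustration.

```r
# Hypothetical stand-in for the default counting rule
count_words_sketch <- function(x, break_pattern = "[ \n]+",
                               word_pattern = "[[:alnum:]&]") {
  pieces <- strsplit(x, break_pattern)[[1]]  # split on spaces and newlines
  sum(grepl(word_pattern, pieces))           # count pieces that look like words
}

txt <- " This   sentence will be counted to have:\n\n10 (ten) words."
count_words_sketch(txt)
#> [1] 10
```

The leading space produces an empty first piece, which is ignored because it contains no alphanumeric character or ampersand.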

Inspection

Helper functions for inspecting the text in the R console include epub_head and epub_cat.

epub_head provides an overview of the text by section for each book in the primary data frame. The nested data frames are unnested and row bound to one another and returned as a single data frame. The text is shortened to only the first few characters (defaults to n = 50).
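The unnest, row bind and truncate steps can be mimicked in base R on toy data. This only illustrates the shape of the result; the `books` list is hypothetical and the code is not taken from epub_head.

```r
# Two toy nested tables, one per hypothetical book
books <- list(
  data.frame(section = c("ch01", "ch02"),
             text = c(strrep("a", 80), "short text"),
             stringsAsFactors = FALSE),
  data.frame(section = "ch01",
             text = strrep("b", 60),
             stringsAsFactors = FALSE)
)

# Row bind the nested tables, then keep only the first n characters of each text
n <- 50
flat <- do.call(rbind, books)
flat$text <- substr(flat$text, 1, n)
nchar(flat$text)
#> [1] 50 10 50
```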

epub_cat can be used to cat the text of an e-book to the console for quick inspection in a more readable form. It can take several arguments that help slice out a section of the text and customize how it is printed.

Both functions can take an EPUB filename or a data frame of an already loaded EPUB file as their first argument.
