epubr 0.5.0 CRAN release

This article was first published on Matt's R Blog and kindly contributed to R-bloggers.

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. This post briefly highlights the changes from v0.4.0. See the vignette for a more comprehensive introduction.

Minor addition

There is not much to see with the upgrade to version 0.5.0. Only one user function has been added, epub_cat. All this does is allow you to cat chunks of parsed e-book text to the console in a more readable manner. This can be helpful to get a quick glimpse of the content you are working with in a way that is easier on the eyes than looking at table entries and object structures.

Arguments to epub_cat give you control over the formatting. The function is not intended to serve any purpose beyond this kind of human-guided content inspection. Arguments include:

  • max_paragraphs
  • skip
  • paragraph_spacing
  • paragraph_indent
  • section_sep
  • book_sep

See the help documentation for details.
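
As a rough sketch of how these arguments might be combined, the call below uses the dracula.epub file bundled with the package. The specific values, and the comments describing what each argument does, are illustrative assumptions rather than documented defaults; check ?epub_cat for the accepted types.

```r
library(epubr)

# EPUB file bundled with epubr
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file)

# Print two paragraphs after skipping the first 100.
# All values below are illustrative assumptions; see ?epub_cat.
epub_cat(
  x,
  max_paragraphs = 2,
  skip = 100,
  paragraph_spacing = 2,   # assumed: blank lines between paragraphs
  paragraph_indent = 4,    # assumed: spaces of indent per paragraph
  section_sep = "----",    # assumed: string printed between sections
  book_sep = "====\n===="  # assumed: string printed between books
)
```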

Minor change

epub_cat, and now epub_head for consistency, take a more generically named first argument, x, rather than data or file. This is because these summary functions can now be used either on a filename string pointing to an external EPUB file that does not otherwise need to be read into R or on a data frame already read into R by epub. The latter lets you avoid reading the files from disk multiple times if the data is already in your R session.
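
A minimal sketch of the two ways to supply x, assuming the dracula.epub file bundled with the package. Only the first argument is shown for epub_head; any other arguments are left at their defaults.

```r
library(epubr)

file <- system.file("dracula.epub", package = "epubr")

# Option 1: pass the file path; the EPUB is read from disk for this call
head_from_file <- epub_head(file)

# Option 2: read once with epub(), then reuse the data frame
x <- epub(file)
head_from_data <- epub_head(x)
epub_cat(x, max_paragraphs = 1)
```

The second form is the one to prefer when several summary calls are made on the same book, since the file is parsed only once.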

Some context around epub_cat

Also, notice that epub_cat operates on paragraphs, not on lines in the sense of sentences. Since line break information has not been stripped from the text that has been read in, it can be used. This may not mean the same thing in every section of every e-book, but the general idea is that epub_cat respects line breaks in the text. It pays attention to where they are and where they are not. I chose to call these paragraphs; it’s the label that is by far most often the correct one. But a one-line title or even the distinct lines of text on the copyright page of an e-book would all be treated the same way.
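
To make the idea of a paragraph concrete, here is a small base R sketch that splits a made-up string on its line breaks. The string is a hypothetical stand-in for one entry of the text column, and this is not necessarily how epub_cat is implemented internally.

```r
# A made-up section of text with embedded line breaks, standing in for
# one entry of the `text` column returned by epub()
section_text <- "CHAPTER I\n(Kept in shorthand.)\n3 May. Bistritz. Left Munich at 8:35 P. M."

# Split on line breaks; each piece is what this post calls a paragraph
paragraphs <- strsplit(section_text, "\n", fixed = TRUE)[[1]]

length(paragraphs)
#> [1] 3
paragraphs[1]
#> [1] "CHAPTER I"
```

Note that the title and the parenthetical each count as a paragraph here, just like the longer journal entry.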

Control ends there, of course. For example, you cannot stipulate that a title line should be excluded from indenting. Remember that the purpose of epubr is to bring in text for analysis, while preserving much (but not all) of its structure. That is, you want “only the text” to operate on with ease, but you also don’t want to be relegated to a single, gigantic character string (that may not even be in the correct order) that aggregates out potential variables that could be mapped to sections of text and their sequence. epubr is not intended to retain all the XML tags that define the appearance of the original document. The fundamental purpose of epubr is to strip that out entirely.

Here is an example:

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier   creator  title language subject date  source  data 
#>   <chr>   <chr>        <chr>    <chr> <chr>    <chr>   <chr> <chr>   <lis>
#> 1 Public~ http://www.~ Bram St~ Drac~ en       Horror~ 1995~ http:/~ <tib~

x$data[[1]]
#> # A tibble: 15 x 4
#>    section       text                                          nword nchar
#>    <chr>         <chr>                                         <int> <int>
#>  1 item6         "The Project Gutenberg EBook of Dracula, by ~ 11252 60972
#>  2 item7         "But I am not in heart to describe beauty, f~ 13740 71798
#>  3 item8         "\" 'Lucy, you are an honest-hearted girl, I~ 12356 65522
#>  4 item9         "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day~ 12042 62724
#>  5 item10        "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12599 66678
#>  6 item11        "Once again we went through that ghastly ope~ 11919 62949
#>  7 item12        "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septem~ 12003 62234
#>  8 item13        "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT~ 13812 72903
#>  9 item14        "\"Thus when we find the habitation of this ~ 13201 69779
#> 10 item15        "\"I see,\" I said. \"You want big things th~ 12706 66921
#> 11 item16        "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.~ 11818 61550
#> 12 item17        "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, ~ 12989 68564
#> 13 item18        " \nLater.-Dr. Van Helsing has returned. He ~  8356 43464
#> 14 item19        "End of the Project Gutenberg EBook of Dracu~  2669 18541
#> 15 coverpage-wr~ ""                                                0     0

epub_cat(x, max_paragraphs = 3, skip = 100)
#>   CHAPTER IJONATHAN HARKER'S JOURNAL
#> 
#>   (Kept in shorthand.)
#> 
#>   3 May. Bistritz.-Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule.

I suppose this function could be useful for cat-ing text externally to another display or document for some use case other than just quick inspection at the console. If so, I would love to hear what that purpose is so that I might be able to improve epub_cat or add some other functionality entirely. But in general I would consider this a tangential function in epubr. There are bigger fish to fry.

Encoding

The main change was my decision to no longer rely on native encoding (getOption("encoding")) when reading EPUB files. In my ignorance I thought there was no reason to fuss with this. However, I began noticing issues with improper character parsing, resulting in weird characters showing up in some places. My first thought was to substitute these odd strings of characters with what they were supposed to represent, e.g., an apostrophe. It did seem like a fairly contained problem, which is why I didn’t notice it sooner and no one else made any mention of it either. But this idea was missing the bigger picture (and it didn’t work well anyway).
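
For illustration, this kind of garbling can be reproduced in base R by decoding UTF-8 bytes as if they were in a single-byte encoding. Windows-1252 is assumed here purely as an example of a non-UTF-8 native encoding; this is a general sketch of the mechanism, not a reproduction of what epubr did internally.

```r
# A right single quotation mark (a typographic apostrophe), stored as UTF-8
apostrophe <- enc2utf8("\u2019")

# Its UTF-8 representation is three bytes: e2 80 99
bytes <- charToRaw(apostrophe)

# Decode those bytes as if they were Windows-1252 (an assumed example
# encoding); each byte is then treated as its own character
garbled <- iconv(apostrophe, from = "CP1252", to = "UTF-8")

nchar(apostrophe)  # one character when read as UTF-8
nchar(garbled)     # three characters when the bytes are misread
```

Patching such strings after the fact means chasing every affected character, whereas reading the bytes with the right encoding avoids the problem at the source.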

I did some poking around and learned that while EPUB titles from various publishers can give the impression that EPUB formatting is the Wild West, there is apparently some standardization at least in terms of encoding, such that EPUB files have UTF encoding in common: UTF-8 in most cases, but possibly UTF-16.

This led me to add an explicit encoding argument to epub, defaulting to UTF-8. This still allows the user to change it, though typically there should be no reason to do so, except possibly to change it to UTF-16, and I have no idea how common that even is.
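
A short sketch of the encoding argument in use, again assuming the bundled dracula.epub file. Since UTF-8 is the default, spelling it out should make no difference.

```r
library(epubr)

file <- system.file("dracula.epub", package = "epubr")

# Rely on the default encoding (UTF-8)
x1 <- epub(file)

# State the encoding explicitly; this should be equivalent to the call above
x2 <- epub(file, encoding = "UTF-8")

# The metadata should match between the two reads
identical(x1$title, x2$title)
```

For a file known to be UTF-16, the same argument would presumably be set to "UTF-16" instead, though that case is not exercised here.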

This had the result of clearing up every issue I was seeing with improper character translation. Even at this point I thought it was just something to do with the fact that apparently EPUB files are all UTF-encoded. I only found out more recently that the native encoding option in relation to the Windows environment has been a nightmare to other developers on a more fundamental level.

Anyhow, if you were seeing any weird characters after reading in some EPUB files with epubr, hopefully the situation is improved now. I don’t expect epubr to be perfect (there are some strangely put together e-books out there), but so far so good.

Upcoming version

Work is already underway on version 0.6.0. While 0.5.0 is more of an unsung hero, making changes and handling edge cases you are unlikely to notice, 0.6.0 will add some new user functions that enhance the ease with which you can restructure parsed e-book text that comes in looking like crap due to crappy e-book metadata (see the open source packaged EPUB file above as a good example of formatting sadness). The next version will help improve some situations like this in terms of section breaks and content sequence. Garbage in does not need to equal garbage out!
