How I resurrected my ancient PhD thesis using R/bookdown (and some other tools)

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers.]

I’ve long admired the look of publications generated using the R bookdown package, and thought it would be fun and educational to publish one myself. The problem is that I am not writing a book and have no plans to do so any time soon.

Then I remembered that I’ve already written a book. There it is on the right. It’s called “Cloning, sequence analysis and studies on the expression of the nirS gene, encoding cytochrome cd1 nitrite reductase, from Thiosphaera pantotropha”. Catchy title, hey. It’s from my former life, as a biochemistry graduate turned reluctant molecular microbiologist. I believe there are 3 printed copies in existence: mine, one for the lab and one deposited in the university library.

That’s simple enough then Neil, you say, you just grab your digital files, copy/paste into RMarkdown files, do a bit of editing and you’re set. Here’s the thing.

There are no digital files.

There were, once. A collection of documents: Word, Powerpoint and JPEGs. I think they lived on a 100 MB zip drive for a while. At some point they were burned onto a CD. And at some other point, that CD became corrupted. And that was that. Like many (most?) people, I’d barely looked at the thesis since depositing a copy in the library anyway. It didn’t seem to matter much.

And then I grew older, and started looking at some of the documents in our family, and realising that in the event of accident or disaster, they’d be lost forever. So I started working on ways to digitally archive some of them. At some point my thoughts turned to that thesis, which took 4 years of my life. I wondered whether the university library had digitised it and if so, whether it might be available online. So far as I can tell, the answer is no. That seemed a shame.

So here, briefly, is the story of how I used R/bookdown and some other tools to resurrect that thesis.

1. Scan the pages for optical character recognition

Step 1: photograph each page. A good camera phone is sufficient here, but the key to success is to keep the pages flat and minimise any distortion.
Next, create a Google Drive folder for the project, with a subfolder for each chapter, and upload the images (as JPEGs) to the appropriate subfolder.

2. Conversion to text

Did you know that Google Docs has OCR built in? Just right-click a JPEG file containing an image of text, choose “open in Google Docs” and you’ll get a document containing the original image plus the extracted text.

You don’t want to do that for each document one by one though. Fortunately, you can use a Google Apps script to automate the task. Right-click, choose “Google Apps Script”, paste in the following and save it as e.g. “JPEG to Docs”:

function convertJPGtoGoogleDocs() {
  // Requires the advanced Drive service (Drive API v2) to be enabled for this script.
  var srcfolderId = "XXX"; // ID of the folder containing the JPEGs
  var dstfolderId = "YYY"; // ID of the folder that will receive the Google Docs
  var files = DriveApp.getFolderById(srcfolderId).getFilesByType(MimeType.JPEG);
  while (files.hasNext()) {
    var file = files.next();
    // ocr: true asks Drive to run OCR and create a Google Doc from the image
    Drive.Files.insert({title: file.getName(), parents: [{id: dstfolderId}]}, file.getBlob(), {ocr: true});
  }
}

Replace XXX and YYY with the IDs of the folder containing the JPEGs and the output folder, respectively (look at the Google Drive URL: the folder ID is the last part, after “folders/”), then run the script. The maximum execution time is 5 minutes, which may be exceeded if there are a lot of JPEG files to process (more than 30 or so). In that case, move the processed files somewhere else, rerun the script until all are done, then bring all the Google Docs back together in the chapter folder.
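An alternative to hitting the time limit is to have the loop stop itself shortly before the deadline and report what is left over. The following is only a sketch of that idea in plain JavaScript, not the script used for the thesis: the function name processWithinBudget, its arguments and the injectable clock are all hypothetical, and in Apps Script the handler would wrap the Drive.Files.insert call shown above.

```javascript
// Hypothetical helper: handle items one by one until the time budget is spent.
// Returns the items that were handled and those left for the next run.
function processWithinBudget(items, handler, budgetMs, now) {
  now = now || Date.now;              // injectable clock (an assumption, useful for testing)
  var start = now();
  var done = [];
  var remaining = [];
  for (var i = 0; i < items.length; i++) {
    if (now() - start >= budgetMs) {
      remaining = items.slice(i);     // out of time: keep the rest for later
      break;
    }
    handler(items[i]);
    done.push(items[i]);
  }
  return { done: done, remaining: remaining };
}

// Example call with a 4-minute budget and a do-nothing handler.
var outcome = processWithinBudget(["page1.jpg", "page2.jpg"], function (name) {
  // in Apps Script, convert the file called `name` here
}, 4 * 60 * 1000);
```

The budget is checked before each item rather than during it, so it should be set comfortably below the real limit to leave room for one slow conversion.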

Wait, there’s more. You want to concatenate those Google Docs together into one file? Cue the next Google Apps script, which is somewhat longer:

function combineDocs() {
  // This function assumes only Google Docs files are in the root folder
  var folder = DriveApp.getFolderById('XXX');
  if (folder == null) { Logger.log("Failed to get root folder"); return; }
  var combinedTitle = "Combined Document Example";
  var combo = DocumentApp.create(combinedTitle);
  var comboBody = combo.getBody();
  var hdr = combo.addHeader();
  hdr.setText(combinedTitle);
    
  // merely iterating the files does not get them in alphabetical order.
  // So: sort them.
  var docList = folder.getFiles();
  var docArr  = [];
  while (docList.hasNext()) {
    var item = docList.next();
    var docName = item.getName();
    var docId   = item.getId();
    var doc     = {name: docName, id: docId};
    docArr.push(doc);
  }
  
  // return 0 for identical names so the comparator stays consistent
  docArr.sort(function(a, b) { return a.name < b.name ? -1 : (a.name > b.name ? 1 : 0); });
  
  // Now load the docs into the combo doc.
  // We can't load a doc in one big lump though;
  // we have to do it by looping through its elements and copying them
  for (var j = 0; j < docArr.length; j++) {
    var entryId = docArr[j].id;
    var entry = DocumentApp.openById(entryId);
    var entryBody = entry.getBody();
    var elems = entryBody.getNumChildren();
    for (var i = 0; i < elems; i++) {
      var elem = entryBody.getChild(i).copy();
      switch (elem.getType()) {
        case DocumentApp.ElementType.HORIZONTAL_RULE:
          comboBody.appendHorizontalRule();
          break;
        case DocumentApp.ElementType.INLINE_IMAGE:
          comboBody.appendImage(elem);
          break;
        case DocumentApp.ElementType.LIST_ITEM:
          comboBody.appendListItem(elem);
          break;
        case DocumentApp.ElementType.PAGE_BREAK:
          comboBody.appendPageBreak(elem);
          break;
        case DocumentApp.ElementType.PARAGRAPH:
          comboBody.appendParagraph(elem);
          break;
        case DocumentApp.ElementType.TABLE:
          comboBody.appendTable(elem);
          break;
        default:
          var style = {};
          style[DocumentApp.Attribute.BOLD] = true;
          comboBody.appendParagraph("Element type '" + elem.getType() + "' could not be merged.").setAttributes(style);
      }
    }
    // page break at the end of each entry.
    comboBody.appendPageBreak();
  }
}

Save as “MergeDocs”, edit XXX to be the ID of the folder containing your converted Google Docs, and run. This will generate a combined document in the root of your Google Drive; you can rename it and move it back to the appropriate folder.
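One thing to watch with the alphabetical sort in the merge script: if your page images have unpadded numbers in their names, “page10” sorts before “page2” and the pages end up out of order. A minimal sketch of a comparator that avoids this is below. The name compareNamesNaturally and the file names are made up for illustration, and it assumes the JavaScript runtime supports the numeric option of localeCompare.

```javascript
// Hypothetical comparator: digit runs are compared as numbers,
// so "page2.jpg" comes before "page10.jpg"; identical names compare as 0.
function compareNamesNaturally(a, b) {
  return a.localeCompare(b, undefined, { numeric: true });
}

var pageNames = ["page10.jpg", "page2.jpg", "page1.jpg"];
pageNames.sort(compareNamesNaturally);
// pageNames is now in page order: page1.jpg, page2.jpg, page10.jpg
```

In the merge script, where each entry is an object with a name field, the equivalent call would be docArr.sort(function(a, b) { return compareNamesNaturally(a.name, b.name); }).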

Repeat for each chapter, delete the images from the combined document if you desire and you’re done. We now have a Google Document with text for each chapter, ready to edit.

3. Photograph the figures

Figures require more care than text. A phone camera too close to the page will introduce distortion, even if you take care to hold the pages straight. Even lighting is also essential, and must be angled to avoid glare if you have any shiny photographic prints on the page – which I do, see the end of this post as to why.

I used a digital SLR on a tripod, pointing down at the thesis pages on a table, and connected remotely to the computer for taking each image. I shot RAW images, processed them to improve the brightness, contrast and sharpness, cropped, straightened and saved each image as e.g. “fig1-1.png”. That might be overkill for your purposes so do whatever works for you.

The figures will look like what they are – photographs of a page – but we can’t do much about that.

4. Prepare the bibliography

Here’s a fun fact about my thesis: I didn’t use a reference manager and just copy-pasted each citation and reference manually.

Boy, does it show. I reckon there’s at least an error per page in the first six pages alone of the Bibliography. Did I correct these in the version submitted to the library? I don’t recall.

So I decided to make amends for this sin by creating a proper bibliography file in BibTeX format. Essentially this involved manually searching PubMed (and a few other sources) for each reference in my bibliography Google Doc and copy-pasting the identifier into Zotero. That was a few evenings of work, made much easier by Zotero. I wish it had been available the first time around.

Zotero exports to BibTeX format, giving us a file “thesis2021.bib”.

5. R/thesisdown

We have chapter text, figures and a bibliography. It’s time to get editing in R/bookdown.

Well strictly speaking R/thesisdown, which is essentially a modified version of bookdown tailored for thesis publication. I installed the package from Github and followed the instructions to create a new project in RStudio named “thesis2021”. The file and directory hierarchy looks like this:

thesis2021/
├── 01-chap01.Rmd
├── 02-chap02.Rmd
├── 03-chap03.Rmd
├── 04-chap04.Rmd
├── 05-chap05.Rmd
├── 06_chap06.Rmd
├── 07_chap07.Rmd
├── 08_chap08.Rmd
├── 09-appendix.Rmd
├── 99-references.Rmd
├── _bookdown.yml
├── _bookdown_files/
├── bib/
├── chemarr.sty
├── csl/
├── data/
├── docs/
├── figure/
├── index.Rmd
├── prelims/
├── reedthesis.cls
├── style.css
├── template.tex
└── thesis2021.Rproj

Some notes on all of that:

  • There’s an Rmd file for each chapter, appendices and bibliography (the latter generated automatically when we knit index.Rmd)
  • The BibTeX bibliography goes in bib/, citation styles can be obtained from Zotero and go in csl/
  • The PNG figures go in figure/ but you may not want to add that to git, as they are duplicated in docs/figure when knitted
  • The published files are normally generated in _book/ but I’ve edited the config to use docs/ instead (see later for why)
  • Any preliminary matter – abstract, acknowledgements and so on – goes in Rmd files in prelims/

I was only interested in generating an online, HTML version of the thesis – no PDF, Word or ePub. So the YAML header for my index.Rmd contains:

output:
  thesisdown::thesis_gitbook:
    css: style.css

and the _bookdown.yml configuration looks like:

book_filename: "thesis"
chapter_name: "Chapter "
delete_merged_file: true
output_dir: "docs"

It’s all committed to a Github repository so for more details you can go and look in there.

And now we edit. Scientific theses contain a lot of special formatting. In my field gene names and restriction enzymes, for example, are italicised. There are subscripts, superscripts and Greek letters – lots of alphas, betas, gammas and deltas. There are chemical formulae and equations. All of which is admirably handled using the MathJax library.

The nice thing is that MathJax even works in figure and table captions. Here’s a particularly fiendish example from my thesis:

```{r table11, echo=FALSE, results='asis'}
table1_1 <- data.frame(Reaction = c("$\\mathrm{2NO_3^- + 4H^+ + 4e^-}$ $\\rightarrow$ $\\mathrm{2NO_2^- + 2H_2O}$", "$\\mathrm{2NO_2^- + 2H^+ + 2e^-}$ $\\rightarrow$ $\\mathrm{2NO + 2H_2O}$", "$\\mathrm{2NO + 2H^+ + 2e^-}$ $\\rightarrow$ $\\mathrm{N_2O + H_2O}$", "$\\mathrm{N_2O + 2H^+ + 2e^-}$ $\\rightarrow$ $\\mathrm{N_2 + H_2O}$", "**Overall**", "$\\mathrm{2NO_3^- + 2H^+ + 5H_2}$ $\\rightarrow$ $\\mathrm{N_2 + 6H_2O}$", "$\\mathrm{O_2 + 4H^+ + 4e^-}$ $\\rightarrow$ $\\mathrm{2H_2O}$", "$\\mathrm{NAD^+ + 2H^+ + 2e^-}$ $\\rightarrow$ $\\mathrm{NADH + H^+}$"),
                       Couple = c("NO$_3^-$/NO$_2^-$", "NO$_2^-$/NO", "NO/N$_2$O", "N$_2$O/N$_2$",
                                  "", "", "O$_2$/H$_2$O", "NAD$^+$/NADH"),
                       `$\\Delta$_E_$^{o'}$ (V)` = c("+0.420", "+0.375", "+1.175", "+1.355",
                                                     "", "", "+0.800", "-0.320"),
                       `$\\Delta$_G_$^{o'}$ (kJ mol$^{-1}$) per reaction` = c("-285.6", "-134.1", 
                                                                              "-288.5", "-323.3", 
                                                                              "", "-1031.5", 
                                                                              "-438.5", ""),
                       `$\\Delta$_G_$^o$ (kJ mol$^{-1}$) per NADH` = c("-142.8", "-134.1", 
                                                                       "-288.5", "-323.3", 
                                                                       "", "-206.3", 
                                                                       "-219.2", ""),
                        check.names = FALSE)

kable(table1_1,
      caption = "Reduction potentials and thermodynamics of the reactions of denitrification compared with those of the oxygen/water couple, assuming NADH as the reductant.")
```

Horrid to type, but it gets the job done.

Table 1.1

A figure caption might look like:

```{r fig6-6, fig.cap="Construction of the suicide plasmid pRVS$\\Delta$nir. Only the relevant restriction fragments are shown. Full details of the plasmids are given in Table 6.1. Plasmid pRVS$\\Delta$nir was used to construct the _nirS_ deletion mutants $\\Delta5$, $\\Delta7$, $\\Delta8$. Enzymes: B, _BamHI_, E, _EcoRI_, H, _HindIII_, N, _NotI_, S, _SalI_.", fig.align='center', echo=FALSE, out.width='75%'}
include_graphics(path = "figure/fig6-6.png")
```

Note that inside R strings, including chunk options such as fig.cap, each backslash must itself be escaped: write \\Delta so that MathJax receives \Delta. Syntax highlighting for markdown and MathJax doesn’t display when editing captions in RStudio, so take care.

You can also cite references in captions, just as you would in the main text using for example [@van_spanning_genes_1991], where van_spanning_genes_1991 is the reference key in your BibTeX file.

6. Github and publishing

So after many cycles of editing, knitting the index.Rmd file and proof-reading the preview, I committed the project to Github.

Publishing is easy. Having edited _bookdown.yml to have the HTML output written to docs/, you just edit your Github Pages settings to use docs/ on the master (or main) branch – and your work is published. And so I give you the 2021 version of:

Cloning, sequence analysis and studies on the expression of the nirS gene, encoding cytochrome cd1 nitrite reductase, from Thiosphaera pantotropha

It’s a work in progress. Some of the references are still not quite right, I’m not happy with the citation style and the figures need some tweaks. And you know what – that’s OK. Unlike the first time around, with the stress of producing an error-free (or not as it turned out) print copy, I now have version control. So I can tweak it here and there, as and when I like, push to Github and automatically get the updated version. Hey if you were really keen, you could even submit a Github issue if you find a typo 🙂

My research is long out of date now of course, and probably of little use to anyone. But if you do want it – you can get it, unlike before. And perhaps this post will give you some ideas if you want to digitise your own printed publication.

Let’s just hope the university don’t get in touch to tell me that self-publishing is prohibited.


To finish, as promised earlier here are some more fun facts about my thesis. Going through this digitisation process caused a whole lot of long-buried memories to resurface. Some not so good – doctoral research can be a painful process for many people, and I was no exception – but some good too – notably, the many moments of kindness and help from other lab members, as I struggled my way to the end.

The thesis was written on my first PC, which I believe was a 486 purchased from Gateway. It ran Windows 3.1. The Word version was probably 6, and the figures were drawn in Powerpoint.

It looks very thick, because it had to be printed double-spaced and single-sided. I also invented a bizarre and ridiculous page numbering system with a “T” suffix for tables and “F” for figures, which made an accurate page count impossible.

Scanned images were still frowned upon in those days so it contains actual photographic prints, glued in. An exception was made for images of agarose gels.

I forget the specs of the PC, but I’m pretty sure the memory was measured in MB not GB. This meant that Word could not handle a complete version of the document, or even a single chapter containing image files, without crashing. To include the figures I had to insert the image, draw lines around the borders, remove it and save the document. Then to print I’d put the image back, remove the borders, print the page, then remove the image again before saving.

I had a good friend one year ahead of me who, unlike me, went on to do great things in science (hi Pamela!). I recall seeing a draft of her thesis, beautifully typeset, full of equations and thinking “how did she do that?” Years later of course, I realised that the answer was LaTeX. I should have asked her about reference management.
