RObservations #31: Using the magick and tesseract packages to examine asterisks within the Noam Elimelech

[This article was first published on r – bensstats, and kindly contributed to R-bloggers.]

Introduction

Since my last blog on Tesseract-OCR I have been casually playing around with it to see what it is capable of. Tesseract supports optical character recognition for over 100 languages. That, together with its straightforward usage in R, inspired me to try it on Hebrew text.

The last time I publicly explored anything to do with the Hebrew language and letters was when I wrote an R package for calculating Hebrew Gemmatrias. While it has remained untouched for years now, it's still usable and you can check it out on my Github here.

In this blog I explore two pages of the Noam Elimelech and examine the words preceding each asterisk. For context, the Noam Elimelech is a collection of teachings by the 18th century Rabbi, Rabbi Elimelech of Lizhensk ztvk”l zy”a. There are a number of asterisks placed across the text, and the reason for their placement is largely unexplained. While it is beyond the scope of this blog to go too deep into the specifics, I share how I extracted the text which preceded each asterisk.

The text I used can be accessed here. While the methods highlighted here can be extended to the entire text, this blog is just for proof of concept. As such I limit the scope to two pages of the text.

The Code

Since the text that I’m using has two columns per page, each page needs to be cropped by column before OCR is applied. Prior to that, the .pdf files need to be converted to .png format. The workflow is thus:

  1. Convert the .pdf file to .png format (pdftools::pdf_convert()).
  2. Read the created .png file and crop it (magick::image_read() and magick::image_crop()).
  3. Employ Tesseract-OCR to extract the text (tesseract::ocr()).
    (While there are functions in the magick package that accomplish this, I found that its Tesseract-OCR wrapper did not fare as well as using the `tesseract` package directly. I thus used the magick package for cropping the text area and tesseract for the OCR work.)
  4. Clean the text and extract the words before each asterisk using regular expressions.
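
Step 3 assumes that the Hebrew language data is already installed for Tesseract. A minimal sketch for checking this is below. It relies on the `tesseract` package's `tesseract_info()` and `tesseract_download()` functions; the helper name `ensure_language` is hypothetical and not part of the original workflow, and the download step needs an internet connection.

```r
library(tesseract)

# Hypothetical helper (not from the original post): download the training
# data for `lang` only when it is not already installed.
ensure_language <- function(lang) {
  if (!lang %in% tesseract_info()$available) {
    tesseract_download(lang)  # needs internet access
  }
  # Report (invisibly) whether the language is available afterwards
  invisible(lang %in% tesseract_info()$available)
}

# ensure_language("heb")   # run once before calling tesseract("heb")
```

With the language data in place, the full workflow is: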

library(tidyverse)
library(magick)
library(tesseract)

# Convert each PDF to PNG at 1000 dpi; pdf_convert() returns the names of the files it writes
noamElimelech <- c("NoamElimelech_Bechukosai_1.pdf",
                   "NoamElimelech_Bechukosai_2.pdf") %>%
  sapply(function(x) pdftools::pdf_convert(x, dpi = 1000)) %>%
  unname()

# Hebrew OCR engine, created once (needs the "heb" training data installed)
heb <- tesseract("heb")

# Crop one column out of a page image and OCR it, returning one element per line.
# `geometry` is a magick geometry string: "<width>x<height>+<x offset>+<y offset>"
ocr_column <- function(png, geometry) {
  png %>%
    image_read() %>%
    image_crop(geometry) %>%
    ocr(engine = heb) %>%
    str_split("\\n") %>%
    unlist()
}

noamElimelech_Left_1  <- ocr_column(noamElimelech[1], "0x11243+4050+1600")
noamElimelech_Left_2  <- ocr_column(noamElimelech[2], "0x11310+4300+1300")
noamElimelech_Right_1 <- ocr_column(noamElimelech[1], "4050x8800+0+1200")
noamElimelech_Right_2 <- ocr_column(noamElimelech[2], "4400x2500+0+1200")
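
The crop strings above follow magick's geometry syntax, "<width>x<height>+<x offset>+<y offset>". As a rough sketch of how they behave, the example below crops a blank canvas instead of a scanned page; the canvas size and the names `page`, `region_a` and `region_b` are illustrative assumptions. A width of 0 leaves the width unspecified, which ImageMagick appears to treat as "up to the image edge", so printing the dimensions is a quick way to confirm what your version does.

```r
library(magick)

# Blank 1000 x 800 canvas standing in for a page image (illustrative size)
page <- image_blank(width = 1000, height = 800, color = "white")

# Geometry: "<width>x<height>+<x offset>+<y offset>", offsets from the top-left corner
region_a <- image_crop(page, "400x300+0+100")   # explicit width and height
region_b <- image_crop(page, "0x300+500+100")   # width 0, as in two of the crops above

# Inspect the resulting sizes in pixels
image_info(region_a)[, c("width", "height")]
image_info(region_b)[, c("width", "height")]
```

Checking the cropped dimensions (or viewing the cropped image) before running OCR is cheap compared with running Tesseract on a 1000 dpi page.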


For the regular expressions, I looked up how to match Hebrew text and learned that Hebrew characters can be matched with the Unicode range \u0590-\u05fe, which covers almost all of the Hebrew Unicode block, U+0590 to U+05FF (see here). Additionally, to deal with the apostrophes and quotation marks that are common in abbreviations, I replaced double quotes with apostrophes and allowed apostrophes inside words when extracting them.
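
As a small illustration of these two ideas, the sketch below runs them on a made-up string (not text from the Noam Elimelech). The Hebrew letters are written as \u escapes so the example does not depend on the file encoding; the names `toy`, `toy_clean`, `hebrew_word` and `hebrew_words` are placeholders.

```r
library(stringr)

# Toy string: a Hebrew word, a Hebrew abbreviation written with a double
# quote, and an English word
toy <- "\u05e9\u05dc\u05d5\u05dd \u05d1\u05e2\"\u05e4 hello"

# Normalise double quotes to apostrophes
toy_clean <- str_replace_all(toy, '"', "'")

# One or more characters that are either in the Hebrew range or an apostrophe
hebrew_word <- "[\\u0590-\\u05fe']+"

# Keeps the two Hebrew tokens and skips the English word
hebrew_words <- str_extract_all(toy_clean, hebrew_word)[[1]]
hebrew_words
```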

# Combine the OCR'd columns into a single string and
# normalise double quotes to apostrophes
noamElimelech_text <- c(noamElimelech_Left_1,
                        noamElimelech_Right_1,
                        noamElimelech_Left_2,
                        noamElimelech_Right_2) %>%
  paste(collapse = "") %>%
  str_replace_all('"', "'")
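
One thing worth knowing about this step: `paste(collapse = "")` joins the lines with no separator, so the last word of one line is glued to the first word of the next. This may contribute to some of the merged words in the output, although that is a guess rather than something checked against the scans. A minimal sketch with placeholder lines (not real OCR output):

```r
# Two toy "OCR lines" (placeholders)
ocr_lines <- c("aleph bet", "gimel * dalet")

joined_no_sep <- paste(ocr_lines, collapse = "")   # what the code above does
joined_space  <- paste(ocr_lines, collapse = " ")  # alternative: keep a space at line breaks

joined_no_sep   # "aleph betgimel * dalet"
joined_space    # "aleph bet gimel * dalet"
```

Switching to `collapse = " "` would change which two-word matches are found, so the results shown in this post correspond to `collapse = ""`.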

The regular expression I use extracts the two words preceding each asterisk, which gives better context. I will have to spend some more time learning about text analysis if I want to take this blog beyond demonstrating text extraction.

There are some spaces skipped and letters misread by tesseract, but nevertheless the result is interesting.

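words_before_asterisks <- noamElimelech_text %>%
  # two words of Hebrew-range characters or apostrophes, followed by " * "
  str_extract_all("([\\u0590-\\u05fe[']]{1,} [\\u0590-\\u05fe[']]{1,})( \\* )") %>%
  unlist() %>%
  # drop the asterisk and the surrounding whitespace
  str_remove_all("\\*") %>%
  trimws()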

words_before_asterisks


[1] "במה פעמים"        "טובה גדולה"       "והצדיק מבטל"      "כן יקום"         
 [5] "ממילא בטל"        "של בע'פ"          "להפכם לרחמים"     "להשפיע לכם"      
 [9] "ויבולהמלשון יובל" "בל' נסתר"         "בעבודה כלל"       "שבולכם פה"       
[13] "והתחברות יחד"     "הוא להיפך"        "ברצונם ותשוקתם"   "להפכם לרחמים"    
[17] "של אדם"           "רע ונהפכולרחמים"  "דהיינו ממטהלמעל'" "על אחרים"        
[21] "תשוב' שלימ'"     
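
To see what the pattern does in isolation, here is a small sketch on a made-up string (the Hebrew letters are arbitrary placeholders written as \u escapes, not text from the Noam Elimelech). Because the pattern requires a space on both sides of the asterisk, an asterisk glued to a word or sitting at the very end of the string would not be matched.

```r
library(stringr)

# Toy text: seven two-letter "words" with an asterisk after the 3rd and the 6th
toy_text <- paste(
  "\u05d0\u05d1 \u05d2\u05d3 \u05d4\u05d5 *",
  "\u05d6\u05d7 \u05d8\u05d9 \u05db\u05dc *",
  "\u05de\u05e0"
)

# Same pattern as in the post: two Hebrew-range words followed by " * "
pattern <- "([\\u0590-\\u05fe[']]{1,} [\\u0590-\\u05fe[']]{1,})( \\* )"

toy_matches <- str_extract_all(toy_text, pattern)[[1]]
toy_matches <- trimws(str_remove_all(toy_matches, "\\*"))

# Each element holds the two words immediately before an asterisk
toy_matches
```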

Conclusion

It's very cool to see how well Tesseract-OCR works. While some characters were misclassified and some spaces were missed, I still managed to get the text before the asterisks. These findings may hold some hints as to why the asterisks are in the places they are, but it's beyond the scope of this blog and my present qualifications to explain any of that!

If you know of available training data which fares better for character recognition than what I used, please let me know!

Thank you for reading this blog!
