Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:
Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.
We have now also updated the R package tesseract to ship with the new Tesseract 4 on macOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate than before, even without manually enhancing the image quality.
The binary R package for Windows and macOS can be installed directly from CRAN with install.packages("tesseract").
Tesseract 4 uses a new training data format, so if you previously installed custom training data, you may need to download it again:
```r
# If you want to OCR French text:
library(tesseract)
tesseract_download('fra')
```
Training data for eng (the default) is already included with the R package.
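To check which engine version and which languages are actually installed, the package provides tesseract_info(). The snippet below is a minimal sketch: the helper name missing_languages() and the choice of "eng" and "fra" are assumptions made for illustration, and nothing is downloaded automatically.

```r
# Hypothetical helper: which of the wanted languages are not installed yet?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# Only query the engine if the tesseract package is installed.
if (requireNamespace("tesseract", quietly = TRUE)) {
  info <- tesseract::tesseract_info()
  cat("Tesseract engine version:", info$version, "\n")
  cat("Installed languages:", paste(info$available, collapse = ", "), "\n")

  # "eng" and "fra" are example languages; report anything still missing.
  for (lang in missing_languages(c("eng", "fra"), info$available)) {
    message("Not installed yet: ", lang, " (see tesseract_download())")
  }
}
```

Keeping the comparison in a small pure function makes it easy to reuse with any list of languages.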
Let’s take a simple example from last month’s blog post about OCR’ing bird drawings from the natural history collection. We use the magick package to preprocess the image (crop the area of interest). The image_ocr() function is a magick wrapper around the OCR provided by the tesseract package:
```r
library(magick)
image_read("https://jeroen.github.io/images/birds.jpg") %>%
  image_crop('1200x300+100+1700') %>%
  image_ocr() %>%
  cat()
```
H Grenveld. del Watherby & 0° CLIMACTERIS PICUMNUS. (BROWN TREE CREEPER).
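The plain text output does not show how sure the engine was about each word. A rough sketch of how one might inspect that uses ocr_data() from the tesseract package, which returns one row per word with a confidence score. The helper names filter_uncertain() and uncertain_words() are hypothetical, and the threshold of 80 is an arbitrary illustration rather than a recommended value.

```r
# Hypothetical helper: keep only the rows whose confidence is below a threshold.
filter_uncertain <- function(words, threshold = 80) {
  words[words$confidence < threshold, , drop = FALSE]
}

# Hypothetical helper: OCR a magick image and return the low-confidence words.
uncertain_words <- function(image, threshold = 80) {
  # Write the image to a temporary PNG so it can be passed as a file path.
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp))
  magick::image_write(image, tmp, format = "png")

  # ocr_data() returns a data frame with word, confidence and bbox columns.
  words <- tesseract::ocr_data(tmp, engine = tesseract::tesseract("eng"))
  filter_uncertain(words, threshold)
}
```

For the crop used above, this could be called as `image_read("https://jeroen.github.io/images/birds.jpg") %>% image_crop('1200x300+100+1700') %>% uncertain_words()`.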
Tesseract has perfectly detected the hand-written species name (both the Latin and English name), and has also found and nearly perfectly predicted the tiny author names. These results would be a very good basis for post-processing and automatic classification. For example, we could match these results against known species and authors, as explained in the original blog post.
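As a rough illustration of such matching (not the approach from the original post), the sketch below compares the species portion of the OCR output against a small reference list using approximate string distance from base R. The known_species vector is a hypothetical stand-in for a real taxonomy list.

```r
# Species portion of the OCR output shown above.
ocr_text <- "CLIMACTERIS PICUMNUS. (BROWN TREE CREEPER)."

# Hypothetical reference list; a real one would come from a taxonomy database.
known_species <- c(
  "Cormobates leucophaea",
  "Climacteris erythrops",
  "Climacteris picumnus"
)

# Edit distance from each known name to its closest substring of the OCR text.
# partial = TRUE allows substring matches; ignore.case handles the capitals.
distances <- adist(known_species, ocr_text, partial = TRUE, ignore.case = TRUE)[, 1]

# The name with the smallest distance is the most plausible candidate.
best_match <- known_species[which.min(distances)]
cat("Best match:", best_match, "(edit distance", min(distances), ")\n")
```

Because the comparison is approximate, a name with a few misread characters would still end up closest to the right entry, which is useful for the small author names that OCR gets only nearly right.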