Tesseract Update: Options and Languages

Posted on December 8, 2016 by Jeroen Ooms in R bloggers

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers.]

A few weeks ago we announced the first release of the tesseract package: a high-quality OCR engine in R. We have now released an update with extra features.

Installing Training Data

As explained in the first post, the tesseract system is powered by language-specific training data. By default, only English training data is installed. Version 1.3 adds utilities that make it easier to install additional training data.

library(tesseract)

# Download French training data
tesseract_download("fra")

Note that this function is not needed on Linux, where you should install training data via your system package manager instead. For example, on Debian/Ubuntu:

sudo apt-get install tesseract-ocr-fra

On Fedora/CentOS, use:

sudo yum install tesseract-langpack-fra

Use tesseract_info() to see which training data are currently installed.
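
This check can also be scripted. The following is a minimal sketch, assuming that tesseract_info() returns a list whose `available` element names the installed languages. The helper `has_language` is hypothetical and not part of the package, and "french.png" is a placeholder path.

```r
library(tesseract)

# Hypothetical helper (not part of the tesseract package): TRUE when
# training data for `lang` is listed as installed.
has_language <- function(lang) {
  lang %in% tesseract_info()$available
}

# English ships by default, so this is expected to be TRUE on a standard install
has_language("eng")

# Build a French engine only when the training data is present. If it is
# missing, install it first with tesseract_download() or the system package
# manager, as described above.
if (has_language("fra")) {
  french <- tesseract("fra")
  # "french.png" is a placeholder for an image containing French text
  # cat(ocr("french.png", engine = french))
}
```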

OCR Engine Parameters

Tesseract supports many parameters to fine-tune the OCR engine. For example, you can limit the set of characters that can be recognized.

# Create an engine that only recognizes digits
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

# Run OCR on the image with the restricted engine
ocr("image.png", engine = engine)

In the example above, Tesseract will only consider numeric characters. If you know in advance that the data is numeric (for example, an accounting spreadsheet), such options can tremendously improve accuracy.
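
Digits-only output still comes back as a character string, so it usually needs a small parsing step before it can be analyzed. Here is a rough sketch in base R. The helper `extract_numbers` is hypothetical, and "image.png" is the same placeholder file used above.

```r
# Hypothetical helper: pull every run of digits out of OCR text
extract_numbers <- function(text) {
  matches <- regmatches(text, gregexpr("[0-9]+", text))
  as.numeric(unlist(matches))
}

# Works on any character string, for example simulated OCR output
extract_numbers("120 45\n9000\n")  # returns 120, 45, 9000

# With a real image (only runs if the package and the file are available)
if (requireNamespace("tesseract", quietly = TRUE) && file.exists("image.png")) {
  engine <- tesseract::tesseract(
    options = list(tessedit_char_whitelist = "0123456789")
  )
  extract_numbers(tesseract::ocr("image.png", engine = engine))
}
```

Keeping the parsing separate from the OCR call makes it easy to test the cleanup logic on plain strings without an image at hand.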

Magick Images

Tesseract now automatically recognizes images from the awesome magick package (our R wrapper to ImageMagick). This can be useful for preprocessing images before feeding them to tesseract.

library(magick)
library(tesseract)
image <- image_read("http://jeroenooms.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")
cat(ocr(image))
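
Cropping is only one possible preprocessing step. The sketch below chains a few other common cleanup operations. It assumes the function names and arguments of recent magick releases (image_resize, image_convert with `type`, image_trim with `fuzz`), which may differ from the version that was current when this post was written. The built-in "logo:" image is only a stand-in input.

```r
library(magick)
library(tesseract)

# "logo:" is an image bundled with ImageMagick, used here as a stand-in input
input <- image_read("logo:")

# Typical cleanup steps before OCR
prepped <- image_resize(input, "2000x")                # scale to 2000 px wide
prepped <- image_convert(prepped, type = "Grayscale")  # drop colour information
prepped <- image_trim(prepped, fuzz = 40)              # crop near-uniform borders

# The magick image can be passed to ocr() directly
cat(ocr(prepped))
```

Which steps help depends on the source image, so it is worth comparing the OCR output with and without each one.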

We plan to add more integration between Magick and Tesseract in future versions.
