Tesseract Update: Options and Languages

December 8, 2016

(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

A few weeks ago we announced the first release of the tesseract package: a high-quality OCR engine in R. We have now released an update with extra features.

Installing Training Data

As explained in the first post, the tesseract system is powered by language-specific training data. By default, only English training data is installed. Version 1.3 adds utilities that make it easier to install additional training data.

# Download French training data
tesseract_download("fra")

Note that tesseract_download() is not needed on Linux. There you should install training data via your system package manager instead. For example, on Debian/Ubuntu:

sudo apt-get install tesseract-ocr-fra

And on Fedora/CentOS you use:

sudo yum install tesseract-langpack-fra

Use tesseract_info() to see which training data are currently installed.
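As a rough sketch of how these pieces fit together, the snippet below checks which languages are installed and downloads French only when it is missing. It assumes that tesseract_info() returns a list with an `available` element (check this against your installed version), and the Linux check via Sys.info() is an illustrative addition, not something the package requires.

```r
library(tesseract)

# List the languages whose training data is currently installed
info <- tesseract_info()
print(info$available)

# Download French only if it is missing; skipped on Linux, where the
# system package manager is expected to provide the data (assumption)
is_linux <- identical(Sys.info()[["sysname"]], "Linux")
if (!"fra" %in% info$available && !is_linux) {
  tesseract_download("fra")
}

# Create a French engine only when the training data is really there
if ("fra" %in% tesseract_info()$available) {
  french <- tesseract(language = "fra")
}
```

The resulting `french` object can then be passed to ocr() through its `engine` argument.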

OCR Engine Parameters

Tesseract supports many parameters to fine tune the OCR engine. For example you can limit the possible characters that can be recognized.

library(tesseract)
# Restrict recognition to the digits 0-9
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr("image.png", engine = engine)

In the example above, Tesseract will only consider numeric characters. If you know in advance that the data is numeric (for example, an accounting spreadsheet), such options can greatly improve accuracy.
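Once the output is restricted to digits, it is easy to turn it into numbers. The sketch below is one possible way to do that: `parse_digits()` is a hypothetical helper written for this illustration (it is not part of the tesseract package), and "image.png" is the same placeholder file name as above.

```r
library(tesseract)

# Hypothetical helper (not part of the tesseract package): split OCR output
# on whitespace and convert each token to a number
parse_digits <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  as.numeric(tokens)
}

# Engine restricted to digits only
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

# "image.png" is a placeholder; only run OCR if the file actually exists
if (file.exists("image.png")) {
  values <- parse_digits(ocr("image.png", engine = engine))
  print(values)
}
```

Splitting on whitespace assumes that the numbers in the image are separated by spaces or line breaks; a table with decimal points or thousands separators would need those characters in the whitelist and a more careful parser.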

Magick Images

Tesseract now automatically recognizes images from the awesome magick package (our R wrapper to ImageMagick). This is useful for preprocessing images before feeding them to tesseract.

library(magick)
image <- image_read("http://jeroenooms.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")
cat(tesseract::ocr(image))
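Cropping is only one kind of preprocessing. The following is a minimal sketch of a slightly longer pipeline, assuming a recent version of magick in which image_convert() accepts a `type` argument. The grayscale conversion and the 200% upscaling are illustrative choices, not recommendations from the package, and whether they help depends on the image.

```r
library(magick)
library(tesseract)

# Read the example image and crop to the region that contains text
image <- image_read("http://jeroenooms.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")

# Illustrative extra steps: drop color information and upscale the crop
image <- image_convert(image, type = "Grayscale")
image <- image_resize(image, "200%")

# Pass the magick image straight to tesseract
text <- ocr(image)
cat(text)
```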

We plan to add more integration between magick and tesseract in future versions.
