[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this new post, we’re taking a break from modern birding data in our birder’s series… let’s explore gorgeous drawings from a natural history collection! Armed with rOpenSci’s packages binding powerful C++ libraries and open taxonomy data, how much information can we automatically extract from images? Maybe not much, but we’ll at least have explored image manipulation, optical character recognition (OCR), language detection, taxonomic name resolution with rOpenSci’s packages.

### Free natural history images and appropriate R tooling!

A long time ago I had bookmarked the Flickr account of the Biodiversity Heritage Library (BHL). So many beautiful images of biodiversity, moreover free to use! In particular, I downloaded all pictures from one of the Birds of Australia albums.

I wanted to try to extract the bird names from images using packages of Jeroen Ooms’, rOpenSci’s post-doc hacker & C(++)-bindings wizard. For that I worked with magick for image manipulation, tesseract for optical character recognition (OCR), cld2/cld3 for language detection… Quite the armory! We’ll also sprinkle some taxonomy magic by Scott Chamberlain, one of rOpenSci’s co-founders, to resolve the names extracted.

### OCR bird naming workflow, piece by piece

In this section, we explain the different elements of our R workflow: preparing images, extracting text, resolving taxonomic names.

#### Image preparation

I saved the pictures locally in a “birds” folder. Yes, I click-buttoned instead of using the Flickr API for which e.g. Jim Hester wrote a minimal R wrapper… I don’t do everything with R scripts (yet).

library("magrittr")
filenames <- fs::dir_ls("birds")


Each image shows a bird and its name. Images are either landscape- or portrait-oriented, but for the sake of simplicity, I’ll act as if they were all portrait-oriented. A possible easy and lazy fix here would be to duplicate images rotated (magick::image_rotate) in all possible directions and then to apply the workflow to all 4 versions of each image, hoping to get good data from one of the rotated versions.

magick::image_read(filenames[1])


From that image, I wanted to extract the name indicated below the bird. To maximize the efficiency of OCR, I shall first prepare the image, since the accuracy of OCR depends on the quality of the input image which can be influenced a bit. This part could be tweaked even more, and in real life examples it’ll be worth spending time trying different magick functions and parameter values. Since I have in mind the case of a bunch of images to be batch-processed, nothing is done by hand.

crop_bird <- function(filename){

results$cld3 %in% c("la", "en"),] }else{ return(NULL) } } (results1 <- filenames[1] %>% magick::image_read() %>% get_names()) ## NULL  Nothing! Now if we replace magick::image_read with the previously defined crop_bird function that crops and cleans the image… (results2 <- filenames[1] %>% crop_bird() %>% get_names()) ## # A tibble: 2 x 4 ## text cld2 cld3 filename ## ## 1 climacteris picumnus la birds/n115_w1150_42399797481_o.jpg ## 2 brown tree creeper en birds/n115_w1150_42399797481_o.jpg  We get a result! So we see that the image transformation was quite useful. Now, these names look fine, but how to be sure they’re actually taxonomic names? #### Taxonomic name resolution The taxize package by Scott Chamberlain, is a taxonomic toolbelt for R, providing access to many fantastic data sources and tools for taxonomy. One of them, the Global Name Resolver, provides, well, resolution of taxonomic names, sadly not common names. taxize::gnr_resolve has many options, of which only one is used below: best_match_only = TRUE means it’ll only return the best match from the different data sources. latin <- results2$text[results2$cld2 == "la"| results2$cld3 == "la"]
taxize::gnr_resolve(latin,
best_match_only = TRUE)

## # A tibble: 1 x 5
##   user_supplied_name submitted_name  matched_name   data_source_tit~ score
## *
## 1 climacteris picum~ Climacteris pi~ Climacteris p~ NCBI             0.988


English names could be cleaned up a bit using the spelling package, also an rOpenSci package, since spelling::spell_check_text would output possible typos.

### OCR bird naming workflow in action!

First the two steps image processing and OCR are applied to all images.

bird_names <- purrr::map(filenames, crop_bird) %>%
purrr::map_df(get_names)


Out of 51 images only 17 are present in the final table with possible names which is a bit disheartening, but one could surely do better in the image processing and OCR tweaking steps! Maybe one could frame the parameter search as a machine learning problem. Please also keep in mind that such natural history images are quite hard to parse.

The name resolution information can be added to this table.

safe_resolve <- function(text){

results <- taxize::gnr_resolve(text,
best_match_only = TRUE)

if(nrow(results) == 0){
list(NULL)
}else{
list(results)
}
}

bird_names <- dplyr::group_by(bird_names, text) %>%
dplyr::mutate(gnr = ifelse(cld2 == "la" | cld3 == "la",
safe_resolve(text),
list(NULL)))


We do not get much resolution, but we knew the names weren’t very good to start with. A better (untested here!) idea here might be to get a full list of names of Australian birds, potentially leveraging the taxizedb package by Scott Chamberlain, and to then fuzzy-match them with the names we have.

unique(bird_names$gnr) ## [[1]] ## # A tibble: 1 x 5 ## user_supplied_name submitted_name matched_name data_source_tit~ score ## * ## 1 climacteris picum~ Climacteris pi~ Climacteris p~ NCBI 0.988 ## ## [[2]] ## [1] NA ## ## [[3]] ## NULL ## ## [[4]] ## # A tibble: 1 x 5 ## user_supplied_na~ submitted_name matched_name data_source_title score ## * ## 1 austrodicaeum ii~ Austrodicaeum ~ Austrodicaeu~ The Interim Regis~ 0.75 ## ## [[5]] ## # A tibble: 1 x 5 ## user_supplied_name submitted_name matched_name data_source_tit~ score ## * ## 1 melithreptus laet~ Melithreptus l~ Melithreptus ~ CU*STAR 0.988 ## ## [[6]] ## # A tibble: 1 x 5 ## user_supplied_na~ submitted_name matched_name data_source_title score ## * ## 1 rad isdlvorniode Rad isdlvorni~ Rad Baker & ~ The Interim Regist~ 0.75  ### Conclusion #### rOpenSci packages supporting this (and your) workflow In this post, we made use of R packages quite useful to wrangle information from diverse formats: We also used a function from taxize allowing us to use the Global Name Resolver. Discover more packages from our suite here. #### Applicability of this OCR bird naming workflow Actually, the BHL itself provides OCR output for its collection, see this example. I wasn’t able to find information about the software powering this OCR. What I was able to find out is that the BHL uses purposeful gaming in its OCR workflow. The raw OCR results aren’t much better than what we got in this post which is comforting. #### More data from the Biodiversity Heritage Library If you’re interested in other types of data from the BHL, in addition to the images, have a look at the rbhl paclage, part of rOpenSci’s suite, that interacts with the BHL API. One can e.g. search all books by the same author as the one we used images from. author <- rbhl::bhl_authorsearch("Gregory M Mathews") books <- rbhl::bhl_getauthortitles(creatorid = author$CreatorID)
head(books$FullTitle) ## [1] "A manual of the birds of Australia," ## [2] "A list of the birds of the Phillipian sub-region : which do not occur in Australia. " ## [3] "A manual of the birds of Australia /" ## [4] "A list of the birds of Australia : containing the names and synonyms connected with each genus, species, and subspecies of birds found in Australia, at present known to the author /" ## [5] "Austral avian record; a scientific journal devoted primarily to the study of the Australian avifauna." ## [6] "Arcana, or, The museum of natural history : containing the most recent discovered objects : embellished with coloured plates, and corresponding descriptions : with extracts relating to animals, and remarks of celebrated travellers; combining a general survey of nature."  Or we could get all books whose title contains the words “birds” and “australia”. australia_birds <- rbhl::bhl_booksearch(title = "birds Australia") head(australia_birds$FullTitle)

## [1] "Handbook to the birds of Australia. : [Supplementary material in Charles Darwin's copy]."
## [2] "An introduction to The birds of Australia /"
## [3] "The useful birds of southern Australia : with notes on other birds /"
## [4] "The Birds of Australia"
## [5] "The birds of Australia,"
## [6] "The birds of Australia,"


And to get the OCR results of the pages of the book we used, we could write:

library("magrittr")

# ocr=TRUE to extract OCR for all pages
rbhl::bhl_getitempages("250938", ocr = TRUE) %>%
# for each page transform the type into a string
dplyr::group_by(PageUrl) %>%
dplyr::mutate(page_type = toString(PageTypes[[1]])) %>%
# keep only the illustration pages
# that are like the ones we used
dplyr::filter(page_type == "Illustration") %>%
# from the data.frame extract the OCR
dplyr::pull(OcrText) %>%

## [1] "491 \nFAL CUNCULUS LEUCOGASTER. \n( WHITE -BELLIED £ If BIKE - TIT) \nFALCUNCULUS FRONTATUS. \nSHRIKE - TIT). \n"
## [2] "492 \nA** \nOREOICA GUTTURALIS. \n(CRESTED BELL-BIRD). \n"
## [3] "APHELOCEPHALA LEUCOPSIS \n( WHITE FACE ). \n"
## [4] "* \nAPHELOCEPHALA PE CTORALIS. \n(CHE <3 TNUT -BREASTED WHITEFA CEj. \nAPHELOCEPHALA NIGRICINCTA. \n(BE A CK-BAH.DED WHITE FA CEj. \n"
## [5] "H . Gronvold. del. \nWitherLy & C° \nSPHENOSTOMA CRIS TATUM \n(WEDGE BIEL). \n"
## [6] "49 6 \nH \n(jronvolcl. del. \nN E O SIT TA LE CJ C O CE PHAI.A. \n( WHITE ¦ HE AID EE THE EH UN HE FL). \nNEOSHTA ALBATA \n(F IE E> T Ft E EE UNNEFlj. \nNEOSITTA CHRYSOPTERA \nf OFi. A. NGE - wing-e d tree runner). \nWitWLjA \n"


So there’s quite a lot to explore!

#### More birding soon!

Stay tuned for the next post in this series, about getting and using bird taxonomic and trait data in R! In the meantime, happy birding!

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.