O’Reilly animals in trouble? Conservation status of book covers

August 24, 2018

[This article was first published on Posts on Maëlle's R blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What can a kaka, a kakapo, an European rabbit and a grey heron have in
common? Well, they might co-habit in the bookshelf of an R user, since
they’re all animals on the covers of popular R books: “R
, “R for Data
, “Text mining with
and “Efficient R
, respectively.
Their publisher, O’Reilly, has now based its brand on covers featuring
beautiful gravures of animals.

Recently, while wondering what the name of R for Data Science bird was
again (I thought it was a kea!), I was thrilled to find the whole
O’Reilly menagerie
, i.e. a list of
books and corresponding animals! The website also features a link to “A
short history of the O’Reilly

that was an amazing read. In it was noted that “The animals are in
trouble.”, with a few examples of endangered species. It inspired me to
actually try and assess the conservation status of O’Reilly animals
using responsible webscraping, taxonomic name resolving and IUCN Redlist
API querying…

Scraping the menagerie: an utter delight!

I had a great time webscraping the menagerie, not only thanks to my now
reasonable experience doing such things, but also thanks to

The menagerie is divided into pages of 20 books, so I mapped over all
possible offsets up to the number of animals indicated on the website,

home_url <- "https://www.oreilly.com/animals.csp"

session <- polite::bow(home_url,
                       user_agent = "Maëlle Salmon https://masalmon.eu/")

get_twenty <- function(offset, session){
  # offset parameter to get all books 20 by 20
  params <- glue::glue("?x-o={offset}")
  # scraping with content parameter
  # cf https://github.com/dmi3kno/polite/issues/6
  # https://www.oreilly.com/animals.csp?x-o=720 was problematic
  # (German characters)
  page <- polite::scrape(session, params = params, 
                         content = "text/html;charset=iso-8859-1")
  # get all animal rows
  rows <- rvest::xml_nodes(page,
                           xpath = "//div[@class='animal-row']")
  # extract book titles
  rows %>%
    rvest::xml_nodes(xpath = "a[@class='book']") %>%
    rvest::xml_nodes(xpath = "h1[@class='book-title']") %>%
    rvest::html_text() -> book_titles
  rows %>%
    rvest::xml_nodes(xpath = "h2[@class='animal-name']") %>%
    rvest::html_text() -> animal_names
  tibble::tibble(book = book_titles,
                 animal = animal_names)

no_animals <- 1227 # by hand!

offsets <- (0:floor(no_animals/20))*20

purrr::map_df(offsets, get_twenty, session = session) %>%

I got 1134 rows, each corresponding to a book, with animals potentially

## # A tibble: 1,134 x 2
##    book                            animal                                 
##  1 Mobile Design and Development   12-Wired Bird of Paradise              
##  2 Windows PowerShell for Develop~ 3-Banded Armadillo                     
##  3 Jakarta Commons Cookbook        Aardvark                               
##  4 Clojure Cookbook                Aardwolf                               
##  5 Ubuntu: Up and Running          Addax, aka Screwhorn Antelope          
##  6 Social eCommerce                Adjutant (Storks)                      
##  7 BioBuilder                      Aegina Citrea, narcomedusae, jellyfish 
##  8 JRuby Cookbook                  African Civet                          
##  9 C# 5.0 Pocket Reference         African Crowned Crane aka Grey Crowned~
## 10 Programming C# 5.0              African Crowned Crane aka Grey Crowned~
## # ... with 1,124 more rows

In the short history of animals, Edie Freedman mentions having
discovered “that there were intriguing correspondences between specific
technologies and specific animals”. This made me curious about my last
name, Salmon!

animals %>%
  dplyr::filter(stringr::str_detect(animal, "[Ss]almon")) %>%
book animal
Values, Units, and Colors Salmon
CSS Text Salmon
CSS Fonts Salmon
Selectors, Specificity, and the Cascade Salmon
Transitions and Animations in CSS Salmon2

I have no idea what trait of salmons make them good at design, other
than my not sharing that trait with them. When my friend Adrien and I
wrote a (non O’Reilly)

years ago, we selected a frog for the cover based on its being pretty,
which is much less cool than O’Reilly branding!

From animals common names to scientific names?

Now, you’ll have noticed the names of animals are written in English. My
ultimate goal being the querying of IUCN Red List API, and this API only
accepting scientific names (contrary to the website of the same
organization), I needed to resolve the common names to scientific names.
This is a hard problem! My strategy here was:

  • Cleaning the names a bit to remove the parts after “aka” for
clean <- function(animal){
  semiclean <- animal %>%
    stringr::str_remove_all("aka.*") %>%
    stringr::str_remove_all("\\,.*") %>%
  if(semiclean == "12-Wired Bird of Paradise"){
    semiclean <- "Twelve-Wired Bird of Paradise"
  if(semiclean == "3-Banded Armadillo"){
    semiclean <- "Three-Banded Armadillo"
  stringr::str_remove_all(semiclean, "[0-9]")

animals <- dplyr::mutate(animals, animal_clean = purrr::map_chr(animal, clean))
  • Using the rOpenSci taxize
    that has a handy
    comm2sci function. This function works for anyone, but it’s better
    to request a key for the database used, EOL by default (see e.g.
    taxize::use_eol() for more info).

  • Not being too optimistic since the databases taxize queries cannot
    do wonders, no matter how good they are.

Note that for each species, the first scientific name returned is
selected, because there’s no other criterion to go by. That’s how I’ll
end up with a Salmon catfish for Salmon, too bad.

animal_names <- unique(animals$animal_clean)

# scientific names
good_comm2sci <- memoise::memoise(taxize::comm2sci)

get_name <- function(common_name){
  sci_names <- good_comm2sci(common_name)
  # don't get the name of who defined the species
  sci_name <- stringr::word(sci_names[[1]][1], start=1, end = 2)
  tibble::tibble(common_name = common_name,
                 sci_name = sci_name)
scientific_names <- purrr::map_df(animal_names, get_name)

animals <- dplyr::left_join(animals,
                            by = c("animal_clean" = "common_name"))

I got names for 694 books, out of 1134, getting 555 animals. It’s not
bad, but this number also needs to be treated with caution. See for

animals %>%
  dplyr::filter(stringr::str_detect(animal, "Galapagos")) %>%
book animal animal_clean sci_name
PHP Cookbook Galapagos Land Iguana Galapagos Land Iguana Conolophus marthae
Upgrading to PHP 5 Galapagos Tortoise Galapagos Tortoise Chelonoidis nigra

I noticed the iguana while perusing my results, and a quick internet
search taught me that there are three species of terrestrial iguanas
in the Galapagos, the most common one, and the one probably present on
the book cover, being Conolophus subcristatus, not Conolophus marthae!
I’ve noticed a few other mistakes, so I’ll need to handle the results
with care. I now wish the menagerie had a bit more Latin in it!

Querying the IUCN Red List

Indeed, scientific names of species are the key to a wealth of data!
Traits data, taxonomic
… and conservation
status thanks to the IUCN Red List, an
impressive assessment of species at the global scale. One can
programmatically query it using the rOpenSci rredlist
! That’s what I did, adding
a waiting time of 2 seconds between API calls, as recommended by the
IUCN folks
. Note
that I have an API key because I asked for it, see more info by typing
rredlist::rl_use_iucn() after installing rredlist, and be patient
since it can last a few days before one gets one.

slow_rl_search <- ratelimitr::limit_rate(rredlist::rl_search,
                                         rate = ratelimitr::rate(1, 2))

get_status <- function(sci_name){
  results <- slow_rl_search(sci_name)$result
    results$sci_name <- sci_name

animals <- dplyr::filter(animals, !is.na(sci_name))
purrr::map_df(unique(animals$sci_name), get_status) %>%
status <- readr::read_csv("oreilly_animals_status.csv")

animals <- readr::read_csv("oreilly_animals_scientific.csv")

status <- dplyr::filter(status, !is.na(category))
animals <- dplyr::left_join(animals, status, by = "sci_name")

## Classes 'tbl_df', 'tbl' and 'data.frame':    1134 obs. of  32 variables:
##  $ book              : chr  "Mobile Design and Development" "Windows PowerShell for Developers" "Jakarta Commons Cookbook" "Clojure Cookbook" ...
##  $ animal            : chr  "12-Wired Bird of Paradise" "3-Banded Armadillo" "Aardvark" "Aardwolf" ...
##  $ animal_clean      : chr  "Twelve-Wired Bird of Paradise" "Three-Banded Armadillo" "Aardvark" "Aardwolf" ...
##  $ sci_name          : chr  "Seleucidis melanoleuca" "Tolypeutes tricinctus" "Cucumis humifructus" "Proteles cristata" ...
##  $ taxonid           : int  NA 21975 NA 18372 NA 22697721 NA 41589 22692046 22692046 ...
##  $ scientific_name   : chr  NA "Tolypeutes tricinctus" NA "Proteles cristata" ...
##  $ kingdom           : chr  NA "ANIMALIA" NA "ANIMALIA" ...
##  $ phylum            : chr  NA "CHORDATA" NA "CHORDATA" ...
##  $ class             : chr  NA "MAMMALIA" NA "MAMMALIA" ...
##  $ order             : chr  NA "CINGULATA" NA "CARNIVORA" ...
##  $ family            : chr  NA "CHLAMYPHORIDAE" NA "HYAENIDAE" ...
##  $ genus             : chr  NA "Tolypeutes" NA "Proteles" ...
##  $ main_common_name  : chr  NA "Brazilian Three-banded Armadillo" NA "Aardwolf" ...
##  $ authority         : chr  NA "(Linnaeus, 1758)" NA "(Sparrman, 1783)" ...
##  $ published_year    : int  NA 2014 NA 2015 NA 2016 NA 2015 2016 2016 ...
##  $ category          : chr  NA "VU" NA "LC" ...
##  $ criteria          : chr  NA "A2cd" NA NA ...
##  $ marine_system     : logi  NA FALSE NA FALSE NA FALSE ...
##  $ freshwater_system : logi  NA FALSE NA FALSE NA TRUE ...
##  $ terrestrial_system: logi  NA TRUE NA TRUE NA TRUE ...
##  $ assessor          : chr  NA "Miranda, F., Moraes-Barros, N., Superina, M. & Abba, A.M." NA "Green, D.S." ...
##  $ reviewer          : chr  NA "Loughry, J." NA "Dloniak, S.M.D. & Holekamp, E." ...
##  $ aoo_km2           : chr  NA NA NA NA ...
##  $ eoo_km2           : chr  NA "937000" NA NA ...
##  $ elevation_upper   : int  NA NA NA 2000 NA 550 NA 2500 NA NA ...
##  $ elevation_lower   : int  NA NA NA 0 NA 0 NA 0 0 0 ...
##  $ depth_upper       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ depth_lower       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ errata_flag       : logi  NA NA NA NA NA NA ...
##  $ errata_reason     : chr  NA NA NA NA ...
##  $ amended_flag      : logi  NA NA NA NA NA NA ...
##  $ amended_reason    : chr  NA NA NA NA ...

There are 1134 books, 499 with a conservation status from the IUCN Red
List, although this includes “DD” meaning “Data Deficient”. I am
hesitant to actually show the proportion of species in each category for
those for which I got data for, because the resolution of common names
to scientific names isn’t certain… Take the following table with a pinch
of salt!

dplyr::count(animals, category) %>%
category n
CR 14
DD 7
EN 42
EW 1
EX 6
LC 348
LR/cd 1
LR/lc 5
LR/nt 4
NT 23
VU 48
NA 635

See the following page for more precise information about
LC is least concern. Let’s have a look at the extinct species.

animals %>%
  dplyr::filter(category == "EX") %>%
  dplyr::select(book, animal, sci_name) %>%
book animal sci_name
Java Data Objects Bilby, Rabbit-eared Bandicoot (Macrotis lagotis) Macrotis leucura
Building and Testing with Gradle Bush Wren Xenicus longipes
Designing Mobile Payment Experiences Crested Pigeon Microgoura meeki
SSH, The Secure Shell: The Definitive Guide Land Snail Amastra crassilabrum
Java NIO Pigfooted Bandicoot Chaeropus ecaudatus
Java I/O White Rabbit Macrotis leucura

I searched for the covers and names and could assess that in that table,
there are 4 false positives due to the ambiguity of common names! Only
the Bush wren and the Pigfooted Bandicoot got scientific names
corresponding to what they look like, and are extinct, which is quite

Now, to reverse-engineer what Edie Freedman wrote in the short history of
O’Reilly animals, “Many of the animals that appear on our covers are
critically endangered—the tarsier from Learning the vi & Vim Editors,
the lorises from sed & awk, the Hawksbill turtle from Getting Started
with CouchDB, the tiger from Running Mac OS X Tiger, and the African
elephant on Hadoop: The Definitive Guide, just to name a few.”, let’s
look at what we got for them.

animals %>%
  dplyr::filter(book %in%
                  c("Hadoop: The Definitive Guide",
                    "Learning the vi and Vim Editors",
                    "sed & awk",
                    "Getting Started with CouchDB",
                    "Running Mac OS X Tiger")) %>%
  dplyr::select(book, animal, sci_name, category) %>%
book animal sci_name category
Hadoop: The Definitive Guide African Elephant, young Elephantulus rozeti LC
Getting Started with CouchDB Hawksbill Turtle Eretmochelys imbricata CR
sed & awk Slender Loris “Awk” NA NA
Running Mac OS X Tiger Sumatran Tiger Parantica tityoides LR/nt
Learning the vi and Vim Editors Tarsier, full-body, standing on hind feet, b/w engraving Tarsius pelengensis EN

Again, our name resolution wasn’t very good!

  • The elephant should be Loxondota africana, vulnerable

  • The turtle is right.

  • For the loris we should have gotten this

  • The Sumatran tiger, Panthera tigris ssp. sumatrae , is critically

  • There are several Tarsier species, I’m not sure which one is the
    right one.

So all in all, we got some truth but also some wrong names and hence
wrong conservation statuses!

Conclusion: hoping for a menagerie of scientific names

In this post, I exemplified responsible webscraping with the use of the
polite package
to get a table of
all animals on O’Reilly book covers from the dedicated menagerie. I
tried resolving the common names to scientific names using
taxize::comm2sci, which was only
partly successful. I got conservation status for the scientific names
using the rredlist package,
programmatic interface to the IUCN Red List. The results would be better
if O’Reilly published scientific names of animals, but nonetheless this
workflow helped me identify two extinct species, the Bush wren of
Building and Testing with
and the
Pigfooted Bandicoot of Java
. I can’t but hope
the list of such now sad book covers won’t grow any longer…

To leave a comment for the author, please follow the link and comment on their blog: Posts on Maëlle's R blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)