Nesting Birds and Models in R Dataframes

December 15, 2018
By Laurens Geffert

(This article was first published on Laurens Geffert, and kindly contributed to R-bloggers)

R dataframes in the tidyverse are more than just simple tables these days.
They can store complex information in list columns, and this becomes an
immensely powerful framework when we use it to apply methods to different
sets of data in parallel. In this article I illustrate this approach using
data for a rare UK bird species to investigate whether its distribution has
been impacted by climate change.
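As a quick taste of what that looks like, here is a minimal toy sketch (not part of the case study below): a list column can hold a whole dataframe in each cell, and purrr lets us map functions over it.

library(tidyverse)

# toy example: each cell of the "data" column holds a whole dataframe
df <- tibble(
  group = c("a", "b"),
  data  = list(tibble(x = 1:3), tibble(x = 4:5)))

# apply a function to each element of the list column
df %>% mutate(n_rows = map_int(data, nrow))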

Motivation

After recently seeing a Hadley Wickham lecture on nested models, I became
incredibly excited about nested dataframes with S3 objects in list columns
again.

Hadley uses this approach for data exploration but I think it is also very powerful
for iterative workflows and for experimentation or hypothesis testing on large
datasets. For example, when working on my PhD thesis I was
routinely fitting hundreds of machine learning models at once. All models
used the same predictor set and only varied in hyperparameters as well as label
data. Yet, I had to run them in separate parallel processes and load the data
into each of these. Moreover, when capturing results I often looked to the
list class for help. This did the job but also meant that I had to be very
careful about which results belonged to which data, which hyperparameters, and
which model object.

Enter nested dataframes. They still rely on the list class, but they nicely
organise the corresponding data elements together, in accordance with the
tidy data framework.
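As a minimal sketch of the pattern (a toy example on mtcars, not the case-study data): group a dataframe, nest it, fit one model per group, and keep data, model, and results side by side in the same row.

library(tidyverse)

# one linear model per cylinder group, stored next to the data it was fitted on
many_models <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    fit = map(data, ~ lm(mpg ~ wt, data = .x)),
    r_squared = map_dbl(fit, ~ summary(.x)$r.squared))

many_models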

Data

I decided to explore this framework hands-on, using a small exemplary case
study in the domain of species distribution modelling (the kind of model I
mentioned earlier). For this type of modelling task we need species
occurrence data (our “label”, “response”, or Y) and climatic variables
(the “predictors”, or X).

Species Data

After browsing the web for a suitable case study species for a while I decided on
the Scottish Crossbill (Loxia scotica). This is a small passerine bird that
inhabits the Caledonian Forests of Scotland, and is the only terrestrial
vertebrate species unique to the United Kingdom. Only ~ 20,000 individuals
of this species are alive today.

Getting species occurrence data used to be the main challenge in Biogeography.
Natural historians such as Charles Darwin and Alexander von Humboldt would
travel for years on rustic sailing ships around the globe collecting specimens.
Today, we are standing on the shoulders of giants. Getting data is fast and
easy thanks to the work of two organisations:

  • the Global Biodiversity Information Facility (GBIF),
    an international network and research infrastructure funded by the world’s
    governments and aimed at providing anyone, anywhere, open access to data about
    all types of life on Earth. We will use their data in this project.

  • rOpenSci, a non-profit initiative that has developed
    an ecosystem of open source tools, runs annual unconferences, and reviews
    community developed software. They provide an R package called rgbif that
    I once made a humble contribution to. It is essentially a wrapper around the
    GBIF API and will help us pull the species data straight into R.

library(tidyverse)
library(rgbif)

# get the database id ("key") for the Scottish Crossbill
speciesKey <- name_backbone('Loxia scotica')$speciesKey

# get the occurrence records of this species
gbif_response <- occ_search(
  scientificName = "Loxia scotica", country = "GB",
  hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
  limit = 9999)

# backup to reduce API load
write_rds(x = gbif_response, path = here::here('gbif_occs_loxsco.rds'))

GBIF and rOpenSci just saved us years of roaming around the highlands with a pair of binoculars, camping in mud, rain, and snow, and chasing crossbills through the forest. Nevertheless, it is still up to us to make sense of the data we got back, in particular to clean it, as data collected at this large scale can have its own issues. Luckily, GBIF provides some useful metadata on each record. Here, I will keep only records that

  • are tagged as “present” (others may be artifacts from collections)
  • have no flagged issues (nobody has noticed anything abnormal with them)
  • are under a Creative Commons license (so we can use them here)
  • are from 1965 or later

After cleaning the data we use tidyr::nest() to aggregate the data by decade.

library(lubridate)

birds_clean <- gbif_response$data %>%
  # get decade of record from eventDate
  mutate(decade = eventDate %>%
           ymd_hms() %>%
           round_date("10y") %>%
           year() %>%
           as.numeric()) %>%
  # clean data using metadata filters
  filter(
    # only Creative Commons licensed records
    str_detect(license, "http://creativecommons.org/") &
    # only records with no flagged issues
    issues == "" &
    # no records before 1965 (years from 1965 onwards round to the 1970 decade)
    decade >= 1970 &
    # no records after 2015 (there is not a lot of data for the 2020 decade yet)
    decade < 2020) %>%
  # retain only relevant variables
  select(decimalLongitude, decimalLatitude, decade) %>%
  arrange(decade)

birds_nested <- birds_clean %>%
  # define the nesting index
  group_by(decade) %>%
  # aggregate data in each group
  nest()

# let's have a look
glimpse(birds_nested)

(Output: a nested dataframe with one row per decade and a list column “data” holding that decade’s occurrence records.)
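To look inside one of the nested cells we can pull out the list column and pluck a single element (a quick check, assuming the 1990 decade is present in the cleaned data):

# inspect the occurrence records nested under a single decade
birds_nested %>%
  filter(decade == 1990) %>%
  pull(data) %>%
  pluck(1)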

Climate data

For the UK, the Met Office had some nice climatic datasets available.
They were in a horrible format (CSV with timesteps, variable types, and geospatial
information spread across rows, columns, and file partitions) but I managed to
transform them into something usable. The details of this are beyond the scope
of this post, but if you are interested in the code for that you can check it
out here.

The final rasters look like this:
(Figure: plots of the final climate rasters.)

Modelling

We’ll split the data into a training and a test set, with a true temporal
holdout made up of all data collected between 2005 and 2015. Note that the
df_nested object used below is the nested dataframe after the occurrence
records have been combined with background (“pseudo-absence”) points and the
climate values for each decade, which is why it contains a 0/1 presence
column alongside the climate predictors.

# last pre-processing step
df_modelling <- df_nested %>%
  # get into modelling format
  unnest() %>%
  # caret requires a factor response variable for classification
  mutate(presence = case_when(
    presence == 1 ~ "presence",
    presence == 0 ~ "absence") %>%
    factor()) %>%
  # drop all observations with NA variables
  na.omit()

# create a training set for the model build
df_train <- df_modelling %>%
  # true temporal split: hold out the 2010 decade (2005-2015)
  filter(decade != 2010) %>%
  # drop decade, it's not needed anymore
  dplyr::select(-decade)

# same steps for the test set
df_test <- df_modelling %>%
  filter(decade == 2010) %>%
  dplyr::select(-decade)

Species responses to environmental variables are often non-linear. For example,
a species usually can’t survive if it is too cold, but it can’t deal with too
much heat either. It needs the “sweet spot” in the middle. Linear models like
a glm are not very useful under these circumstances. On the other hand,
algorithms such as random forest can easily overfit to this kind of data.
I therefore decided to test a regularised random forest (RRF) as introduced
by Deng (2013), hoping that it would
offer just the right ratio of bias vs variance for this use case.
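To illustrate the “sweet spot” idea with a toy curve (a hypothetical response, not fitted to any real data): suitability peaks at an intermediate temperature and falls off on both sides, a shape that a single linear term cannot capture.

library(tidyverse)

# a hypothetical unimodal response: suitability peaks around 10 degrees C
tibble(temperature = seq(-5, 25, length.out = 200)) %>%
  mutate(suitability = exp(-(temperature - 10)^2 / 30)) %>%
  ggplot(aes(temperature, suitability)) +
  geom_line() +
  labs(x = "Temperature (°C)", y = "Relative habitat suitability")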

The caret package makes the model fitting incredibly easy! All we need to do
is specify a tuning grid of hyperparameters that we want to optimise, a tune
control that adjusts the number of iterations and the loss function used, and
then call train with the algorithm we have picked.

library(caret)
library(RRF)

# for reproducibility
set.seed(12345)

# set up model fitting parameters
# tuning grid, trying every possible combination
tuneGrid <- expand.grid(
  mtry = c(3, 6, 9),
  coefReg = c(.01, .03, .1, .3, .7, 1),
  coefImp = c(.0, .1, .3, .6, 1))
tuneControl <- trainControl(
  method = 'repeatedcv',
  classProbs = TRUE,
  number = 10,
  repeats = 2,
  verboseIter = TRUE,
  summaryFunction = twoClassSummary)
# actual model build
model_fit <- train(
  presence ~ .,
  data = df_train,
  method = "RRF",
  metric = "ROC",
  tuneGrid = tuneGrid,
  trControl = tuneControl)

plot(model_fit)

We can evaluate the performance of this model on our hold-out data from
2005 – 2015. Just as during training, we are using the Area under the Receiver
Operating Characteristic curve (AUC). With this metric, a model no better than
random would score 0.5 while a perfect model making no mistakes would score 1.

# combine prediction with validation set
df_eval <- tibble(
  "obs" = df_test$presence,
  "pred" = predict(
    object = model_fit,
    newdata = df_test,
    type = "prob") %>%
    # pull the first class probability column ("absence"), which matches
    # the first factor level that yardstick treats as the event by default
    pull(1))

# get ROC value
library(yardstick)
roc_auc_vec(estimator = "binary", truth = df_eval$obs, estimate = df_eval$pred)

Now we can combine the raw data, model performance, and predictions all in one
nested dataframe, and save it for later to make sure we always know which
data was used to build which model. (climate_nested below is the nested
climate dataframe built in the climate-data step, with one raster stack per
decade in its raster_stacks list column.)

df_eval <- df_modelling %>%
  group_by(decade) %>% nest() %>%
  # combine with climate data
  left_join(climate_nested, by = "decade") %>%
  # evaluate by decade
  mutate(
    "obs" = map(
      .x = data,
      ~ .x$presence),
    "pred" = map(
      .x = data,
      ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence")),
    "auc" = map2_dbl(
      .x = obs,
      .y = pred,
      ~ roc_auc_vec(
          estimator = "binary",
          truth = .x,
          estimate = .y)),
    "climate_data" = map(
      .x = raster_stacks,
      ~ as(.x, "SpatialPixelsDataFrame") %>%
        as_data_frame() %>%
        na.omit()),
    "habitat_suitability" = map(
      .x = climate_data,
      ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence"))
    )

df_eval

Conclusion

Let’s look at the change over time using gganimate. Unfortunately, we can see
that the suitable area for the species in the UK has been drastically decreasing
since 1985. Not all species are negatively affected by climate change, but many
are. And this is just one of the many unintended consequences of our impact on
planet Earth.
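The animation code itself is not shown in the post; here is a hedged sketch of how it could be produced with gganimate, assuming the climate_data tibbles carry x and y coordinate columns from the SpatialPixelsDataFrame conversion:

library(tidyverse)
library(gganimate)

# attach each decade's predicted suitability to its climate grid
plot_data <- df_eval %>%
  transmute(
    decade,
    preds = map2(climate_data, habitat_suitability,
                 ~ mutate(.x, suitability = .y))) %>%
  unnest(preds)

# animate the suitability maps, one frame per decade
ggplot(plot_data, aes(x = x, y = y, fill = suitability)) +
  geom_raster() +
  coord_equal() +
  transition_states(decade) +
  labs(title = "Decade: {closest_state}", fill = "Habitat suitability")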

(Animation: predicted habitat suitability across the UK, one frame per decade.)

I hope that you enjoyed this blog post despite our pessimistic findings.
As you can see, nested dataframes with list columns are immensely powerful
in a range of situations. I will certainly use them a lot more in the future.
Please let me know in the comments if you will, too!
