R Dataframes in the tidyverse are more than just simple tables these days.
They can store complex information in list columns, and this becomes an
immensely powerful framework when we use it to apply methods to different
sets of data in parallel. In this article I illustrate this approach using
data for a rare UK bird species to investigate if its distribution has been
impacted by climate change.
After recently seeing a Hadley Wickham lecture on nested models I became
incredibly excited about nested dataframes with s3 objects in list columns
again. Here is the video I am talking about:
Hadley uses this approach for data exploration but I think it is also very powerful
for iterative workflows and for experimentation or hypothesis testing on large
datasets. For example, when working on my PhD thesis I was
routinely fitting hundreds of machine learning models at once. All models
used the same predictor set and only varied in hyperparameters as well as label
data. Yet, I had to run them in separate parallel processes and load the data
into each of these. Moreover, when capturing results I often looked to the
list class for help. This did the job but also meant that I had to be very
careful about which results belonged to which data, which hyperparameters, and
which model object.
Enter nested dataframes. They still rely on the list class, but they nicely
organise the corresponding data elements together, in accordance with the
tidy data framework
I decided to explore this framework hands-on, using a small exemplary case
study in the domain of species distribution modelling. This is what the models
I mentioned earlier were. For this type of modelling task we need species
occurrence data (our “label”, “response”, or Y) and climatic variables
(the “predictors”, or X)
After browsing the web for a suitable case study species for a while I decided on
the Scottish Crossbill (Loxia scotica). This is a small passerine bird that
inhabits the Caledonian Forests of Scotland, and is the only terrestrial
vertebrate species unique to the United Kingdom. Only ~ 20,000 individuals
of this species are alive today.
Getting species occurrence data used to be the main challenge in Biogeography.
Natural Historians such as Charles Darwin and Alexander von Humboldt would
travel for years on rustic sail ships around the globe collecting specimen.
Today, we are standing on the shoulders of giants. Getting data is fast and
easy thanks to the work of two organisations:
the Global Biodiversity Information Facility (GBIF),
an international network and research infrastructure funded by the world’s
governments and aimed at providing anyone, anywhere, open access to data about
all types of life on Earth. We will use their data in this project.
rOpenSci, a non-profit initiative that has developed
an ecosystem of open source tools, runs annual unconferences, and reviews
community developed software. They provide an R package called
I once made a humble contribution to. It is essentially a wrapper around the
GBIF API will help us access the species data straight into R.
library(tidyverse) library(rgbif) # get the database id ("key") for the Scottish Crossbill speciesKey <- name_backbone('Loxia scotica')$speciesKey # get the occurrence records of this species gbif_response <- occ_search( scientificName = "Loxia scotica", country = "GB", hasCoordinate = TRUE, hasGeospatialIssue = FALSE, limit = 9999) # backup to reduce API load write_rds(x = gbif_response, path = here::here('gbif_occs_loxsco.rds'))
GBIF and rOpenSci just saved us years or roaming around the highlands with a pair of binoculars, camping in mud, rain, and snow, and chasing crossbills through the forest. Nevertheless, it is still up to us to make sense of the data we got back, in particular to clean it, as data collected on this large scale can have its own issues. Luckily, GBIF provides some useful metadata on each record. Here, I will exclude those that
- are not tagged as “present” (they may be artifacts from collections)
- don’t have any flagged issues (nobody has noticed anything abnormal with this)
- are under creative commons license (we can use them here)
- are older than 1965
After cleaning the data we use
tidyr::nest() to aggregate the data by decade.
library(lubridate) birds_clean <- gbif_response$data %>% # get decade of record from eventDate mutate(decade = eventDate %>% ymd_hms() %>% round_date("10y") %>% year() %>% as.numeric()) %>% # clean data using metadata filters filter( # only creative commons license records str_detect(license, "http://creativecommons.org/") & # only records with no issues issues == "" & # no records before 1965 decade >= 1970 & # no records after 2015 (there is not a lot of data yet) decade < 2020) %>% # retain only relevant variables select(decimalLongitude, decimalLatitude, decade) %>% arrange(decade) birds_nested <- birds_clean %>% # define the nesting index group_by(decade) %>% # aggregate data in each group nest() # let's have a look glimpse(birds_nested)
For the UK the MetOffice had some nice climatic datasets available.
They were in a horrible format (CSV with timesteps, variable types, and geospatial
information spread across rows, columns, and file partitions) but I managed to
transform them into something useable. The details of this are beyond the scope
of this post, but if you are interested in the code for that you can check it
The final rasters look like this:
We’ll split the data in training and test set with a true temporal holdout
from all data collected between 2005 – 2015.
# last pre-processing step df_modelling <- df_nested %>% # get into modelling format unnest() %>% # caret requires a factorial response variable for classification mutate(presence = case_when( presence == 1 ~ "presence", presence == 0 ~ "absence") %>% factor()) %>% # drop all observations with NA variables na.omit() # create a training set for the model build df_train <- df_modelling %>% # true temporal split as holdout filter(decade != "2010") %>% # drop decade, it's not needed anymore dplyr::select(-decade) # same steps for test set df_test <- df_modelling %>% filter(decade == "2010") %>% dplyr::select(-decade)
Species responses to environmental variables are often non-linear. For example,
a species usually can’t survive if it is too cold, but it can’t deal with too
much heat either. It needs the “sweet spot” in the middle. Linear models like
a glm are not very useful under these circumstances. On the other hand,
algorithms such as random forest can easily overfit to this kind of data.
I therefore decided to test a regularised random forest (RFF) as introduced
by Deng (2013), hoping that it would
offer just the right ratio of bias vs variance for this use case.
Caret makes the model fitting incredibly easy! All we need to do is specify
a tuning grid of hyperparameters that we want to optimise, a tune control
that adjusts the number of iterations and the loss function used, and then
call train with the algorithm we have picked.
library(RRF) # for reproducibility set.seed(12345) # set up model fitting parameters # tuning grid, trying every possible combination tuneGrid <- expand.grid( mtry = c(3, 6, 9), coefReg = c(.01, .03, .1, .3, .7, 1), coefImp = c(.0, .1, .3, .6, 1)) tuneControl <- trainControl( method = 'repeatedcv', classProbs = TRUE, number = 10, repeats = 2, verboseIter = TRUE, summaryFunction = twoClassSummary) # actual model build model_fit <- train( presence ~ ., data = df_train, method = "RRF", metric = "ROC", tuneGrid = tuneGrid, trControl = tuneControl) plot(model_fit)
We can evaluate the performance of this model on our hold-out data from
2005 – 2015. Just as uring training we are using the Area under the Receiver
Operator Characteristic curve (AUC). With this metric, a model no bettern than
random would score 0.5 while a perfect model making no mistakes would score 1.
# combine prediction with validation set df_eval <- data_frame( "obs" = df_test$presence, "pred" = predict( object = model_fit, newdata = df_test, type = "prob") %>% pull(1)) # get ROC value library(yardstick) roc_auc_vec(estimator = "binary", truth = df_eval$obs, estimate = df_eval$pred)
Now we can combine the raw data, model performance, and predictions all in one
nested dataframe. We can save this for later to make sure we always know what
data was used to build which model.
df_eval <- df_modelling %>% group_by(decade) %>% nest() %>% # combine with climate data left_join(climate_nested, by = "decade") %>% # evaluate by decade mutate( "obs" = map( .x = data, ~ .x$presence), "pred" = map( .x = data, ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence")), "auc" = map2_dbl( .x = obs, .y = pred, ~ roc_auc_vec( estimator = "binary", truth = .x, estimate = .y)), "climate_data" = map( .x = raster_stacks, ~ as(.x, "SpatialPixelsDataFrame") %>% as_data_frame() %>% na.omit()), "habitat_suitability" = map( .x = climate_data, ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence")) ) df_eval
Let’s look at the change over time using
gganimate. Unfortunately, we can see
that the suitable area for the species in the UK is drastically decreasing
after 1985. Not all species are negatively affected by climate change but many
are. And this is just one of the many unintended consequences of our impact on
I hope that you enjoyed this blog post despite our pessimistic findings.
As you can see nested dataframes with list columns are immensely powerful
in a range of situations. I will certainly use them a lot more in the future.
Please let me know in the comments if you are, too!