Modeling the Extinction of Species with SVM-Kernel

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the last article, we analyzed carbon emissions and the effects that created them. This time I want to look into another important environmental issue, animal biodiversity; by animals, I mean mammals, birds, fish, reptiles, and amphibians.

The metric we are going to be interested in is the living planet index which measures the change in the number of 31,831 populations across 5,230 species relative to the year 1970. The explanatory variables we will take, are annual carbon emissions per capita(co2), annual gross domestic product per capita(gdp), and regions(region).

First, we will compare the living planet index(lpi) by region. To do that, we will create our datasets; because we do that by regions, we will summarize the co2 and gdp variables by the relative region.

df_co <- read_csv("")
df_gdp <- read_csv("")
df_lpi <- read_csv("")
#Building the main dataset 
df_co_gdp <- 
  df_gdp %>% 
  left_join(df_co) %>% 
  clean_names() %>% 
  #continent names
  mutate(region = countrycode(sourcevar = entity,
                              origin = "",
                              destination = "region")) %>% 
    co2= annual_co2_emissions_per_capita,
    gdp= gdp_per_capita_ppp_constant_2017_international) %>% 
  na.omit() %>%
  #adjusting the region values for merging with the LPI data frame
    region = str_replace(region, "&", "and"),
    region=="South Asia" | region == "East Asia and Pacific" ~ "Asia and Pacific",
    region=="Middle East and North Africa" | region == "Sub-Saharan Africa" ~ "Africa",
    TRUE ~ region
  )) %>% 
  #summing the co2 and gdp values by the region, separately
  group_by(region,year) %>% 
  summarise(co2= sum(co2),
            gdp= sum(gdp))

df_tidy <- 
  df_lpi %>% 
  left_join(df_co_gdp) %>% 

#Comparing the regions by plotting
p <- 
  df_tidy %>% 
  ggplot(aes(x= year, 
             #text variable for hoverinfo
             text = glue("{number(round(lpi), suffix='%')}\n{year}")))+
  #the reference line
  #region texts
  geom_text(data = df_tidy %>% 
              group_by(region) %>% 
            aes(x= year, y= lpi,label=region),
            show.legend = FALSE,
            nudge_y = 4.2
  #the reference line text
  geom_text(aes(x = 1994, 
                y=105, color= "red",
                label = "The reference line(1970 = %100)"))+ 
  scale_y_continuous(breaks = pretty_breaks(),
                     labels = label_percent(scale = 1))+
  labs(title = "Living Planet Index")+
  bbc_style() +
  theme(legend.position = "none", #removes the legend keys
        plot.title = element_text(hjust = 0.5))

#plotly for interactive plotting
ggplotly(p, tooltip = c("text")) %>% 
  #removes the reference line info
  style(hoverinfo = "none",traces = 6)

When we look at the regions, it seems they are all under their 1970-year values. We will now examine the underlying reasons for this. To do that, we will first model the data with a support vector machine with kernel-based algorithms.

#Preprocessing the data
df_rec <- 
  recipe(lpi ~ ., data = df_train) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  #all numeric predictors should be in the same unit

#Modeling with svm with kernel-based algorithms
df_spec <- 
  svm_linear() %>% 
  set_engine("kernlab") %>% 

df_wf <- 
  workflow() %>% 
  add_recipe(df_rec) %>% 

#cross-validation for resamples
df_folds <- vfold_cv(df_train, strata = region)

#Resampling for the accuracy metrics
df_rs <- 
  df_wf %>% 
  fit_resamples(resamples = df_folds)

#computes the accuracy metrics  
# A tibble: 2 x 6
#  .metric .estimator  mean     n std_err .config       
#  <chr>   <chr>      <dbl> <int>   <dbl> <chr>         
#1 rmse    standard   4.32     10 0.387   Preprocessor1~
#2 rsq     standard   0.980    10 0.00528 Preprocessor1~

#fits the best model to the training set
#and evaluate it with the testing set.
final_fit <- last_fit(df_wf, df_split)

# A tibble: 2 x 4
#  .metric .estimator .estimate .config             
#  <chr>   <chr>          <dbl> <chr>               
#1 rmse    standard       7.27  Preprocessor1_Model1
#2 rsq     standard       0.963 Preprocessor1_Model1

The above accuracy results look fine to step forward with that model. Now, we can compute and draw permutation-based variable importance scores.

#Permutation-based variable importance

#Creating a preprocessed dataframe of the train dataset
imp_data <- 
  df_rec %>%
  prep() %>%
  bake(new_data = NULL)

#building the explainer-object
explainer_svm <- 
    df_spec %>% 
      fit(lpi ~ ., data = imp_data),
    data = imp_data %>% select(-lpi), 
    y = df_train$lpi,
    verbose = FALSE

#calculates the variable-importance measure
vip_svm <- 
    explainer = explainer_svm, 
    loss_function = loss_root_mean_square, 
    B = 100, #the number of permutations
    type = "difference",
    label =""

#Plotting ranking of the importance of explanatory variables
plot(vip_svm) +
  ggtitle("Mean variable-importance over 100 permutations", "")+
  theme(plot.title = element_text(hjust = 0.5))

The interesting thing we can extract from the above plot is that the carbon emission variable(co2) is a very weak effect on the LPI index. Instead of it, Europe and Latin America regions, and the gdp variables look to have far the most dominant effects on the index.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)