A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
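In symbols, the two ingredients above can be sketched as follows. The notation is the standard textbook one and is not taken from the exercise itself: $\eta_i$ is the linear predictor, $g$ the link function, $V$ the variance function and $\phi$ a dispersion parameter.

```latex
\begin{aligned}
\eta_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} && \text{(linear predictor)} \\
g(\mu_i) &= \eta_i, \qquad \mu_i = \mathrm{E}[Y_i] && \text{(link function)} \\
\mathrm{Var}(Y_i) &= \phi \, V(\mu_i) && \text{(variance as a function of the mean)}
\end{aligned}
```

For Poisson regression with its usual log link, $g(\mu) = \log \mu$, $V(\mu) = \mu$ and $\phi = 1$, so $\log \mu_i = \eta_i$ and the variance of each count equals its predicted mean.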
The GLMs used most often in this kind of analysis can be split into three groups:
• Poisson regression for count data with no over- or underdispersion
• Quasi-Poisson or negative binomial models for count data that are overdispersed
• Logistic regression models where the response data are binary (e.g. present or absent, male or female) or proportional (e.g. percentages)
In this exercise, we will focus on GLMs that use Poisson regression. Please download the dataset for this exercise here. The dataset comes from a study of the biogeographical determinants of ant species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type reported in their paper.
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the dataset and the required packages before starting the exercises.
Load the data and check the data structure using the scatterplotMatrix function. Assess the covariation and patterns in the data.
Fit the GLM and run a variance inflation factor (VIF) analysis to check for inflation. Pay attention to collinearity among the predictors.
If there are any collinearity issues, try centering the predictor variables.
Re-run the VIF analysis with the centered variables.
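To see why centering can help, here is a minimal sketch in plain Python. It assumes the model contains a latitude-by-elevation product term, which the text above does not state; centering changes little when only main effects are present. The site values are made up and are not the ant data, and the helper names are hypothetical. The sketch computes the classical VIF from the predictors alone, as the diagonal of the inverse correlation matrix. A GLM-aware routine may weight the observations and return somewhat different values.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    centered = []
    for col in columns:
        m = sum(col) / len(col)
        centered.append([v - m for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    return [[sum(u * v for u, v in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Classical VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Made-up site data (NOT the ant dataset): latitude in degrees, elevation in metres.
lat = [41.9, 42.0, 42.2, 42.5, 42.6, 43.3, 44.3, 44.9]
elev = [389.0, 8.0, 152.0, 1.0, 210.0, 543.0, 30.0, 140.0]

# Predictors: two main effects plus their product, before and after centering.
raw = [lat, elev, [a * b for a, b in zip(lat, elev)]]
lat_c, elev_c = center(lat), center(elev)
centered = [lat_c, elev_c, [a * b for a, b in zip(lat_c, elev_c)]]

vif_raw = vif(raw)
vif_centered = vif(centered)
print("raw:     ", [round(v, 1) for v in vif_raw])
print("centered:", [round(v, 2) for v in vif_centered])
```

With the raw variables, the product term is almost a rescaled copy of elevation because latitude varies so little relative to its mean, which inflates the VIF. Centering removes that dependence.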
Check for influential data points or outliers using influence measures (Cook's distance) and plot the result. If all values are below 1, it is fine to proceed.
Check for overdispersion. The dispersion statistic needs to be around 1 before moving to the next step.
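The arithmetic behind these two checks can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the exercise solution: the counts and the single centered predictor are made up, the model has only an intercept and a slope, and the dispersion statistic is taken to be the Pearson chi-square divided by the residual degrees of freedom.

```python
import math

# Made-up example (NOT the ant data): centered latitude x and species counts y.
x = [-1.0, -0.8, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.4, 1.9]
y = [14, 8, 11, 6, 9, 10, 4, 7, 3, 4]
n, p = len(y), 2  # p = number of coefficients (intercept + slope)


def information(mu):
    """Entries a, b, c of the 2x2 matrix X'WX with W = diag(mu), and its determinant."""
    a = sum(mu)
    b = sum(m * xi for xi, m in zip(x, mu))
    c = sum(m * xi * xi for xi, m in zip(x, mu))
    return a, b, c, a * c - b * b


# Fit log(mu) = b0 + b1 * x by Newton-Raphson (Fisher scoring for the log link).
b0, b1 = math.log(sum(y) / n), 0.0
for _ in range(50):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    g0 = sum(yi - m for yi, m in zip(y, mu))              # score for the intercept
    g1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))  # score for the slope
    a, b, c, det = information(mu)
    step0 = (c * g0 - b * g1) / det
    step1 = (a * g1 - b * g0) / det
    b0, b1 = b0 + step0, b1 + step1
    if abs(step0) + abs(step1) < 1e-12:
        break

mu = [math.exp(b0 + b1 * xi) for xi in x]
a, b, c, det = information(mu)

# Hat values: h_i = mu_i * [1, x_i] (X'WX)^-1 [1, x_i]'.
hat = [m * (c - 2 * b * xi + a * xi * xi) / det for xi, m in zip(x, mu)]

# Pearson residuals and the dispersion statistic.
pearson = [(yi - m) / math.sqrt(m) for yi, m in zip(y, mu)]
dispersion = sum(r * r for r in pearson) / (n - p)

# Cook's distance, with the dispersion fixed at 1 as for a Poisson family.
cooks = [(r / (1 - h)) ** 2 * h / p for r, h in zip(pearson, hat)]

print(f"intercept={b0:.3f} slope={b1:.3f} dispersion={dispersion:.2f}")
print("largest Cook's distance:", round(max(cooks), 3))
```

A dispersion statistic well above 1 points towards the quasi-Poisson or negative binomial alternatives, and any Cook's distance close to or above 1 marks a site worth inspecting.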
Check the model summary. What can we infer?
Since we have several candidate variables, we use model averaging. The first step is to set the base R option that controls how missing values are handled. Then try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation, and habitat variables to produce the best model.
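Model averaging weights each candidate model by its relative support. The sketch below assumes AICc-based Akaike weights, which is one common choice. The text above does not say which criterion is used. The AICc values and model labels are hypothetical and are not results from the ant data.

```python
import math

# Hypothetical AICc values for four candidate models (illustrative only).
aicc = {
    "latitude + elevation + habitat": 216.1,
    "latitude + habitat": 219.8,
    "elevation + habitat": 221.3,
    "habitat": 226.0,
}

best = min(aicc.values())
# Delta AICc: distance of each model from the best (lowest) value.
delta = {name: value - best for name, value in aicc.items()}
# Relative likelihood exp(-delta / 2), normalized so the weights sum to 1.
relative = {name: math.exp(-d / 2) for name, d in delta.items()}
total = sum(relative.values())
weights = {name: r / total for name, r in relative.items()}

for name, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:32s} delta={delta[name]:5.1f} weight={w:.3f}")
```

Averaged coefficients are then weighted sums of the per-model estimates using these weights, so models with little support contribute little.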
Check the validation plots.
Produce a base plot and add the points of the predicted values.