Basic Generalised Linear Modelling – Part 1: Exercises

July 18, 2018

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
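For the Poisson regression used in these exercises, the two ingredients named above (a link function, and a variance that depends on the predicted value) can be written out as follows. The symbols are notation introduced here for illustration: y_i is the count for observation i, mu_i its expected value, and x_i1 ... x_ip the predictors.

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad
\mathrm{Var}(y_i) = \mu_i
```

The log link keeps the predicted counts positive, and the last equality is the assumption that the overdispersion check in Exercise 6 tests.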

Commonly used GLMs can be split into three groups:
Poisson regression for count data with no over- or underdispersion issues
Quasi-Poisson or negative binomial models for count data that are overdispersed
Logistic regression models where the response data are binary (e.g. present or absent, male or female) or proportional (e.g. percentages)

In this exercise, we will focus on GLMs that use Poisson regression. Please download the dataset for this exercise here. The dataset was used to investigate the biogeographical determinants of ant species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type from their paper.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the dataset and the required packages before running the exercises.

Exercise 1
Load the data and check the data structure using the scatterplotMatrix function. Assess the covariation and patterning of the data.
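A minimal sketch of this kind of check, run on simulated data rather than the exercise dataset. The column names (Srich, Latitude, Elevation, Habitat) and the simulation coefficients are assumptions made for illustration. scatterplotMatrix comes from the car package; base R's pairs() is used here as a stand-in so the sketch runs without extra packages.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

str(ants)  # variable types and dimensions

# Scatterplot matrix of the numeric variables, coloured by habitat.
pairs(ants[, c("Srich", "Latitude", "Elevation")],
      col = as.integer(ants$Habitat), pch = 19)

# Pairwise correlation between the continuous predictors.
pred_cor <- cor(ants[, c("Latitude", "Elevation")])
print(round(pred_cor, 2))
```

Strong correlations between predictors in this matrix are an early warning of the collinearity examined in the next exercise.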

Exercise 2
Fit the GLM and run a VIF (variance inflation factor) analysis to check for variance inflation. Pay attention to collinearity.
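The sketch below shows one way to obtain variance inflation factors from a fitted Poisson GLM using base R only. The data are simulated, the column names are assumptions, and the Latitude-by-Elevation interaction is included purely to show how inflation can arise; the model in the solutions may differ. The helper vif_from_glm is hypothetical: it takes the diagonal of the inverse correlation matrix of the coefficient estimates, which is a standard VIF definition for one-degree-of-freedom terms.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

# Poisson GLM with an (assumed) interaction between the raw predictors.
m_raw <- glm(Srich ~ Habitat + Latitude * Elevation,
             family = poisson, data = ants)

# Hypothetical helper: VIF from the correlation matrix of the coefficient
# estimates, with the intercept dropped.
vif_from_glm <- function(model) {
  V <- vcov(model)[-1, -1]
  diag(solve(cov2cor(V)))
}

vif_raw <- vif_from_glm(m_raw)
print(round(vif_raw, 1))
```

A value of 1 means no inflation. Rules of thumb for "too high" vary between sources, so the threshold is a judgment call.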

Exercise 3
If there are any issues with covariation, try centering the predictor variables.
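Centering subtracts the mean from each continuous predictor. As a sketch on simulated data (column names are assumptions), the snippet below compares the correlation between a predictor and its interaction term before and after centering. This assumes the collinearity comes from interaction terms, which is the situation centering typically helps with.

```r
# Simulated stand-in for the ant predictors; column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550)
)

# Centre each continuous predictor on its mean (no rescaling).
ants$cLatitude  <- as.numeric(scale(ants$Latitude,  center = TRUE, scale = FALSE))
ants$cElevation <- as.numeric(scale(ants$Elevation, center = TRUE, scale = FALSE))

# Correlation between a main effect and the product term, raw vs centred.
raw_cor     <- cor(ants$Elevation,  ants$Latitude  * ants$Elevation)
centred_cor <- cor(ants$cElevation, ants$cLatitude * ants$cElevation)
print(round(c(raw = raw_cor, centred = centred_cor), 3))
```

Centering changes the interpretation of the intercept and main effects (they now refer to the average latitude and elevation) but not the fitted values.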

Exercise 4
Re-run the VIF analysis with the new variables.
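A sketch of the before-and-after comparison on simulated data. The column names, the interaction term, and the helper vif_from_glm are all assumptions made for illustration, not the solution's code.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))
ants$cLatitude  <- ants$Latitude  - mean(ants$Latitude)
ants$cElevation <- ants$Elevation - mean(ants$Elevation)

# Hypothetical helper: VIF from the correlation matrix of the coefficient
# estimates, with the intercept dropped (one-degree-of-freedom terms only).
vif_from_glm <- function(model) {
  V <- vcov(model)[-1, -1]
  diag(solve(cov2cor(V)))
}

m_raw     <- glm(Srich ~ Habitat + Latitude * Elevation,
                 family = poisson, data = ants)
m_centred <- glm(Srich ~ Habitat + cLatitude * cElevation,
                 family = poisson, data = ants)

vif_raw     <- vif_from_glm(m_raw)
vif_centred <- vif_from_glm(m_centred)
print(round(rbind(raw = unname(vif_raw), centred = unname(vif_centred)), 2))
```

The two models describe the same fitted surface; only the parameterisation, and therefore the inflation of the standard errors, differs.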

Exercise 5
Check for any influential data points or outliers using influence measures (Cook's distance) and plot them. If all values are less than 1, it is OK to proceed.
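Base R provides cooks.distance() for fitted GLMs. The following sketch, on simulated data with assumed column names and an assumed additive model, plots the distances against the threshold of 1 mentioned above.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# One Cook's distance per observation.
cd <- cooks.distance(m)

plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 1, lty = 2)          # threshold used in the exercise

# Indices of observations above the threshold (empty if none).
influential <- which(cd > 1)
print(influential)
```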

Exercise 6
Check for overdispersion. The dispersion statistic needs to be around 1 to go to the next step.
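One common dispersion statistic is the sum of squared Pearson residuals divided by the residual degrees of freedom. Whether the solutions use this statistic or the residual deviance instead is not stated here, so treat the choice as an assumption. The sketch uses simulated data with assumed column names.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Pearson chi-square divided by the residual degrees of freedom.
pearson_chisq <- sum(residuals(m, type = "pearson")^2)
dispersion    <- pearson_chisq / df.residual(m)
print(round(dispersion, 2))
```

Values well above 1 point to overdispersion, in which case a quasi-Poisson or negative binomial model is the usual alternative.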

Exercise 7
Check the model summary. What can we infer?
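Because a Poisson GLM uses a log link, the coefficients in the summary are on the log scale. Exponentiating them gives multiplicative effects on the expected count, which are often easier to interpret. A sketch on simulated data (column names and model formula are assumptions):

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

print(summary(m))   # estimates, standard errors, z tests, deviances

# Multiplicative change in expected richness per unit change in a predictor.
rate_ratios <- exp(coef(m))
print(round(rate_ratios, 3))
```

A rate ratio below 1 means the expected count falls as the predictor increases; above 1 means it rises.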

Exercise 8
Since we have several variables, we use model averaging. The first step is to set the base R option that controls how missing values are handled. Then try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation, and habitat variables to produce the best model.
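The idea behind model averaging can be sketched by hand in base R: fit every subset of the predictors, convert the AIC values into Akaike weights, and sum the weights of the models that contain each predictor. This is an illustrative stand-in, not the solution's code. Packages such as MuMIn automate the procedure, and that the solutions use such a package is an assumption. The data are simulated and the column names are assumed.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

# Make model fitting fail on missing values, so that all candidate
# models are guaranteed to be fitted to the same rows.
old_opts <- options(na.action = "na.fail")

predictors <- c("Habitat", "Latitude", "Elevation")

# All 2^3 subsets of the predictors, including the intercept-only model.
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
names(subsets) <- predictors

fits <- lapply(seq_len(nrow(subsets)), function(i) {
  keep <- predictors[unlist(subsets[i, ])]
  rhs  <- if (length(keep)) paste(keep, collapse = " + ") else "1"
  glm(as.formula(paste("Srich ~", rhs)), family = poisson, data = ants)
})

aic    <- sapply(fits, AIC)
delta  <- aic - min(aic)
weight <- exp(-delta / 2) / sum(exp(-delta / 2))   # Akaike weights

# Sum of weights over the models that contain each predictor.
importance <- sapply(predictors, function(p) sum(weight[subsets[[p]]]))
print(round(importance, 3))

options(old_opts)   # restore the previous na.action setting
```

A summed weight close to 1 means the predictor appears in nearly all of the well-supported models.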

Exercise 9
Check the validation plots.
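Validation plots for a GLM usually mean residuals plotted against the fitted values and against each predictor, looking for leftover patterns. Which plots the solutions draw is not stated here, so the selection below is an assumption, as are the simulated data and column names.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

m   <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)
res <- residuals(m, type = "pearson")

op <- par(mfrow = c(2, 2))
plot(fitted(m), res, xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
plot(ants$Latitude, res, xlab = "Latitude", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
plot(ants$Elevation, res, xlab = "Elevation", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
boxplot(res ~ ants$Habitat, xlab = "Habitat", ylab = "Pearson residuals")
par(op)
```

Residuals scattered evenly around zero, with no trend or funnel shape, suggest the model structure is adequate.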

Exercise 10
Produce a base plot of the data and add the points of the predicted values.
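A sketch of such a plot in base graphics, using simulated data and assumed column names. predict() with type = "response" returns predictions on the count scale rather than the log scale.

```r
# Simulated stand-in for the ant data; all column names are assumptions.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Predicted species richness on the response (count) scale.
ants$Pred <- predict(m, type = "response")

# Observed values as filled circles, predictions as crosses.
plot(Srich ~ Latitude, data = ants, pch = 19,
     col = as.integer(ants$Habitat),
     xlab = "Latitude", ylab = "Species richness")
points(ants$Latitude, ants$Pred, pch = 4, col = as.integer(ants$Habitat))
legend("topright", legend = levels(ants$Habitat), col = 1:2, pch = 19)
```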

