by Joseph Rickert
Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a “go to” tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of its results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a little bit of tweaking, the basic iteratively reweighted least squares algorithm used to compute the maximum likelihood estimates can be made robust and stable enough to allow logistic regression to challenge specialized classifiers such as support vector machines.
It is relatively easy to figure out how to code a GLM in R. Even a total newcomer to R is likely to discover, within a minute or so of searching, that the glm() function ships with base R (in the stats package). Thereafter, though, it gets more difficult to find the other GLM-related tools that R has to offer. Here is a far from complete, but hopefully helpful, list of resources.
Online documentation that I have found helpful includes the contributed book by Virasakdi Chongsuvivatwong and the tutorials from Princeton and UCLA. Here is a slick visualization of a Poisson model from the Freakonometrics blog.
But finding introductory materials on GLMs is not difficult. Almost all of the many books on learning statistics with R have chapters on the GLM, including the classic Modern Applied Statistics with S, by Venables and Ripley, and one of my favorite texts, Data Analysis and Graphics Using R, by Maindonald and Braun. It is more of a challenge, however, to sort through the more than 5,000 packages on CRAN to find additional functions that could help with various specialized aspects of, or extensions to, the GLM. So here is a short list of GLM-related packages.
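For readers who have not yet tried it, a minimal sketch of a logistic regression with glm() might look like the following. The choice of the built-in mtcars data set and of the model am ~ wt is purely illustrative and is an assumption of this example, not something taken from the resources listed here.

```r
# Logistic regression with base R's glm(): model the probability that a car
# has a manual transmission (am = 1) as a function of its weight (wt).
# mtcars ships with R; the model itself is only an illustration.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Coefficient table with standard errors and Wald z statistics
print(summary(fit))

# Exponentiated coefficients can be read as odds ratios
print(exp(coef(fit)))

# Predicted probability of a manual transmission for a car with wt = 3
# (weight is recorded in units of 1000 lbs)
p_hat <- predict(fit, newdata = data.frame(wt = 3), type = "response")
print(p_hat)
```

Changing the family argument (for example to poisson or Gamma) is all it takes to fit other members of the GLM family with the same call.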
Packages to help with convergence and improve the fit
- glm2 implements a refinement to the iteratively reweighted least squares algorithm in order to help with convergence issues commonly associated with nonstandard link functions.
- brglm fits binomial response models with a bias reduction method
- safeBinaryRegression provides a function that overloads glm() to provide a test for the existence of the maximum likelihood estimates for binomial models
- pscl provides goodness of fit measures for GLMs
Packages for variable selection and regularization
- bestglm selects a “best” subset of input variables for GLMs using cross validation and various information criteria.
- glmnet provides functions to fit linear regression, binary logistic regression and multinomial regression with convex (lasso and elastic-net) penalties.
- penalized fits high-dimensional logistic and Poisson models with L1 and L2 penalties
Packages for special models
- mlogit fits multinomial logit models.
- lme4 provides functions to fit mixed-effects GLMs
- hglm fits hierarchical GLMs with both fixed and random effects
- glmmML provides functions to fit binomial and Poisson models with clustering.
- arm provides functions for Bayesian GLMs (Look here for a discussion of how Bayesian ideas can help with GLM problems.)
- bayesm contains functions for Bayesian GLMs including binary and ordinal probit, multinomial logit, multinomial probit models and more
- MCMCglmm provides functions to fit mixed GLMs using MCMC techniques
GLMs for Big Data
- The bigglm() function in the biglm package fits GLMs that are too big to fit into memory.
- The h2o package from 0xdata provides an R interface to H2O, including the h2o.glm function for fitting GLMs on Hadoop and other platforms
- speedglm fits GLMs to large data sets using an updating procedure.
- RevoScaleR (Revolution R Enterprise) provides parallel external memory algorithms for fitting GLMs on clusters, Hadoop, Teradata and other platforms
Generalized Additive Models (GAMs), which generalize GLMs
- gam provides functions to fit the Generalized Additive Model
- gamm4 fits mixed GAMs.
- mgcv provides functions to fit GAMs with multiple smoothing methods.
- VGAM provides functions to fit vector GLMs and GAMs.
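To see what "generalizing" a GLM means in practice, here is a small sketch that fits a Poisson GLM and a Poisson GAM to the same data using mgcv, which is distributed with R as a recommended package. The data are simulated and the sine-shaped mean function is an assumption made only for illustration.

```r
library(mgcv)  # recommended package that normally ships with R

# Simulated counts whose log-mean is a nonlinear (sine-shaped) function of x.
# The data-generating process is made up for this example.
set.seed(1)
n  <- 200
x  <- runif(n, 0, 1)
mu <- exp(1 + sin(2 * pi * x))
d  <- data.frame(x = x, y = rpois(n, mu))

# A GLM forces the effect of x to be linear on the log scale ...
glm_fit <- glm(y ~ x, family = poisson, data = d)

# ... whereas a GAM replaces the linear term with a smooth function s(x)
gam_fit <- gam(y ~ s(x), family = poisson, data = d)

# The smooth term should capture the curvature, giving a lower residual deviance
print(c(glm = deviance(glm_fit), gam = deviance(gam_fit)))
```

The family and link machinery is the same in both calls; only the linear predictor changes.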
Beyond the documentation and a list of packages that may be useful, it is also nice to have the benefit of some practical experience. John Mount has written prolifically about logistic regression in his Win-Vector Blog over the past few years. His post, How robust is logistic regression, is an illuminating discussion of the convergence issues surrounding Newton-Raphson/iteratively reweighted least squares. It contains pointers to examples illustrating the trouble caused by complete or quasi-complete separation, as well as links to the academic literature. This post is a classic, but all of the other posts in the series are very much worth the read.
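To make the iteration under discussion concrete, here is a bare-bones sketch of iteratively reweighted least squares for logistic regression in base R. The function name irls_logistic, its arguments, and the mtcars example are all hypothetical choices for this illustration. It is a textbook version of the update, not the code that glm() actually runs, and it has none of the safeguards (step halving, starting values based on the data) that production implementations add.

```r
# Textbook IRLS for logistic regression (hypothetical helper, for illustration).
# X: design matrix including an intercept column; y: 0/1 response vector.
irls_logistic <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))                  # start from all-zero coefficients
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)                # linear predictor
    mu  <- plogis(eta)                     # fitted probabilities
    w   <- mu * (1 - mu)                   # IRLS weights
    z   <- eta + (y - mu) / w              # working response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Compare with glm() on a well-behaved example (illustrative data choice)
X <- model.matrix(~ wt, data = mtcars)
beta_irls <- irls_logistic(X, mtcars$am)
beta_glm  <- coef(glm(am ~ wt, data = mtcars, family = binomial))
print(rbind(irls = beta_irls, glm = beta_glm))
```

Under complete or quasi-complete separation some fitted probabilities approach 0 or 1, the weights w shrink toward zero, and the weighted least squares step becomes ill-conditioned, which is one way to see why this simple loop gets into trouble.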
Finally, as a reminder of the trouble you can get into interpreting t-values from a GLM, here is another classic, a post from the S-News archives on the Hauck-Donner phenomenon.
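A small constructed example shows the kind of trouble involved. In the sketch below the data are made up: every observation with x = 1 is a success, which is the extreme (quasi-complete separation) end of the situation in which Wald statistics become unreliable. The Wald test and the likelihood ratio test then give opposite answers about the same coefficient.

```r
# Made-up data: 50 observations with x = 0 (25 successes) and
# 50 observations with x = 1 (all successes).
x <- rep(c(0, 1), each = 50)
y <- c(rep(c(0, 1), each = 25), rep(1, 50))

# glm() may warn about fitted probabilities near 0 or 1 here; the warning is
# suppressed only to keep the output short.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Wald test: the estimate is large but its standard error is enormous,
# so the z statistic is close to zero and the p-value is close to one.
print(coef(summary(fit)))
wald_p <- coef(summary(fit))["x", "Pr(>|z|)"]

# Likelihood ratio test: the drop in deviance from adding x is large.
lr_stat <- fit$null.deviance - fit$deviance
lrt_p   <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(wald_p = wald_p, lrt_p = lrt_p))
```

When the two tests disagree this sharply, the likelihood ratio test is generally the one to trust.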