What does a generalized linear model do?
What does a generalized linear model do? R supplies a modeling function called glm()
that fits generalized linear models (abbreviated as GLMs). A natural question is: what does it do, and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques.
For predicting a categorical outcome (such as y = true/false) it is often advised to use a form of GLM called logistic regression instead of a standard linear regression. The obvious question is: what does logistic regression actually do? We will explain what problem logistic regression is trying to solve.
Consider the following trivial artificial R data set:
> d <- read.table(file='http://www.win-vector.com/dfiles/glmLoss/dGLMdat.csv',
header=T,sep=',')
> d
x1 x2 y
1 0.09793624 -0.50020073 FALSE
2 -0.54933361 0.00834841 TRUE
3 0.18499020 -0.79325364 TRUE
4 0.58316450 2.06501637 TRUE
5 0.09607855 0.42724062 TRUE
6 -0.44772937 0.23267758 FALSE
7 1.24981165 -0.24492799 TRUE
8 0.13378532 -0.21985529 TRUE
9 0.41987141 -0.63677825 FALSE
10 1.28558918 1.37708143 FALSE
11 0.32590303 0.90813181 TRUE
12 0.01148262 -1.35426485 FALSE
13 -0.98502686 1.85317024 TRUE
14 -0.23017795 -0.06923035 FALSE
15 1.29606888 -0.80930538 TRUE
16 0.31286797 0.21319610 TRUE
17 0.03766960 -1.13314348 TRUE
18 0.03662855 0.67440240 FALSE
19 1.62032558 -0.57165979 TRUE
20 -0.63236983 -0.30736577 FALSE
>
We can use R to fit a linear model that uses x1 and x2 to try and predict y:
> lm(y~x1+x2,data=d)
Call:
lm(formula = y ~ x1 + x2, data = d)
Coefficients:
(Intercept) x1 x2
0.55548 0.16614 0.07599
Behind the scenes R encodes y=true as 1 and y=false as 0 (this is called encoding with an indicator). Then a standard linear regression is performed to try and predict these values. A linear model (in this case) is a formula of the form a + b*x1 + c*x2 = y. Standard least squares linear regression picks a, b and c to minimize the sum of (a + b*x1 + c*x2 - y)^2 over all of the training data. A standard result is that this "minimum square loss" linear regression is also (under fairly general conditions) a "maximum likelihood model" (the model under which the training data is least surprising).
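As a quick concrete check of that description (a minimal sketch reusing the data frame d loaded above; yInd, X and beta are our own helper names), we can build the 0/1 indicator by hand and solve the least squares normal equations directly; the result matches the lm() coefficients shown above:
yInd <- as.numeric(d$y)                    # encode y = TRUE/FALSE as a 0/1 indicator by hand
X <- cbind(1, d$x1, d$x2)                  # design matrix: intercept column, x1 and x2
beta <- solve(t(X) %*% X, t(X) %*% yInd)   # least squares via the normal equations (X'X) b = X'y
beta                                       # about 0.555, 0.166, 0.076, as in the lm() output above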
For general data this has the slight disadvantage that the model can be penalized for being "too right". Predicting "2" when the answer is "1" is considered (in terms of training error) to be just as bad as predicting "0" when the answer is "1." Also it is well known that truly binomially distributed data is heteroscedastic: in a situation where our binomial outcome has probability p of coding to 1, the indicator has variance p*(1-p). Thus if p is varying over our data (which is one of the assumptions driving the attempt to build any sort of model), then the expected observation errors are a function of the observation values (which causes problems both when fitting and when attempting to interpret coefficient significances). A first-pass fix is to apply a variance-stabilizing transform like sqrt() or arcsinh() (see for example: "The transformation of Poisson, binomial and negative-binomial data", F. J. Anscombe, Biometrika, 1948, vol. 35 (3), pp. 246-254), which have the remarkable property that they can make the observed variance nearly constant over a wide range of p (and therefore nearly independent of the expectation, as we want). However, even with these corrections, when fitting a linear model to a categorical outcome you are still overly dependent on the details of how you encoded that outcome as an indicator.
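The heteroscedasticity claim is easy to see in a tiny simulation (a sketch of our own, not part of the original data): the variance of a 0/1 indicator with success probability p is p*(1-p), so it changes as p changes.
set.seed(2012)                                  # arbitrary seed, just for reproducibility
for (p in c(0.1, 0.3, 0.5)) {
   yInd <- rbinom(100000, size = 1, prob = p)   # 0/1 indicators with success probability p
   cat('p =', p, ' sample variance =', round(var(yInd), 4), ' p*(1-p) =', p * (1 - p), '\n')
}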
The next thing to try is a generalized linear model. A generalized linear model (in this case) fits s(a + b*x1 + c*x2) = y. For logistic regression we have s(z) = 1/(1+e^{-z}). This s(), called the sigmoid, has the following desirable properties:
- s(-Infinity) = 0
- s(Infinity) = 1
- s(0) = 1/2
- s(-x) = 1 - s(x)
- x > y implies s(x) > s(y)
- s'(x) = s(x) * (1 - s(x))
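These properties are easy to check numerically (a minimal sketch; sig() is just the sigmoid written out in R under a name of our own choosing):
sig <- function(z) { 1/(1+exp(-z)) }      # the sigmoid, written out
sig(0)                                    # 1/2
sig(3) + sig(-3)                          # 1, since sig(-x) = 1 - sig(x)
c(sig(-50), sig(50))                      # essentially 0 and 1
x <- 0.7
(sig(x + 1e-6) - sig(x - 1e-6)) / 2e-6    # numeric derivative at x ...
sig(x) * (1 - sig(x))                     # ... agrees with sig(x)*(1 - sig(x))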
So we certainly can no longer over-run our target y: s(x) ≤ 1 for all x, so we can’t ever be penalized for returning a prediction of 2. Thus all of the degrees of freedom or error budget can be used during fitting to try and address other (more important) fitting problems. To perform a logistic regression in R we just do the following:
m <- glm(y~x1+x2,data=d,family=binomial(link='logit'))
summary(m)
What this does is fit a maximum likelihood model to our data: a model that supplies a probability for each datum such that the product of all the predicted probabilities is least surprising (so the model tends to predict values near 1 on y=true examples and values near 0 on y=false examples). As discussed in the simpler derivation of logistic regression, this is equivalent to finding the a, b and c that maximize the product built by multiplying in a term of the form s(a + b*x1 + c*x2) for each positive example and a term of the form (1 - s(a + b*x1 + c*x2)) for each negative example. Or in equations: the maximum likelihood solution picks a, b and c to maximize the following product over all training data:
   [product over y=true examples of s(a + b*x1 + c*x2)] * [product over y=false examples of (1 - s(a + b*x1 + c*x2))]
That is: it picks the coefficients a, b and c so the training data is most consistent with the model (or least shocking given the model).
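We can check this description directly (a minimal sketch; negLogLik() and fit are our own names, not part of glm()): maximizing the log of the product above with a general-purpose optimizer recovers essentially the same coefficients glm() reports.
negLogLik <- function(x) {                   # x = c(a, b, c)
   z <- x[1] + x[2]*d$x1 + x[3]*d$x2
   pr <- 1/(1 + exp(-z))                     # s(a + b*x1 + c*x2)
   -sum(ifelse(d$y, log(pr), log(1 - pr)))   # minus the log of the likelihood product
}
fit <- optim(c(0, 0, 0), negLogLik, method='BFGS')
fit$par                                      # closely matches the glm() coefficients
m$coefficients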
Notice this is not the same as minimizing the square loss (as we did for linear regression), which would instead pick a, b and c to minimize:
   sum over all examples of (s(a + b*x1 + c*x2) - y)^2
One great feature of linear regression is that, under fairly general conditions, the maximum likelihood solution is also the minimum square-loss solution. This breaks down for models of other forms (like logistic regression). We will continue our numeric example just to show the two solutions are not the same for logistic regression. The reason we work an example is that statistics in general is hard to write about (when you write about it you are managing many different concerns: data, model form and error), and statistical write-ups are even harder to read (a lot of statistical writing claims to solve all problems at once, which means authors tend not to admit to choices and trade-offs). So instead of trying to think through all possible intents we work a toy example (and given that we run into complications even in the toy example, you can be assured things are more complicated still once we expose these techniques to real data).
Because many people (sometimes including this author) mis-remember GLMs as minimizing a loss or a transformed loss, we continue with our worked example. Note that the balance equations (the conditions obtained by setting the gradient of the logarithm of the likelihood to zero, discussed in the simpler derivation of logistic regression) are obeyed by R's solution:
d$s <- predict(m,type='response')   # store the fitted probabilities s(a + b*x1 + c*x2)
sum(with(d,2*(s-y)))                # balance check for the intercept a
sum(with(d,2*(s-y)*x1))             # balance check for b
sum(with(d,2*(s-y)*x2))             # balance check for c
gives us:
[1] -1.050271e-12
[1] -1.173884e-12
[1] -7.624248e-13
all very near zero (as expected).
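For reference, here is where those checks come from (a sketch of the derivation; write z_i = a + b*x1_i + c*x2_i and s_i = s(z_i); the factor of 2 in the code above is just a harmless scaling). The log-likelihood is
\log L(a,b,c) = \sum_i \big[ y_i \log s_i + (1 - y_i) \log(1 - s_i) \big]
and, using s'(z) = s(z)(1 - s(z)),
\frac{\partial \log L}{\partial a} = \sum_i (y_i - s_i), \quad \frac{\partial \log L}{\partial b} = \sum_i (y_i - s_i)\,x1_i, \quad \frac{\partial \log L}{\partial c} = \sum_i (y_i - s_i)\,x2_i .
Setting these three sums to zero gives exactly the quantities checked above (up to sign and the factor of 2).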
However, the square loss has its own gradient, which implies its own balance equations, and these new square-loss balance equations are not obeyed. The new checks (which are just the derivatives of the square loss with respect to each of the parameters a, b and c) are:
sum(with(d,2*(s-y)*s*(1-s)))      # square-loss gradient with respect to a
sum(with(d,2*(s-y)*s*(1-s)*x1))   # square-loss gradient with respect to b
sum(with(d,2*(s-y)*s*(1-s)*x2))   # square-loss gradient with respect to c
and are not that close to zero:
[1] -0.05593889
[1] -0.09833072
[1] -0.2266276
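Again for reference, these checks are just the chain rule applied to the square loss (same notation as in the derivation above; the factor of 2 comes from differentiating the square):
\frac{\partial}{\partial b} \sum_i \big(s_i - y_i\big)^2 = \sum_i 2\,(s_i - y_i)\,s_i\,(1 - s_i)\,x1_i
and similarly for a and c (with x1_i replaced by 1 and x2_i respectively); these are exactly the sums computed above.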
We can further confirm that the square-loss optimal solution is different from the maximum likelihood solution by going ahead and finding the square-loss solution (so we can compare it to the maximum likelihood solution we already have). Define a function f() that computes the square loss of a given coefficient estimate:
s <- function(z) { 1/(1+exp(-z)) }                     # the sigmoid
p <- function(x) { s(x[1] + x[2]*d$x1 + x[3]*d$x2) }   # predicted probabilities for coefficients x = c(a, b, c)
f <- function(x) {                                     # square loss of the coefficient estimate x
   err <- p(x) - d$y
   sum(err*err)
}
We confirm that this computes the square-loss correctly for the logistic solution:
f(m$coefficients)
m$coefficients
sum(with(d,(y-s)*(y-s)))
yields (as expected):
(Intercept) x1 x2
0.2415262 0.7573462 0.3529519
[1] 4.416638
[1] 4.416638
We now find the square-loss optimal coefficients and plug them in, showing the lower square loss:
opt <- optim(m$coefficients, f, method='CG')   # the default Nelder-Mead optimizer failed to find the optimum here, so we use conjugate gradients
opt$par
f(opt$par)
yielding:
(Intercept) x1 x2
0.3156930 1.3125616 0.8139712
[1] 4.34706
which is a lower total square loss than our original 4.42. This confirms the two solutions (the one that maximizes likelihood and the one that minimizes square loss) are not the same for logistic regression.
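Conversely, since glm() maximizes the likelihood, the square-loss optimal coefficients must have a worse (lower) log-likelihood. A quick check, reusing p() and opt from above (logLikAt() is our own helper; logLik() is base R's extractor for the fitted glm):
logLikAt <- function(x) { pr <- p(x); sum(ifelse(d$y, log(pr), log(1 - pr))) }
logLikAt(m$coefficients)   # agrees with as.numeric(logLik(m))
logLikAt(opt$par)          # lower (worse) likelihood, as expected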
The overall summary is: you can first try linear regression. If this is not appropriate for your problem you can then try pre-transforming your y-data (a log-like or logit transform) and seeing if that fits better. However, if you transform your y-data you are using a new error model (in the transformed space, such as log(y)-units instead of y-units; this can be better or worse depending on your situation). If this error model is not appropriate you can move on to a generalized linear model. However, the generalized linear model does not minimize square error in y-units; it maximizes data likelihood under the chosen model. The distinction is mostly technical, and maximum likelihood is often a good objective (so you should be willing to give up your original square-loss objective). If you want to go further still you can try a generalized additive model, which in addition to re-shaping the y distribution uses splines to learn re-shapings of the x-data.
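For completeness, a minimal generalized additive model sketch using the mgcv package (an assumption: mgcv is installed; k is kept small because this toy data set has only 20 rows; note that the s() in a gam formula is mgcv's smooth constructor, not the sigmoid we defined earlier):
library(mgcv)    # assumed installed; supplies gam() and the s() smooth constructor
rm(s)            # defensive: drop our sigmoid so it cannot mask the formula's s()
gm <- gam(y ~ s(x1, k = 4) + s(x2, k = 4), family = binomial, data = d)
summary(gm)
predict(gm, type = 'response')   # fitted probabilities, as with glm()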