Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a glm model object carries a copy of its training data by default. You can use the settings y=FALSE and model=FALSE to turn this off.

set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
d = data.frame(xN=rnorm(nrows),
xC=sample(c('a','b'),size=nrows,replace=TRUE))
d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5 d } # first show that model=F and y=F help reduce model size dTrain = synthFrame(1000) model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit')) model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE) model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE, model=FALSE) # # Estimate the object's size as the size of its serialization # length(serialize(model1, NULL)) # [1] 225251 length(serialize(model2, NULL)) # [1] 206341 length(serialize(model3, NULL)) # [1] 189562 dTest = synthFrame(100) p1 = predict(model1, newdata=dTest, type='response') p2 = predict(model2, newdata=dTest, type='response') p3 = predict(model3, newdata=dTest, type='response') sum(abs(p1-p2)) # [1] 0 sum(abs(p1-p3)) # [1] 0  So we see (as expected) that removing the training data from the model decreases the size of the model (as estimated by the size of its serialization), without affecting the model’s predictions. What happens when you increase the training data size? The size of the model (with y=FALSE and model=FALSE) should not grow. ndata = seq(from=0, to=100000, by=5000)[-1] # # A function to estimate the size of a model for # our synthetic problem, with a training set of size n # getModelSize = function(n) { data = synthFrame(n) model = glm(y~xN+xC,data=data,family=binomial(link='logit'), y=FALSE, model=FALSE) length(serialize(model, NULL)) } size1 = sapply(ndata, FUN=getModelSize) library(ggplot2) ggplot(data.frame(n=ndata, modelsize=size1), aes(x=n, y=modelsize)) + geom_point() + geom_line()  Lo and behold, we see that the model size still grows linearly in the size of the training data! The model objects are still holding something that is proportional to the size of the training data. Where? We can use our serialization trick to find the size of the individual components of a model: breakItDown = function(mod) { sapply(mod, FUN=function(x){length(serialize(x, NULL))}, simplify=T) }  Now let’s compare two models trained with datasets of different sizes (one ten times the size of the other). mod1 = glm(y~xN+xC,data=synthFrame(1000), family=binomial(link='logit'), y=FALSE, model=FALSE) c1 = breakItDown(mod1) mod2 = glm(y~xN+xC,data=synthFrame(10000), family=binomial(link='logit'), y=FALSE, model=FALSE) c2 = breakItDown(mod2) # For pretty-printing a vector to a vertical blog-friendly format: # return a string of vector formatted as a column with names # use cat to echo the value vfmtN = function(v) { width = max(sapply(names(v),nchar)) paste( sapply(1:length(v),function(i) { paste(format(names(v)[i], width=width), format(v[[i]])) }), collapse='\n') } cat(vfmtN(c1)) # coefficients 119 # residuals 18948 # fitted.values 18948 # effects 16071 # R 261 # rank 26 # qr 35261 # family 25160 # linear.predictors 18948 # deviance 30 # aic 30 # null.deviance 30 # iter 26 # weights 18948 # prior.weights 18948 # df.residual 26 # df.null 26 # converged 26 # boundary 26 # call 373 # formula 193 # terms 836 # data 16278 # offset 18 # control 140 # method 37 # contrasts 96 # xlevels 91 cat(vfmtN(c2)) # coefficients 119 # residuals 198949 # fitted.values 198949 # effects 160071 # R 261 # rank 26 # qr 359262 # family 25160 # linear.predictors 198949 # deviance 30 # aic 30 # null.deviance 30 # iter 26 # weights 198949 # prior.weights 198949 # df.residual 26 # df.null 26 # converged 26 # boundary 26 # call 373 # formula 193 # terms 836 # data 160278 # offset 18 # control 140 # method 37 # contrasts 96 # xlevels 91  Look carefully, and you will see that certain objects in the glm model are large, and growing with data size. r = c2/c1 cat(vfmtN(r)) # coefficients 1 # residuals 10.49974 # fitted.values 10.49974 # effects 9.960239 # R 1 # rank 1 # qr 10.18865 # family 1 # linear.predictors 10.49974 # deviance 1 # aic 1 # null.deviance 1 # iter 1 # weights 10.49974 # prior.weights 10.49974 # df.residual 1 # df.null 1 # converged 1 # boundary 1 # call 1 # formula 1 # terms 1 # data 9.846296 # offset 1 # control 1 # method 1 # contrasts 1 # xlevels 1 cat(vfmtN(r[r>1])) # residuals 10.49974 # fitted.values 10.49974 # effects 9.960239 # qr 10.18865 # linear.predictors 10.49974 # weights 10.49974 # prior.weights 10.49974 # data 9.846296  Now strictly speaking, all you need to know to apply a glm model are the coefficients of the model, and the appropriate link function. All the other things the glm model object carries around are for the purpose of characterizing the model. An example would be calculating coefficient significances (and really, for most purposes, one could just calculate the quantities one wants to know, save those, and throw the data away — but we’re here to discuss R as it is, not as it should be). Once you’ve examined a model and decided that it’s satisfactory, all you probably want to do is predict. So let’s try trimming all those large objects away. cleanModel1 = function(cm) { # just in case we forgot to set # y=FALSE and model=FALSE cm$y = c()
cm$model = c() cm$residuals = c()
cm$fitted.values = c() cm$effects = c()
cm$qr = c() cm$linear.predictors = c()
cm$weights = c() cm$prior.weights = c()
cm$data = c() cm } cm1 = cleanModel1(mod1) cm2 = cleanModel1(mod2) dTest = synthFrame(100) p1=predict(cm1, newdata=dTest, type='response') # FAILS # Error in qr.lm(object) : lm object does not have a proper 'qr' component. # Rank zero or should not have used lm(.., qr=FALSE).  Ooops. We can’t null out the qr member of the model object if we want to predict. Incidentally, this is related to the observation that if you try to call lm(...., y=FALSE, model=FALSE, qr=FALSE), the result is a model object that fails to either predict or summarize. Don’t ask me why qr=FALSE is even an option. But back to the glm. What’s in the model’s qr field? breakItDown(mod1$qr)
# qr  rank qraux pivot   tol
# 35042    26    46    34    30

breakItDown(mod2$qr) # qr rank qraux pivot tol # 359043 26 46 34 30  It turns out that we don’t actually need model’s qr$qr to predict, so let’s trim just that away:

cleanModel2 = function(cm) {
cm$y = c() cm$model = c()

cm$residuals = c() cm$fitted.values = c()
cm$effects = c() cm$qr$qr = c() cm$linear.predictors = c()
cm$weights = c() cm$prior.weights = c()
cm$data = c() cm } # More reduction in model size length(serialize(mod2, NULL)) # [1] 1701600 cm2 = cleanModel2(mod2) length(serialize(cm2, NULL)) # [1] 27584 # And prediction works, too resp.full = predict(mod2, newdata=dTest, type="response") resp.cm = predict(cm2, newdata=dTest, type="response") sum(abs(resp.full-resp.cm)) # [1] 0  Are we done? getModelSize = function(n) { data = synthFrame(n) model = cleanModel2(glm(y~xN+xC,data=data, family=binomial(link='logit'), y=FALSE, model=FALSE)) length(serialize(model, NULL)) } size2 = sapply(ndata, FUN=getModelSize) ggplot(data.frame(n=ndata, modelsize=size2), aes(x=n, y=modelsize)) + geom_point() + geom_line()  The models are substantially smaller than when we started, but they still grow with training data size. A rough explanation for this is that glm hides pointers to the environment and things from the environment deep in many places. We didn’t notice this when we built models in the global environment because all those pointers pointed to the same things, so even though the models are much bigger than they need to be, they are all “too big” by the same amount, and hence don’t appear to grow as the training data grows. But when you build the models in a function (as we did in getModelSize(), you get more transient environments that are proportional to the size of the training data — and so model size grows with training data size. This isn’t going to seem clear, because it depends on a lot of complicated implementation details (for a taste of how complicated it can get, see here). After much trial and error, this is the set of fields and attributes of the model that we found were growing with data size, and that we could eliminate without breaking predict(). stripGlmLR = function(cm) { cm$y = c()
cm$model = c() cm$residuals = c()
cm$fitted.values = c() cm$effects = c()
cm$qr$qr = c()
cm$linear.predictors = c() cm$weights = c()
cm$prior.weights = c() cm$data = c()

cm$family$variance = c()
cm$family$dev.resids = c()
cm$family$aic = c()
cm$family$validmu = c()
cm$family$simulate = c()
attr(cm$terms,".Environment") = c() attr(cm$formula,".Environment") = c()

cm
}

getModelSize = function(n) {
data = synthFrame(n)
model = stripGlmLR(glm(y~xN+xC,data=data,
y=FALSE, model=FALSE))
length(serialize(model, NULL))
}

size3 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size3), aes(x=n, y=modelsize)) +
geom_point() + geom_line()


Yahoo! It worked! The models are constant size with respect to training data size. And prediction works.

cm2 = stripGlmLR(mod2)
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0


Comparing the size of the final stripped-down models (in variable size3 in the demonstration code) to the originals (size1), we find that the final model is 3/10th of a percent the size of the original model for small (n=5000) training data sets, and 0.015% the size of the original model for “large” (n=100,000) data sets. That’s a heckuva savings. And we probably haven’t gotten rid of all the unnecessary baggage, but at least we seem to have stopped the growth. This was enough trimming to accomplish our task for the client (producing working models that stored and loaded quickly), so we stopped.

One point and one caveat. You can null out model\$family entirely; the predict function will still return its default value, the link value (that is, predict(model, newdata=data)) will work). However, predict(model, newdata=data, type='response') will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). I didn’t test type=terms.

The caveat: many of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version: in particular summary(), anova() and step(). So any characterization that you want to do on a candidate model should be done before trimming down the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions.

• You can trim lm and gam models in a similar way, too. The exact fields to trim are a bit different. We will leave this as an exercise for the reader.
• We are aware of the bigglm package, for fitting generalized linear models to big data. We didn’t test it, but I would imagine that it doesn’t have this problem. Note, though that the problem here isn’t the size of the training data per se (which is only of moderate size); it’s the inordinate size of the resulting model.