# Is the size of your lm model causing you headaches?

*This article was first published on **Oracle R Enterprise** and kindly contributed to R-bloggers.*

If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

Why might you care about size? The most obvious reason is that the size of R objects impacts the amount of RAM available for further R processing or for loading more data. However, it also has implications for how much space is required to save the model, or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model, or even receive the following error:

```
Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
  ORA-20000: RQuery error
  Error : serialization is too large to store in a raw vector
```

If you get this error, there are at least a few options:

- Perform summary component access, such as coefficients, inside the embedded R function and return only what is needed
- Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
- Reduce the size of the model by eliminating large and unneeded components
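As a minimal sketch of the first option, the idea is to perform the component access inside the function that builds the model and return only that component. The snippet below is illustrative only: it uses the built-in mtcars data and a hypothetical coef_only helper rather than the ONTIME_S setup, and in ORE the function body would run via embedded R execution (e.g., ore.tableApply):

```r
# Hypothetical sketch: build the model where the data lives and return
# only the coefficients, not the whole lm object.
coef_only <- function(dat) coef(lm(mpg ~ wt + hp, data = dat))

coef_only(mtcars)
```

Returning a three-element numeric vector instead of the full model avoids serializing any of the large components discussed below.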

In this blog post, we focus on the third approach: we look at the size of lm model components, what you can do to control lm model size, and the implications of doing so. With vanilla R, model objects serve as the "memory" that makes a model build repeatable, so models tend to be populated with the data used to build them. When working with database tables, this "memory" is not needed, because governance mechanisms are already in place to ensure either that the data does not change or that logs record what changes took place. Hence, it is unnecessary to store the data used to build the model in the model object itself.

An lm model consists of several components, for example: coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, and na.action.

Some of these components may appear deceptively small using R's object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is the same data pulled to the client R engine using ore.pull and is ~39.4MB.

*Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.*

```r
f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.1 <- f.lm.1(ontime_s)
object.size(lm.fit.1)
54807720 bytes
```

Using the object.size function on the resulting model, the size is about 55MB. If we only score data with this model, that seems like a lot of bloat for the few coefficients presumably needed for scoring. Moving this object over a network will also not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and what space is actually required to store the model or time needed to move it across a network, is the R serialize function.

```r
length(serialize(lm.fit.1, NULL))
[1] 65826324
```

Notice that the size reported by serialize is different from that of object.size: a difference of ~11MB, about 20% greater.
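The gap between the two measurements is easy to reproduce on any machine. Here is a self-contained sketch using the built-in mtcars data in place of ontime_s (the exact byte counts vary by R version and platform, so none are shown):

```r
# object.size gives a shallow in-memory estimate; serialize measures the
# bytes actually written when the object is saved or sent over a network.
fit <- lm(mpg ~ wt + hp, data = mtcars)
mem_bytes  <- as.numeric(object.size(fit))
wire_bytes <- length(serialize(fit, NULL))
c(object.size = mem_bytes, serialized = wire_bytes)
```

For models built inside functions on large data, wire_bytes can dwarf mem_bytes, for reasons explored below.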

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

```r
lapply(lm.fit.1, object.size)
$coefficients
424 bytes

$residuals
13769600 bytes

$effects
3442760 bytes

$rank
48 bytes

$fitted.values
13769600 bytes

$assign
56 bytes

$qr
17213536 bytes

$df.residual
48 bytes

$na.action
287504 bytes

$xlevels
192 bytes

$call
1008 bytes

$terms
4432 bytes

$model
6317192 bytes
```

The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

The lm function provides arguments to control some aspects of model size, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, other components also contribute heavily to model size.

```r
f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)
lm.fit.2 <- f.lm.2(ontime_s)
length(serialize(lm.fit.2, NULL))
[1] 51650410
object.size(lm.fit.2)
31277216 bytes
```

The resulting serialized model size is down to ~52MB, which is not significantly smaller than the full model. The difference from the size reported by object.size, however, is now ~20MB, or 39%.

Does removing these components have any effect on the usefulness of an lm model? We'll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on lm.fit.2, the following error results:

```r
summary(lm.fit.2)
Error in qr.lm(object) : lm object does not have a proper 'qr' component.
 Rank zero or should not have used lm(.., qr=FALSE).
```

The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above; the qr component is necessary for these functions. The function coef returns without error:

```r
coef(lm.fit.2)
 (Intercept)     DISTANCE     DEPDELAY
 0.225378249 -0.001217511  0.962528054
```

If only coefficients are required, these settings may be acceptable. However, as we've seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

```r
f.lm.3 <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
            data = dat, model=FALSE, qr=FALSE)
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}
lm.fit.3 <- f.lm.3(ontime_s)
length(serialize(lm.fit.3, NULL))
[1] 24089000
object.size(lm.fit.3)
294968 bytes
```

Expecting a small model, we might be surprised by the results above. The function object.size reports ~295KB, but serializing the model shows ~24MB, a difference of 23.8MB, or 98.8%. What happened? We'll get to that in a moment. First, let's explore what effect nullifying these additional components has on the model.

To answer this, we'll turn model and qr back on and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it also produces an odd F-statistic with a warning:

```
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
```

The function anova produces invalid F values and residual statistics, clarifying with a warning:

```
Warning message:
In anova.lm(mod) :
  ANOVA F-tests on an essentially perfect fit are unreliable
```

Otherwise, both predict and coef work fine.

If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:

```
Warning message:
In mean.default(f) : argument is not numeric or logical: returning NA
```

However, there are no adverse effects on the results of the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let's continue looking at each remaining component, not with object.size, but with serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

```r
as.matrix(lapply(lm.fit.3, function(x) length(serialize(x, NULL))))
             [,1]
coefficients 130
rank         26
assign       34
df.residual  26
na.action    84056
xlevels      55
call         275
terms        24004509
```

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, then summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. That environment contains the variable dat, which holds the original data of 219932 rows and 26 columns. R's serialize function includes the contents of this environment, hence the large model, whereas object.size ignores it.
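This capture is a general property of R formulas: a formula, and hence the terms derived from it, records the environment in which it was created. A minimal, self-contained illustration with the built-in mtcars data (the names f, fit, and fit2 are illustrative):

```r
# A formula created inside a function captures that function's evaluation
# frame, including its argument 'd' -- the entire data set.
f   <- function(d) lm(mpg ~ wt, data = d)
fit <- f(mtcars)
ls(envir = attr(fit$terms, ".Environment"))   # includes "d"

# A formula created at the top level captures the global environment
# instead, which serialize() stores only as a reference, not by value.
fit2 <- lm(mpg ~ wt, data = mtcars)
environmentName(attr(fit2$terms, ".Environment"))
```

This is why models built inside functions, as is typical with embedded R execution, serialize so much larger than the same model built at the prompt.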

```r
e <- attr(lm.fit.1$terms, ".Environment")
ls(envir = e)
[1] "dat"
d <- get("dat", envir = e)
dim(d)
[1] 219932     26
length(serialize(e, NULL))
[1] 38959319
object.size(e)
56 bytes
```

If we remove this object from the environment, the serialized model size also becomes small.

```r
e <- attr(lm.fit.1$terms, ".Environment")
rm(list = ls(envir = e), envir = e)
ls(envir = e)
character(0)
length(serialize(lm.fit.1, NULL))
[1] 85500
lm.fit.1

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

Coefficients:
(Intercept)     DISTANCE     DEPDELAY
   0.225378    -0.001218     0.962528
```

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the full model using the function f.lm.full.

```r
f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.full <- f.lm.full(ontime_s)
ls(envir = attr(lm.fit.full$terms, ".Environment"))
[1] "dat"
length(serialize(lm.fit.full, NULL))
[1] 65826324
```

We'll create the model, removing some components, as defined in the function f.lm.small:

```r
f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the environment associated with the local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn't include its contents
  parent.env(e) <- .GlobalEnv
  rm(list = ls(envir = e), envir = e)  # remove all objects from this environment
  mod
}
```

```r
lm.fit.small <- f.lm.small(ontime_s)
ls(envir = attr(lm.fit.small$terms, ".Environment"))
character(0)
length(serialize(lm.fit.small, NULL))
[1] 16219251
```
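The same slimming steps can be wrapped in a reusable helper. The sketch below introduces a hypothetical slim_lm function, demonstrated on the built-in mtcars data rather than ontime_s: it keeps qr and terms so that predict still works, drops the data-sized components, and empties the captured environment, with a guard so it never wipes the global environment:

```r
# Hypothetical helper generalizing the approach of f.lm.small (a sketch,
# not part of the original post).
slim_lm <- function(mod) {
  # drop the data-sized components; keep qr and terms so predict() works
  mod$model <- mod$fitted.values <- mod$residuals <- mod$effects <- NULL
  e <- attr(mod$terms, ".Environment")
  if (!identical(e, globalenv())) {
    # note: environments are shared by reference, so this also empties
    # the environment still attached to the original model object
    parent.env(e) <- globalenv()
    rm(list = ls(envir = e), envir = e)
  }
  mod
}

f   <- function(d) lm(mpg ~ wt + hp, data = d)
big <- f(mtcars)
sml <- slim_lm(big)
length(serialize(sml, NULL)) < length(serialize(big, NULL))   # TRUE
predict(sml, newdata = mtcars[1:3, ])                         # still works
```

Since residuals and effects are gone, summary and anova on the slimmed model suffer the caveats described earlier; predict and coef are unaffected.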

We can use the same function with embedded R execution.

```r
lm.fit.ere <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir = attr(lm.fit.ere$terms, ".Environment"))
character(0)
length(serialize(lm.fit.ere, NULL))
[1] 16219251
as.matrix(lapply(lm.fit.ere, function(x) length(serialize(x, NULL))))
              [,1]
coefficients  130
residuals     4624354
effects       3442434
rank          26
fitted.values 4624354
assign        34
qr            8067072
df.residual   26
na.action     84056
xlevels       55
call          245
terms         938
```

Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces the expected results:

```r
summary(lm.fit.ere)

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16
```

Using the model for prediction also produces the expected results.

```r
lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata = dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
                                    c("ARRDELAY", "DISTANCE", "DEPDELAY")])
head(lm.pred(dat.test, lm.fit.ere))
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635
```

As shown above, an lm model can become quite large. For at least some applications, several of these components are unnecessary, allowing the user to significantly reduce the size of the model, the space required to save it, and the time needed to transport it. Relying on Oracle Database to store the data, instead of the R model object, allows a further significant reduction in model size.
