Is the size of your lm model causing you headaches?

April 15, 2013

(This article was first published on Oracle R Enterprise, and kindly contributed to R-bloggers)

When you build an R lm model from data with a relatively
large number of rows, you may be surprised by just how large that lm model is and what impact it has on your
environment and application.

Why might you care about size? The most obvious is that the
size of R objects impacts the amount of RAM available for further R processing
or loading of more data. However, it also has implications for how much space
is required to save that model or the time required to move it around the
network. For example, you may want to move the model from the database server R
engine to the client R engine when using Oracle R Enterprise Embedded R
Execution. If the model is too large, you may encounter latency when trying to
retrieve the model or even receive the following error:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  ORA-20000: RQuery error
Error : serialization is too large to store in a raw vector

If you get this error, there are at least a few options:

  • Perform summary component
    access, like coefficients, inside the embedded R function and
    return only what is needed (see the sketch after this list)
  • Save the model in a database R
    datastore and manipulate that model at the database server to avoid
    pulling it to the client
  • Reduce the size of the
    model by eliminating large and unneeded components
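
To make the first two options more concrete, here is a rough sketch using Oracle R Enterprise embedded R execution and a datastore (the datastore name and the build.and.save helper are illustrative; ONTIME_S is the proxy object introduced below):

library(ORE)
# Option 1: build the model in the database-side R engine and return only the coefficients
coefs <- ore.pull(ore.tableApply(ONTIME_S,
           function(dat) coef(lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat))))

# Option 2: keep the model in a database R datastore instead of pulling it to the client
build.and.save <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
  ore.save(mod, name = "ontime_lm_ds")   # illustrative datastore name
  TRUE
}
ore.tableApply(ONTIME_S, build.and.save, ore.connect = TRUE)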

In this blog post, we focus on the third approach and
look at the size of lm model
components, what you can do to control lm model size, and the implications for
doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models
tend to be populated with the data used to build them to ensure model build repeatability.

When working with database tables, this "memory" is not needed because
governance mechanisms are already in place to ensure either that the data
does not change or that logs are available to know what changes took place.
Hence, it is unnecessary to store the data used to build the model in the model object.

An lm model consists of several components, for example:

coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, and model.

Some of these components may appear deceptively small using
R’s object.size function. The following script builds an lm model to help reveal what R
reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.
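
As a rough sketch of that setup (assuming an ORE connection has already been established with ore.connect):

library(ORE)
dim(ONTIME_S)                  # ore.frame proxy for the database table: 219932 rows, 26 columns
ontime_s <- ore.pull(ONTIME_S) # pull the data to the client R engine as a data.frame
object.size(ontime_s)          # roughly 39.4 MB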

Note: The results
reported below use R 2.15.2 on Windows. Serialization of some components in the lm model
has been improved in R 3.0.0, but the implications are the same.

f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
mod.lm.1 <- f.lm.1(ontime_s)   # model object names such as mod.lm.1 are placeholders
object.size(mod.lm.1)
54807720 bytes

Using the object.size function on the
resulting model, the size is about 55MB. If we are only scoring data with this model, that
seems like a lot of bloat for the few coefficients presumably needed for
scoring. Moving this object over a network will also not be instantaneous.
But is this the true size of the model?

A better way to determine just how big an object is, and
what space is actually required to store the model or time to move it across a
network, is the R serialize function:

length(serialize(mod.lm.1, NULL))
[1] 65826324

Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

object.size(mod.lm.1$coefficients)
424 bytes
object.size(mod.lm.1$residuals)
13769600 bytes
object.size(mod.lm.1$effects)
3442760 bytes
object.size(mod.lm.1$rank)
48 bytes
object.size(mod.lm.1$fitted.values)
13769600 bytes
object.size(mod.lm.1$assign)
56 bytes
object.size(mod.lm.1$qr)
17213536 bytes
object.size(mod.lm.1$df.residual)
48 bytes
object.size(mod.lm.1$na.action)
287504 bytes
object.size(mod.lm.1$xlevels)
192 bytes
object.size(mod.lm.1$call)
1008 bytes
object.size(mod.lm.1$terms)
4432 bytes
object.size(mod.lm.1$model)
6317192 bytes

The components residuals,
fitted.values, qr, model, and even na.action are large. Do we need all
these components?

The lm function
provides arguments to control some aspects of model size. This can be done, for
example, by specifying model=FALSE
and qr=FALSE. However, as we saw
above, there are other components that contribute heavily to model size.

f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)
mod.lm.2 <- f.lm.2(ontime_s)
length(serialize(mod.lm.2, NULL))
[1] 51650410
object.size(mod.lm.2)
31277216 bytes

The resulting serialized model size is down to about 52MB,
which is not significantly smaller than the full model. The difference from the result reported by object.size is now ~20MB, or 39% smaller.

Does removing these components have any effect on the usefulness
of an lm model? We’ll explore this using four commonly used
functions: coef, summary, anova, and predict.
If we try to invoke summary on this model, the following error results:

summary(mod.lm.2)
Error in qr.lm(object) : lm object does not have a proper ‘qr’ component.
 Rank zero or should not have used lm(.., qr=FALSE).

The same error results when we try to run anova. Unfortunately, the predict function also fails with the
error above. The qr component is
necessary for these functions. Function coef
returns without error:

coef(mod.lm.2)
 (Intercept)     DISTANCE     DEPDELAY
 0.225378249 -0.001217511  0.962528054
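
Since these three values are all that scoring strictly requires, prediction can even be done by hand from the coefficients; a minimal sketch (the score.by.hand helper is hypothetical, not part of the original post):

# hypothetical helper: score new rows using only the lm coefficients
score.by.hand <- function(coefs, newdata) {
  X <- cbind(1, as.matrix(newdata[, c("DISTANCE", "DEPDELAY")]))
  drop(X %*% coefs)   # intercept + DISTANCE and DEPDELAY terms
}
# e.g.: score.by.hand(coef(mod.lm.2), head(ontime_s))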

If only coefficients are
required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large
model. The really large components appear
to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove
them from the model.

f.lm.3 <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
            data = dat, model=FALSE, qr=FALSE)
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}
mod.lm.3 <- f.lm.3(ontime_s)
length(serialize(mod.lm.3, NULL))
[1] 24089000
object.size(mod.lm.3)
294968 bytes

Thinking the model size should now be small, we might be surprised
to see the results above. The function object.size reports ~295KB, but
serializing the model shows 24MB, a difference of 23.8MB, or 98.8%. What happened?
We’ll get to that in a moment. First,
let’s explore what effect nullifying these additional components has on the four functions listed above.

To answer this, we’ll
turn on model and qr, and focus on effects, residuals, and fitted.values.
If we nullify effects, the anova
results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics,
but it also produces an odd F-statistic with a warning:

Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type ‘NULL’

The function anova
produces invalid F values and residual statistics, clarifying with a warning:

Warning message:
In anova.lm(mod) :
  ANOVA F-tests on an essentially perfect fit are unreliable

Otherwise, both predict
and coef work fine.

If we nullify fitted.values,
summary produces an invalid F-statistic, issuing the warning:

Warning message:
In mean.default(f) : argument is not numeric or logical: returning NA

However, there are no adverse effects on the results of the
other three functions.
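
As a rough sketch of how these experiments can be reproduced (the mod.t object name is made up here; the post does not show this code explicitly):

mod.t <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ontime_s)   # keep model and qr this time
mod.t$residuals <- NULL                    # nullify one component, then re-test
summary(mod.t)                             # residual/coefficient statistics missing; odd F-statistic with a warning
anova(mod.t)                               # invalid F values, with the "essentially perfect fit" warning
predict(mod.t, newdata = head(ontime_s))   # unaffected
coef(mod.t)                                # unaffected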

Depending on what we need from our model, some of these
components could be eliminated. But let’s continue looking at each remaining
component, not with object.size,
but serialize. Below, we use lapply to compute the serialized length
of each model component. This reveals that the terms component is actually the
largest component, despite object.size
reporting only 4432 bytes above.

as.matrix(lapply(mod.lm.3, function(x) length(serialize(x,NULL))))

coefficients 130
rank         26
assign       34
df.residual  26
na.action    84056
xlevels      55
call         275
terms        24004509

If we nullify the terms
component, the model becomes quite compact. (By the way, if we simply nullify terms,
summary, anova, and predict all fail.) Why is the terms component so large? It
turns out it has an environment object as an attribute. That environment contains
the variable dat, which holds the original data with 219932 rows and 26 columns. R’s serialize function includes the contents of this environment, which is why the model is so large; the object.size function ignores them.
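
The same effect is easy to reproduce outside of lm; a small self-contained sketch (the objects here are made up for illustration):

f <- function() {
  big <- rnorm(1e6)   # ~8 MB vector living in the function's environment
  y ~ x               # the returned formula carries that environment as an attribute
}
fo <- f()
object.size(fo)               # a few hundred bytes: environment contents are not counted
length(serialize(fo, NULL))   # several megabytes: 'big' is serialized along with the formula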

Examining the terms environment of our reduced model:

attr(mod.lm.3$terms, ".Environment")
<environment: 0x1d6778f8>
envir <- attr(mod.lm.3$terms, ".Environment")
ls(envir = envir)
[1] "dat"
d <- get("dat", envir = envir)
dim(d)
[1] 219932     26
length(serialize(attr(mod.lm.3$terms, ".Environment"), NULL))
[1] 38959319
object.size(attr(mod.lm.3$terms, ".Environment"))
56 bytes

If we remove this object from the environment, the serialized model also becomes quite small:

rm(list = ls(envir = attr(mod.lm.3$terms, ".Environment")),
   envir = attr(mod.lm.3$terms, ".Environment"))
ls(envir = attr(mod.lm.3$terms, ".Environment"))
character(0)
length(serialize(mod.lm.3, NULL))
[1] 85500
mod.lm.3

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

Coefficients:
(Intercept)     DISTANCE     DEPDELAY
   0.225378    -0.001218     0.962528

Is the associated environment essential to the model? If not, we could
empty it to significantly reduce model size. We’ll rebuild the full model using the function f.lm.full:

f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
mod.full <- f.lm.full(ontime_s)
ls(envir = attr(mod.full$terms, ".Environment"))
[1] "dat"
length(serialize(mod.full, NULL))
[1] 65826324

We’ll then create the model while removing some components, as defined in the function f.lm.small:

f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the environment associated with the local function
  e <- attr(mod$terms, ".Environment")
  # set the parent environment to .GlobalEnv so serialization doesn't include its contents
  parent.env(e) <- .GlobalEnv
  rm(list=ls(envir=e), envir=e)   # remove all objects from this environment
  mod
}
mod.small <- f.lm.small(ontime_s)
ls(envir=attr(mod.small$terms, ".Environment"))
character(0)
length(serialize(mod.small, NULL))
[1] 16219251

We can use the same function with embedded R execution:

mod.small <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir=attr(mod.small$terms, ".Environment"))
character(0)
length(serialize(mod.small, NULL))
[1] 16219251
as.matrix(lapply(mod.small, function(x) length(serialize(x,NULL))))
coefficients  130
residuals     4624354
effects       3442434
rank          26
fitted.values 4624354
assign        34
qr            8067072
df.residual   26
na.action     84056
xlevels       55
call          245
terms         938

Making this change does not affect the workings of the model
for coef, summary, anova, or predict.
For example, summary produces
expected results:


summary(mod.small)

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647,     Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

Using the model for prediction also produces the expected results:

lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata=dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s,
                 ontime_s[YEAR == 2003 & MONTH == 5,
                          c("ARRDELAY", "DISTANCE", "DEPDELAY")])
res <- lm.pred(dat.test, mod.small)
head(res, 10)
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635

As shown above, an lm model can become quite large. For at least some applications, several of these
components are unnecessary, allowing the user to significantly reduce the
size of the model and the space required for saving it, or the time required to transport it. Relying on Oracle Database to store the data instead of keeping it in the R model object further allows for a significant reduction in model size.
