# Is the size of your lm model causing you headaches?

*This article was first published on **Oracle R Enterprise** and kindly contributed to R-bloggers.*

If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

Why might you care about size? The most obvious reason is that the size of R objects impacts the amount of RAM available for further R processing or for loading more data. However, it also has implications for how much space is required to save the model, or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model, or even receive the following error:

```
Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
  ORA-20000: RQuery error
  Error : serialization is too large to store in a raw vector
```

If you get this error, there are at least a few options:

- Perform summary component access, such as coefficients, inside the embedded R function and return only what is needed
- Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
- Reduce the size of the model by eliminating large and unneeded components
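As a minimal sketch of the first option, the idea is to perform the component access inside the function that builds the model and return only that component. The snippet below is illustrative only: it uses the built-in mtcars data and a hypothetical coef_only helper rather than the ONTIME_S setup, and in ORE the function body would run via embedded R execution (e.g., ore.tableApply):

```r
# Hypothetical sketch: build the model where the data lives and return
# only the coefficients, not the whole lm object.
coef_only <- function(dat) coef(lm(mpg ~ wt + hp, data = dat))

coef_only(mtcars)
```

Returning a three-element numeric vector instead of the full model avoids serializing any of the large components discussed below.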

In this blog post, we focus on the third approach: we look at the size of lm model components, what you can do to control lm model size, and the implications of doing so. With vanilla R, model objects serve as the "memory" that makes a model build repeatable, so models tend to be populated with the data used to build them. When working with database tables, this "memory" is not needed, because governance mechanisms are already in place to ensure either that the data does not change or that logs record what changes took place. Hence, it is unnecessary to store the data used to build the model in the model object itself.

An lm model consists of several components, for example: coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, and na.action.

Some of these components may appear deceptively small using R's object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is the same data pulled to the client R engine using ore.pull and is ~39.4MB.

*Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.*

```r
f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.1 <- f.lm.1(ontime_s)
object.size(lm.fit.1)
54807720 bytes
```

Using the object.size function on the resulting model, the size is about 55MB. If we only score data with this model, that seems like a lot of bloat for the few coefficients presumably needed for scoring. Moving this object over a network will also not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and what space is actually required to store the model or time needed to move it across a network, is the R serialize function.

```r
length(serialize(lm.fit.1, NULL))
[1] 65826324
```

Notice that the size reported by serialize is different from that of object.size: a difference of ~11MB, about 20% greater.
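The gap between the two measurements is easy to reproduce on any machine. Here is a self-contained sketch using the built-in mtcars data in place of ontime_s (the exact byte counts vary by R version and platform, so none are shown):

```r
# object.size gives a shallow in-memory estimate; serialize measures the
# bytes actually written when the object is saved or sent over a network.
fit <- lm(mpg ~ wt + hp, data = mtcars)
mem_bytes  <- as.numeric(object.size(fit))
wire_bytes <- length(serialize(fit, NULL))
c(object.size = mem_bytes, serialized = wire_bytes)
```

For models built inside functions on large data, wire_bytes can dwarf mem_bytes, for reasons explored below.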

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

```r
lapply(lm.fit.1, object.size)
$coefficients
424 bytes

$residuals
13769600 bytes

$effects
3442760 bytes

$rank
48 bytes

$fitted.values
13769600 bytes

$assign
56 bytes

$qr
17213536 bytes

$df.residual
48 bytes

$na.action
287504 bytes

$xlevels
192 bytes

$call
1008 bytes

$terms
4432 bytes

$model
6317192 bytes
```

The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

The lm function provides arguments to control some aspects of model size, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, other components also contribute heavily to model size.

```r
f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)
lm.fit.2 <- f.lm.2(ontime_s)
length(serialize(lm.fit.2, NULL))
[1] 51650410
object.size(lm.fit.2)
31277216 bytes
```

The resulting serialized model size is down to ~52MB, which is not significantly smaller than the full model. The difference from the size reported by object.size, however, is now ~20MB, or 39%.

Does removing these components have any effect on the usefulness of an lm model? We'll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on lm.fit.2, the following error results:

```r
summary(lm.fit.2)
Error in qr.lm(object) : lm object does not have a proper 'qr' component.
 Rank zero or should not have used lm(.., qr=FALSE).
```

The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above; the qr component is necessary for these functions. The function coef returns without error:

```r
coef(lm.fit.2)
 (Intercept)     DISTANCE     DEPDELAY
 0.225378249 -0.001217511  0.962528054
```

If only coefficients are required, these settings may be acceptable. However, as we've seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

```r
f.lm.3 <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
            data = dat, model=FALSE, qr=FALSE)
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}
lm.fit.3 <- f.lm.3(ontime_s)
length(serialize(lm.fit.3, NULL))
[1] 24089000
object.size(lm.fit.3)
294968 bytes
```

Expecting a small model, we might be surprised by the results above. The function object.size reports ~295KB, but serializing the model shows ~24MB, a difference of 23.8MB, or 98.8%. What happened? We'll get to that in a moment. First, let's explore what effect nullifying these additional components has on the model.

To answer this, we'll turn model and qr back on and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it also produces an odd F-statistic with a warning:

```
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
```

The function anova produces invalid F values and residual statistics, clarifying with a warning:

```
Warning message:
In anova.lm(mod) :
  ANOVA F-tests on an essentially perfect fit are unreliable
```

Otherwise, both predict and coef work fine.

If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:

```
Warning message:
In mean.default(f) : argument is not numeric or logical: returning NA
```

However, there are no adverse effects on the results of the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let's continue looking at each remaining component, not with object.size, but with serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

```r
as.matrix(lapply(lm.fit.3, function(x) length(serialize(x, NULL))))
             [,1]
coefficients 130
rank         26
assign       34
df.residual  26
na.action    84056
xlevels      55
call         275
terms        24004509
```

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, then summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. That environment contains the variable dat, which holds the original data of 219932 rows and 26 columns. R's serialize function includes the contents of this environment, hence the large model, whereas object.size ignores it.
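This capture is a general property of R formulas: a formula, and hence the terms derived from it, records the environment in which it was created. A minimal, self-contained illustration with the built-in mtcars data (the names f, fit, and fit2 are illustrative):

```r
# A formula created inside a function captures that function's evaluation
# frame, including its argument 'd' -- the entire data set.
f   <- function(d) lm(mpg ~ wt, data = d)
fit <- f(mtcars)
ls(envir = attr(fit$terms, ".Environment"))   # includes "d"

# A formula created at the top level captures the global environment
# instead, which serialize() stores only as a reference, not by value.
fit2 <- lm(mpg ~ wt, data = mtcars)
environmentName(attr(fit2$terms, ".Environment"))
```

This is why models built inside functions, as is typical with embedded R execution, serialize so much larger than the same model built at the prompt.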

```r
e <- attr(lm.fit.1$terms, ".Environment")
ls(envir = e)
[1] "dat"
d <- get("dat", envir = e)
dim(d)
[1] 219932     26
length(serialize(e, NULL))
[1] 38959319
object.size(e)
56 bytes
```

If we remove this object from the environment, the serialized model size also becomes small.

```r
e <- attr(lm.fit.1$terms, ".Environment")
rm(list = ls(envir = e), envir = e)
ls(envir = e)
character(0)
length(serialize(lm.fit.1, NULL))
[1] 85500
lm.fit.1

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

Coefficients:
(Intercept)     DISTANCE     DEPDELAY
   0.225378    -0.001218     0.962528
```

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the full model using the function f.lm.full.

```r
f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.full <- f.lm.full(ontime_s)
ls(envir = attr(lm.fit.full$terms, ".Environment"))
[1] "dat"
length(serialize(lm.fit.full, NULL))
[1] 65826324
```

We'll create the model, removing some components, as defined in the function f.lm.small:

```r
f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the environment associated with the local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn't include its contents
  parent.env(e) <- .GlobalEnv
  rm(list = ls(envir = e), envir = e)  # remove all objects from this environment
  mod
}
```

```r
lm.fit.small <- f.lm.small(ontime_s)
ls(envir = attr(lm.fit.small$terms, ".Environment"))
character(0)
length(serialize(lm.fit.small, NULL))
[1] 16219251
```
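The same slimming steps can be wrapped in a reusable helper. The sketch below introduces a hypothetical slim_lm function, demonstrated on the built-in mtcars data rather than ontime_s: it keeps qr and terms so that predict still works, drops the data-sized components, and empties the captured environment, with a guard so it never wipes the global environment:

```r
# Hypothetical helper generalizing the approach of f.lm.small (a sketch,
# not part of the original post).
slim_lm <- function(mod) {
  # drop the data-sized components; keep qr and terms so predict() works
  mod$model <- mod$fitted.values <- mod$residuals <- mod$effects <- NULL
  e <- attr(mod$terms, ".Environment")
  if (!identical(e, globalenv())) {
    # note: environments are shared by reference, so this also empties
    # the environment still attached to the original model object
    parent.env(e) <- globalenv()
    rm(list = ls(envir = e), envir = e)
  }
  mod
}

f   <- function(d) lm(mpg ~ wt + hp, data = d)
big <- f(mtcars)
sml <- slim_lm(big)
length(serialize(sml, NULL)) < length(serialize(big, NULL))   # TRUE
predict(sml, newdata = mtcars[1:3, ])                         # still works
```

Since residuals and effects are gone, summary and anova on the slimmed model suffer the caveats described earlier; predict and coef are unaffected.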

We can use the same function with embedded R execution.

```r
lm.fit.ere <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir = attr(lm.fit.ere$terms, ".Environment"))
character(0)
length(serialize(lm.fit.ere, NULL))
[1] 16219251
as.matrix(lapply(lm.fit.ere, function(x) length(serialize(x, NULL))))
              [,1]
coefficients  130
residuals     4624354
effects       3442434
rank          26
fitted.values 4624354
assign        34
qr            8067072
df.residual   26
na.action     84056
xlevels       55
call          245
terms         938
```

Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces the expected results:

```r
summary(lm.fit.ere)

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16
```

Using the model for prediction also produces the expected results.

```r
lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata = dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
                                    c("ARRDELAY", "DISTANCE", "DEPDELAY")])
head(lm.pred(dat.test, lm.fit.ere))
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635
```

As shown above, an lm model can become quite large. For at least some applications, several of these components are unnecessary, allowing the user to significantly reduce the size of the model, the space required to save it, and the time needed to transport it. Relying on Oracle Database to store the data, instead of the R model object, allows a further significant reduction in model size.
