# New functions for linear model inference in Revolution R Enterprise 4.3

April 26, 2011
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The latest release of Revolution R Enterprise shows how Revolution Analytics’ package for big data, RevoScaleR, is continuing add new capabilities for Big Data statistics. RevoScaleR removes the limits on the size of the data that can be processed in R through the use of the highly efficient .Xdf binary file format. Xdf stores data by rows within columns and facilitates the use of external memory algorithms that process chunks of data. RevoScaleR functions operate directly from Xdf files so there is no need to read data into memory. The initial release of RevoScaleR included the functions rxLinMod (multiple regression) and rxLogit (binary logistic regression) enable users to build models using very large data sets. Both of these functions return model objects whose components can be graphed or fed into standard R functions so the integration with standard R is nearly seamless.

The new release of RevoScaleR includes a number of functions (Table 1) that complement rxLinMod and rxLogit and make it easier to do statistical inference.

Table 1: New RevoScaleR functions that complement rxLinMod and rxlogit

 rxCov returns the covariance matrix for a group of variables in an Xdf file rxCor returns the correlation matrix for a group of variables in an Xdf file rxSSCP returns the sum of squares or cross-product matrix for a group of variables in an Xdf file. When the variables are interpreted as the matrix X, that is the matrix of predictor variables in a regression model, the result of rxSSCP is the matrix X'X. rxCovCor a wrapper for the three functions above rxCovCoef returns the covariance matrix of the coefficients for a model built with rxLinMod or rxlogit. rxCorCoef returns the correlation matrix of the coefficients for a model built with rxLinMod or rxlogit. rxCovData returns the covariance matrix for the explanatory variables in model built with rxLinMod or rxLogit. It gives the same results as rxCov acting directly on the data rxCorData returns the correlation matrix for the explanatory variables in model built with rxLinMod or rxLogit. It gives the same results as rxCor acting directly on the data

To get a feel for what these new functions can do let’s look at some simple models built from the airlines data file that was used in the American Statistical Association’s 2009 Data Expo challenge. The ASA files contain information on all US domestic flights between 1987 and 2008. This data was read into a single Xdf file with over 123 million rows. The first example uses the new rxCor function to compute the correlation matrix of five variables: departure time, arrival time, air time, arrival delay and departure delay.

# Correlation Matrix
formula <- ~ DepTime + ArrTime + AirTime + ArrDelay + DepDelay
CorM <- rxCor(formula, data=working.file)
Computation time: 7.20 seconds.
CorM

DepTime    ArrTime      AirTime   ArrDelay    DepDelay
DepTime   1.00000000 0.72243775 -0.042412723 0.16885127 0.183462571
ArrTime   0.72243775 1.00000000  0.021413969 0.07656231 0.076315166
AirTime  -0.04241272 0.02141397  1.000000000 0.01650281 0.005269681
ArrDelay  0.16885127 0.07656231  0.016502811 1.00000000 0.887310003
DepDelay  0.18346257 0.07631517  0.005269681 0.88731000 1.000000000

The calculation took a little over 7 seconds on my 4 core (8 thread) 1.87 GHz Dell laptop out fitted with 8 GB of RAM running 64bit Windows & and Revolution R Enterprise 4.3.

Next, we build a small linear model with two independent variables and use rxCovCoef to compute the model covariance matrix. This process takes a little less than 5 seconds.

# Build a linear model
AD.model <- rxLinMod(ArrDelay ~ DepTime + ArrTime,data=working.file,covCoef=TRUE)
#Computation time: 4.67 seconds.
MCovM

(Intercept)       DepTime       ArrTime
(Intercept)  8.399560e-05 -2.193390e-06 -3.065595e-06
DepTime     -2.193390e-06  7.416598e-07 -5.257515e-07
ArrTime     -3.065595e-06 -5.257515e-07  6.780731e-07

In theory, this matrix is equal to the variance of the regression model, s, multiplied by the inverse of the square of the design matrix: (X’X)-1.  So dividing by and estimate of s2 should give us a pretty good estimate of (X’X)-1, a very useful item for doing statistical inference. A component of the model object produced by rxLinMod, AD.model$sigma, is a list whose first slot contains an estimator of s. So, the following code should produce and estimate of (X’X)-1. # Get an estimate of X prime X inverse sigma2 <- as.numeric(AD.model$sigma[[1]])^2
XprimeXinverse <- MCovM/sigma2
XprimeXinverse

(Intercept)       DepTime       ArrTime
(Intercept)  9.138475e-08 -2.386344e-09 -3.335278e-09
DepTime     -2.386344e-09  8.069041e-10 -5.720022e-10
ArrTime     -3.335278e-09 -5.720022e-10  7.377236e-10

To see how good of an estimate we have, let’s use function rxCovCoef to get an estimate of (X’X).

# Estimate of X prime X
formula <- ~ DepTime + ArrTime
XprimeX <- rxSSCP(formula,data=working.file)
#Computation time: 3.54 seconds.

XprimeX

Constant     DepTime     ArrTime
Constant  120950491  1654978671  1830027994
DepTime  1654978671 25397142048 27174128334
ArrTime  1830027994 27174128334 30699052023

Now, a little matrix multiplication and we should have the identity matrix.

XprimeX %*% XprimeXinverse

(Intercept)       DepTime       ArrTime
Constant  1.000031e+00 -3.650103e-07 -3.456415e-08
DepTime   2.359597e-04  1.000050e+00 -3.814792e-05
ArrTime  -2.516375e-05 -8.483162e-05  1.000103e+00

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...