The latest release of Revolution R Enterprise shows how Revolution Analytics’ package for big data, RevoScaleR, is continuing add new capabilities for Big Data statistics. RevoScaleR removes the limits on the size of the data that can be processed in R through the use of the highly efficient .Xdf binary file format. Xdf stores data by rows within columns and facilitates the use of external memory algorithms that process chunks of data. RevoScaleR functions operate directly from Xdf files so there is no need to read data into memory. The initial release of RevoScaleR included the functions rxLinMod (multiple regression) and rxLogit (binary logistic regression) enable users to build models using very large data sets. Both of these functions return model objects whose components can be graphed or fed into standard R functions so the integration with standard R is nearly seamless.
The new release of RevoScaleR includes a number of functions (Table 1) that complement rxLinMod and rxLogit and make it easier to do statistical inference.
Table 1: New RevoScaleR functions that complement rxLinMod and rxlogit
returns the covariance matrix for a group of variables in an Xdf file
returns the correlation matrix for a group of variables in an Xdf file
returns the sum of squares or cross-product matrix for a group of variables in an Xdf file. When the variables are interpreted as the matrix X, that is the matrix of predictor variables in a regression model, the result of rxSSCP is the matrix X'X.
a wrapper for the three functions above
returns the covariance matrix of the coefficients for a model built with rxLinMod or rxlogit.
returns the correlation matrix of the coefficients for a model built with rxLinMod or rxlogit.
returns the covariance matrix for the explanatory variables in model built with rxLinMod or rxLogit. It gives the same results as rxCov acting directly on the data
returns the correlation matrix for the explanatory variables in model built with rxLinMod or rxLogit. It gives the same results as rxCor acting directly on the data
To get a feel for what these new functions can do let’s look at some simple models built from the airlines data file that was used in the American Statistical Association’s 2009 Data Expo challenge. The ASA files contain information on all US domestic flights between 1987 and 2008. This data was read into a single Xdf file with over 123 million rows. The first example uses the new rxCor function to compute the correlation matrix of five variables: departure time, arrival time, air time, arrival delay and departure delay.
# Correlation Matrix formula <- ~ DepTime + ArrTime + AirTime + ArrDelay + DepDelay CorM <- rxCor(formula, data=working.file) Computation time: 7.20 seconds. CorM DepTime ArrTime AirTime ArrDelay DepDelay DepTime 1.00000000 0.72243775 -0.042412723 0.16885127 0.183462571 ArrTime 0.72243775 1.00000000 0.021413969 0.07656231 0.076315166 AirTime -0.04241272 0.02141397 1.000000000 0.01650281 0.005269681 ArrDelay 0.16885127 0.07656231 0.016502811 1.00000000 0.887310003 DepDelay 0.18346257 0.07631517 0.005269681 0.88731000 1.000000000
The calculation took a little over 7 seconds on my 4 core (8 thread) 1.87 GHz Dell laptop out fitted with 8 GB of RAM running 64bit Windows & and Revolution R Enterprise 4.3.
Next, we build a small linear model with two independent variables and use rxCovCoef to compute the model covariance matrix. This process takes a little less than 5 seconds.
# Build a linear model AD.model <- rxLinMod(ArrDelay ~ DepTime + ArrTime,data=working.file,covCoef=TRUE) #Computation time: 4.67 seconds. MCovM <- rxCovCoef(AD.model) MCovM (Intercept) DepTime ArrTime (Intercept) 8.399560e-05 -2.193390e-06 -3.065595e-06 DepTime -2.193390e-06 7.416598e-07 -5.257515e-07 ArrTime -3.065595e-06 -5.257515e-07 6.780731e-07
In theory, this matrix is equal to the variance of the regression model, s, multiplied by the inverse of the square of the design matrix: (X’X)-1. So dividing by and estimate of s2 should give us a pretty good estimate of (X’X)-1, a very useful item for doing statistical inference. A component of the model object produced by rxLinMod, AD.model$sigma, is a list whose first slot contains an estimator of s. So, the following code should produce and estimate of (X’X)-1.
# Get an estimate of X prime X inverse sigma2 <- as.numeric(AD.model$sigma[])^2 XprimeXinverse <- MCovM/sigma2 XprimeXinverse (Intercept) DepTime ArrTime (Intercept) 9.138475e-08 -2.386344e-09 -3.335278e-09 DepTime -2.386344e-09 8.069041e-10 -5.720022e-10 ArrTime -3.335278e-09 -5.720022e-10 7.377236e-10
To see how good of an estimate we have, let’s use function rxCovCoef to get an estimate of (X’X).
Now, a little matrix multiplication and we should have the identity matrix.
XprimeX %*% XprimeXinverse (Intercept) DepTime ArrTime Constant 1.000031e+00 -3.650103e-07 -3.456415e-08 DepTime 2.359597e-04 1.000050e+00 -3.814792e-05 ArrTime -2.516375e-05 -8.483162e-05 1.000103e+00
Not too bad!