There is no “Too Big” Data, is there?

April 23, 2014
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few years ago, a former classmate came back to me with a simple problem. He was working for some insurance company (and still is, don’t worry, chatting with me is not yet a reason for dismissal), and his problem was that their dataset was too large to run (standard) codes to get a regression, and some predictions. My answer was too use sub-sampling techniques, and I still believe that this might be a good idea (actually, I wrote a long post, on that issue, entitled too large datasets for regression ? What about subsampling). But I wanted to go further, since I did not discuss predictions obtained with sub-sampling techniques.

So, consider here a logistic regression $Y_i\sim\mathcal{B}(p_i)$, based on some covariates. We have $k$ explanatory variables ($k$ will be large, but not too large) and $n$ observations (with $n­\gg k$), $\frac{p_i}{1-p_i}=\exp[\boldsymbol{X}_i^{\text{\sffamily T}}\boldsymbol{\beta}]$ Here we have a big (potentially) matrix product $\boldsymbol{X}^{\text{\sffamily T}}\boldsymbol{\beta}$ i.e. ${\begin{matrix}{}_{ \uparrow} \\ {}_{n} \\ {}_{ \downarrow}\end{matrix}} \underset{\leftarrow \ \ k+1 \ \ \rightarrow}{\begin{bmatrix}\vdots&\vdots&&\vdots\\1 & X_{1,i} &\cdots &X_{k,i}\\\vdots&\vdots&&\vdots\end{bmatrix}}\begin{bmatrix}\beta_0 \\\beta_1\\ \vdots \\ \beta_k \end{bmatrix}$ with a large $\boldsymbol{X}$ matrix. Here, assume that we have a $100,000\times101$ matrix, with $100,000$ individual observations, and $100$ possible variables (and the intercept). Actually, in my model, only $2$ variables were actually used in the real model. Assume further that explanatory variables are – potentially – correlated.

```n=100000
library(mnormt)
k=50
r=.2
Sig=matrix(r,k,k)
diag(Sig)=1
X=rmnorm(n,varcov=Sig)
U=pnorm(rmnorm(n,varcov=Sig))
p=exp(-U[,1]-X[,1]-1)/(1+exp(-U[,1]-X[,1]-1))
Y=rbinom(n,size=1,p)
df=data.frame(Y,U,X)
names(df)=c("Y",paste("U",1:50,sep=""),paste("X",1:50,sep=""))
reg=glm(Y~.,data=df,family="binomial")```

In some sense, it is not too big, since we can run a regression on that dataset with a simple laptop (even if it can still be seen as a large dataset, in the sense discussed in http://businessweek.com/…). But let us consider an alternative strategy, to be able to get some predictions – or some model – in the case we cannot run a regression. Two strategies will be compared,

• generate $100$  datasets with $n/10$ observations, by sub-sampling
• generate $100$  datasets with $n/100$ observations, by sub-sampling,

On each dataset, we can now run a regression, and compare the estimation of the coefficients with the “true” regression (on the whole dataset, since here, we can still run it). Then, since out of $100$ explanatory variables, only $2$ were actually used to generate the ouput, we should probably remove unnecessary variables in our model. So, some stepwise procedures were used.

```L1=L2=L1s=L2s=list()
library(MASS)
ns1=n/10
ns2=n/100
for(s in 1:100){
i=sample(1:n,size=ns1,replace=TRUE)
reg_sub=glm(Y~.,data=df[i,],family="binomial")
L1[[s]]=reg_sub
L1s[[s]]=stepAIC(reg_sub)
i=sample(1:n,size=ns2,replace=TRUE)
reg0=glm(Y~.,data=df[i,],family="binomial")
L2[[s]]=reg_sub
L2s[[s]]=stepAIC(reg_sub)
}```

For instance, if we consider the very first coefficient which should appear in the regression (let us forget about the intercept), or the second coefficient (which was not considered to generate the dataset), we get

```VC=c(-1,-1,rep(0,49),-1,rep(0,49))
coef=function(k){
C1=unlist(lapply(L1,function(x) coefficients(x)[k]))
C2=unlist(lapply(L2,function(x) coefficients(x)[k]))
m=summary(reg)\$coefficients
u=seq(quantile(C2,.2),quantile(C2,.8),length=501)
v=dnorm(u,m[k,1],m[k,2])
plot(u,v,col="white",xlab="",ylab="",axes=FALSE)
axis(1)
polygon(c(u,rev(u)),c(v,rep(0,length(u))),col="grey",border=NA)
abline(v=VC[k],lty=2)
}

coef(2)```

where the density in grey is the Gaussian density of some estimator obtained from the large (and complete) dataset and boxplots are estimates obtained on sub-samples (without the stepwise procedure, just to make sure I will keep that variable).

For coefficients associated to variables not used to generate the dataset, we get graphs like the following

So, clearly, the smaller the dataset, the large the dispersion of the estimates. But far, nothing new. In my previous post – too large datasets for regression ? What about subsampling – my point was to discuss computational times, and a possible optimal size of sub-datasets. Now, what about the impact of sub-sampling on predictions. Here, we fit a model on a small sample, but we can get a prediction on the whole dataset. In order to describe the goodness of fit of our regression model, let us plot ROC curves. More specifically, three kinds of lines will be plotted,

• the ROC curve for the $\widehat{Y}_i$‘s obtained with the model on the complete dataset [red]
• the ROC curves for the $\widehat{Y}_i^{(b)}$‘s obtained with the model on the $b$‘s subsample [light blue]
• the ROC curve for the $\widetilde{Y}_i$ ’s obtained by averaging the previous estimates [blue]

$\widetilde{Y}_i=\frac{1}{B}\sum_{b=1}^B\widehat{Y}_i^{(b)}$

```S=predict(reg,type="response")
Y=def\$Y
M.ROC=ROC.curve(S,Y)
plot(M.ROC[1,],M.ROC[2,],type="s",col="red")

Z=df\$Y*0
for(si in 1:100){
S=predict(L1s[[si]],type="response",newdata=df)
Z=Z+S
Y=df\$Y
M.ROC=ROC.curve(S,Y)
lines(M.ROC[1,],M.ROC[2,],type="s",col="light blue")
}

S=Z/100
Y=df\$Y
M.ROC=ROC.curve(S,Y)
lines(M.ROC[1,],M.ROC[2,],type="s",col="blue",lwd=2)```

If we consider sub-samples of size $n/10$, we get the following, and when we consider sub-samples of size $n/100$, without the stepwise procedure (most variables have a small coefficient, not significant) and after the stepwise procedure Clearly – and that should not be a surprise – looking at predictions when the model was fitted on !% of the dataset is not great (ROC curves are substantially below the red ROC curve). But the interesting point is that averaging yields great results. In terms of ROC curve, we have the same

• running one regression on our $100,000\times101$  matrix
• averaging prediction after running $100$ regressions on some $1,000\times101$  matrices

Except that the first one might not be possible to run, if the dataset was larger. And I have to admit that with the stepwise procedure, with $100$ variables (where $98$ should – theoretically – be renoved), it took some time! But still. I have the feeling the sub-sampling is extremely promising in the context of too large datasets.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.