(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

*by Joseph Rickert*

When talking with data scientists and analysts who work with large-scale data analytics platforms such as Hadoop about the best way to do some sophisticated modeling task, it is not uncommon for someone to say, "We have *all* of the data. Why not just use it all?" This sort of comment often sounds pragmatic and reasonable to almost everyone at first. After all, wouldn’t a model based on all of the data be better than a model based on a subsample? Well, maybe not. It depends, of course, on the problem at hand as well as on time and computational constraints. To illustrate the kinds of challenges that large data sets present, let’s look at something very simple using the airlines data set from the 2009 ASA challenge.

Here are some of the results for a regression of ArrDelay on CRSDepTime with a random sample of 12,283 records drawn from that data set:

    # Coefficients:
    #             Estimate Std. Error t value Pr(>|t|)
    # (Intercept) -0.85885    0.80224  -1.071    0.284
    # CRSDepTime   0.56199    0.05564  10.100 2.22e-16
    # Multiple R-squared: 0.008238
    # Adjusted R-squared: 0.008157

And here are some results from the same model using 120,947,440 records:

    # Coefficients:
    #               Estimate Std. Error t value Pr(>|t|)
    # (Intercept) -2.4021635  0.0083532  -287.6 2.22e-16 ***
    # CRSDepTime   0.6990404  0.0005826  1199.9 2.22e-16 ***
    # Multiple R-squared: 0.01176
    # Adjusted R-squared: 0.01176

More data didn’t yield an obviously better model! I don’t think anyone would really find this to be much of a surprise. We are dealing with a not very good model to begin with. Nevertheless, the example does provide the opportunity to investigate how estimates of the coefficients change with sample size. This next graph shows the slope coefficient plotted against sample size, with sample sizes ranging from 12,283 to 12,094,709 records. Each regression was done on a random sample that includes about 12,000 points more than the previous one. The graph also shows, in red, the standard estimate of the confidence interval for the coefficient at each point. Notice that after some initial instability, the coefficient estimates settle down to something close to the value of beta obtained using all of the data.
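For readers without the airline data at hand, the same pattern can be reproduced on simulated data. The following is only a sketch in base R: the data frame `dat`, the true slope of 0.7, the noise level, and the sample sizes are all made-up illustration values, and `lm()` stands in for the RevoScaleR regression used in the post.

```r
# Simulated stand-in for the airline data: a weak linear signal buried in noise.
# The true slope, noise level, and sizes are arbitrary illustration values.
set.seed(42)
n_total <- 100000
x <- runif(n_total, 0, 24)                    # hypothetical "departure hour"
y <- -2 + 0.7 * x + rnorm(n_total, sd = 30)   # hypothetical "arrival delay"
dat <- data.frame(x = x, y = y)

# Shuffle once so that the first k rows always form a random sample
# and each larger sample contains the previous one.
dat <- dat[sample(n_total), ]

# Fit the same model on growing samples and keep the slope and its standard error.
sizes <- seq(2000, n_total, by = 2000)
fits <- t(sapply(sizes, function(k) {
  fit <- lm(y ~ x, data = dat[seq_len(k), ])
  est <- coef(summary(fit))["x", c("Estimate", "Std. Error")]
  c(n = k, slope = est[["Estimate"]], se = est[["Std. Error"]])
}))
fits <- as.data.frame(fits)

# Slope from the full simulated data set, for comparison.
full_slope <- fits$slope[nrow(fits)]

# Slope estimates with an approximate 95% band (estimate +/- 2 standard errors).
plot(fits$n, fits$slope, type = "l",
     xlab = "Sample size", ylab = "Estimated slope")
lines(fits$n, fits$slope + 2 * fits$se, col = "red")
lines(fits$n, fits$slope - 2 * fits$se, col = "red")
abline(h = full_slope, lty = 2)
```

On simulated data like this the band narrows quickly and the estimates hug the dashed full-data line, which is the behavior described for the airline regressions.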

The rapid approach to the full-data-set value of the coefficient is even more apparent in the following graph, which shows the difference between the estimated value of the beta coefficient at each sample and the value obtained using all of the data. The maximum difference from the fourth sample on is 0.07. This is pretty close indeed. In cases like this, if you believed that your samples were representative of the entire data set, working with all of the data to evaluate possible models would be a waste of time and possibly counterproductive.
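One way to see why the estimates settle so fast is that the standard error of a regression coefficient shrinks roughly like one over the square root of the sample size. As a rough check, and assuming that rule of thumb applies here, the standard error reported for the 12,283-record sample can be used to predict the one reported for the full data set:

```r
# Figures copied from the two regression summaries reported earlier in the post.
n_small  <- 12283
se_small <- 0.05564
n_full   <- 120947440
se_full  <- 0.0005826

# If the standard error shrinks like 1 / sqrt(n), the full-data standard
# error can be predicted from the small sample alone.
se_predicted <- se_small * sqrt(n_small / n_full)

print(se_predicted)            # predicted full-data standard error
print(se_predicted / se_full)  # ratio of predicted to reported value
```

The prediction comes out at about 0.00056, within a few percent of the reported 0.0005826. Going from roughly twelve thousand records to roughly 121 million mostly bought a narrower confidence interval around a slope that was already visible in the small sample.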

I am certainly not arguing that one never wants to use all of the data. For one thing, when scoring a model or making predictions the goal is to do something with all of the records. Moreover, in more realistic modeling situations where there are thousands of predictor variables, 120M observations might not be enough data to conclude anything. A large model can digest degrees of freedom very quickly and severely limit the ability to make any kind of statistical inference. I do want to argue, however, that with large data sets the ability to work with random samples of the data confers the freedom to examine several models quickly, with considerable confidence that the results would be decent estimates of what would be obtained using the full data set.

I did the random sampling and regressions in my little example using functions from Revolution Analytics’ RevoScaleR package. Initially, all of the data was read from the csv files that comprise the FAA data set into the binary .xdf file format used by RevoScaleR. Then the random samples were selected using RevoScaleR’s rxDataStep function, which was designed to quickly manipulate large data sets. The code below draws, for each record, a random integer between 1 and 9999 and assigns it to the variable urns.

    rxDataStep(inData = working.file, outFile = working.file,
               transforms = list(urns = as.integer(runif(.rxNumRows, 1, 10000))),
               overwrite = TRUE)

Random samples for each regression were drawn by looping through the appropriate values of the urns variable. Notice how the call to R’s runif() function happens within the transforms parameter of rxDataStep. It took about 33 seconds to do the full regression on my laptop, which made it feasible to undertake the extravagant number of calculations necessary to do the 1,000 regressions in a few hours after dinner.
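The urn idea itself does not depend on RevoScaleR. Here is a minimal base-R sketch of it on an in-memory vector. The row count `n_rows` and the helper `sample_rows()` are hypothetical, and the selection rule (take every record whose urn number is at most k) is my assumption about what looping over the urn values looks like. It is not code from the original analysis.

```r
set.seed(1)
n_rows <- 1000000   # hypothetical record count, far smaller than the airline data

# Same expression as in the transform: uniform draws between 1 and 10000,
# truncated to integers, so each record gets an urn number from 1 to 9999.
urns <- as.integer(runif(n_rows, 1, 10000))

# Each urn holds roughly n_rows / 9999 records.
urn_sizes <- tabulate(urns, nbins = 9999)

# Assumed selection rule: the k-th sample is every record with urns <= k,
# so each sample contains the previous one plus one more urn.
sample_rows <- function(k) which(urns <= k)

sample_sizes <- sapply(1:5, function(k) length(sample_rows(k)))
print(range(urns))
print(sample_sizes)
```

With 120,947,440 records spread over 9,999 urns, each urn would hold roughly 12,100 records, which is in line with the first sample of 12,283 records and with each later sample adding about 12,000 points.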

I think there are three main takeaways from this exercise:

- Lots of data does not necessarily equate to “Big Data”.
- For exploratory modeling you want to work in an environment that allows for rapid prototyping and provides the statistical tools for model evaluation and visualization. There is no better environment than R for this kind of work, and Revolution’s distribution of R offers the ability to work with very large samples.
- The ability to draw random samples from large data sets is the way to balance accuracy against computational constraints.

To my way of thinking, the single most important capability to implement in any large-scale data platform that is going to support sophisticated analytics is the ability to quickly construct high-quality random samples.
