Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Seth Mottaghinejad, Data Scientist at Microsoft

## R and big data

There are many R packages dedicated to letting users (or useRs if you prefer) deal with big data in R. (We will intentionally avoid using proper case for 'big data', because (1) the term has been somewhat hackneyed, and (2) for the sake of this article we can think of big data as any dataset too large to fit into memory as a data.frame so that standard R functions can run on them.) Even without third party packages, base R still puts some toolkits at our disposal, which boil down to doing one of two things: We can either format the data more economically so that it can still be squeeze into memory, or we can deal with the data piecemeal, bypassing the need to load it into memory all at once.

An example of the first approach is to format character vectors as factor, when doing so is appropriate, because factor is stored as integer under the hood which takes less space than the strings it represents. There are of course many other advantages to using factor, but let's not digress. An example of the second approach consists of processing the data only a certain number of rows at a time, i.e. chunk by chunk, where each chunk can fit into memory and brought into R as a data.frame.

## RevoScaleR and big data

The aforementioned chunk-wise processing of data is what the RevoScaleR package does behind the scenes. For example, if we run the rxLinMod function (the counterpart to base R's lm function), we can run a regression model on a very large dataset, presumably is too large to fit into memory. Even if the dataset could still fit into memory (servers nowadays can have easily have 500GB of RAM), processing it using lm would take considerably longer than rxLinMod because the latter is a parallel algorithm, meaning that it breaks up the data into chunks and decides what intermediate results are kept from each chunk and how they are aggregated at the end to produce a single result, which in the case of rxLinMod is a linear model. (What those intermediate results are and how they are to be aggregated is what makes developing parallel algorithms a challenging and interesting problem. A great deal of effort is spent on taking a non-parallel algorithm and rewriting it to work in a parallel way. But that's a topic for another day.)

On a local machine, there are two kinds of data types that work with RevoScaleR:

• We can run a function directly on a flat file (such as a CSV or a fixed format file).
• We can convert the data to a format called XDF (external data frame). An XDF dataset, just like an R frame, is a format that can only be understood by R (more specifically, by the set of analytical and data processing functions in the RevoScaleRpackage). But unlike a data.frame, an XDF data format lets the user store the data on disk (the letter X stands for “external”), which means the data can be as large as the available disk on the machine without putting any strains on R.

## Flat files vs XDF

In either of the above two cases, the data is sitting on disk and brought into R chunk-by-chunk to be operated on by one of the RevoScaleR parallel functions. So the next question is: which of the two methods is preferable? In other words, should we store the data as flat files such as CSV or should we convert the data into XDF? Part of the answer to the question depends on requirements sometimes beyond our control. For example, the data science team may have a broad set of analysis tools and hence prefer to have data in a format such as CSV which all the tools can understand. Furthermore, the team may be averse to the idea of having separate copies of the data (one in CSV and another in XDF) because keeping them in sync might be a hassle (not to mention the extra space required to store the copies). Barring such non-technical requirements, the question of CSV vs XDF boils down to which one is faster, and this is what we intend to address here by running you through some benchmark tests.

## Comparison benchmarks

The benchmark tests consist of running the rxSummary function (analogous to the summary function in base R) on increasingly larger chunks of the Criteo dataset. The dataset is made up of 14 numeric and 26 non-numeric variables. We will not concern ourselves with what these variables are, as it has little bearing on the benchmark tests, but suffice it to say that some of the non-numeric variables have relatively few unique values (good candidates for factor conversion) and others are best left as character. This is not uncommon in most datasets, where we can have categorical data such as gender or state which have predefined unique values and should be treated as factor as well as string variables such as alphanumeric IDs or raw text, which are for the most part unique (or have high cardinality) and should remain as character variables.

The tests run across two main comparison axes: We compare how runtime is affected by

• Reading directly from a CSV file to summarize it versus converting the CSV file to an XDF file and then summarizing it.
• Declaring all or some variablesfactor as opposed to character when reading the data. More specifically, we will be iterating over four possibilities:
• declaring all non-numeric variables to be character
• declaring all non-numeric variables to be factor
• declaring non-numeric variables with 10,000 (chosen somewhat arbitrarily) or fewer unique values to befactor and the rest character
• same as above, but also explicitly providing the levels for thefactor variables at the point of declaration

For the sake of clarity, the individual tests are once again shown in the table that follows, where each test has also been labeled to make it easier to refer to them later.

 column types / data source read from CSV and summarize convert to XDF and summarize all non-numeric variables are character 1a 2a all non-numeric variables are factor 1b 2b variables with cardinality < 10K are factor 1c 2c variables with cardinality < 10K are factor and levels are provided 1d 2d

The data sizes that we've chosen to test the above on have 500K, 5M, and 50M rows. So for a dataset of a given size, there will be a total of 8 tests, two tests for case 1 times four tests for case 2, giving us a total of 3*2*4 = 24 tests in total.

## Setting expectations

Before we look at numbers, here's a few things to keep in mind:

• Two important things happen when we declare a variable to be a factor: (1) rxSummary provides us with counts of unique values for the variable, and (2) under the hood, the variable gets stored as integer, which when writing to an XDF file can speed up IO. Both of these will affect runtime and are therefore relevant to our tests.
• In tests 1a-1d, we are reading from a CSV file to summarize the results, whereas in tests 2a-2d we are reading from the original CSV to copy or write its content to an XDF file, and then reading from the XDF file to summarize it. While it may seem that the latter would always be slower, we need to remember that an XDF file is a more efficient format and therefore reading from it will be much faster than its CSV counterpart. But what we need to determine is if it is fast enough to make up for the time it took to convert from CSV to XDF.
• The case for deciding whether a variable should be formatted as factor or character depends on questions such as whether we want to get counts for the variable (factor), use it as a categorical variable in a regression model (factor), apply string manipulation functions to the variable (character), use it as a key to join it with other datasets (character or factor if extra caution is taken) and so on. Generally, turning all the non-numeric variables in the data into factor (tests 1b and 2b) is not a good idea, especially if the data contains variables that are more-or-less unique (such as ID variables), because of the overhead associated with creating factor variables.
• Our choice of letting non-numeric variables with 10,000 or fewer unique values be factor and the rest character (tests 1c, 2c, 1d, 2d) is somewhat arbitrary, but in effect it mirrors the notion of factor variables having a predefined domain of acceptable values. If we don't know this information ahead of time, then we must make an initial pass through the data to find out, which would add to the runtime, but we ignore this overhead. In practice, information about candidate factor variables and their associated levels can be gleaned from a “data dictionary” if one is maintained, and if not, a judicious guess can be made by sampling from the data.

## A high-level overview of the R code

We invite the user to pore over the entire R code, but we will point to the most important pieces here. First we store all four iterations of the aforementioned column types in the col_info_types object. Each object in col_info_types is itself a list, one for each column in the data. It specifies whether that column is numeric, character, or factor, and in the case of col_info_some_fac_with_lev it also specifies what the levels are when the column is a factor.

col_info_types <- list("CI1_all_char" = col_info_all_char,
“CI2_all_fac” = col_info_all_fac,
“CI3_some_fac” = col_info_some_fac,
“CI4_with_lev” = col_info_some_fac_with_lev)

To run tests 1a-1d, we write a function called summarize_with_rxTextData which will point to the original CSV file using the RxTextData function in RevoScaleR and then pass the result to rxSummary to get summaries from. The RxTextData function has an argument called colInfo to which we will pass the four column types above, one at a time. The function will return two pieces of information: (1) runtime for RxTextData, (2) runtime for rxSummary.

summarize_with_rxTextData <- function(ds, colInfo, ...) {
start.time <- Sys.time()

st1 <- system.time(
text_ds <- RxTextData(ds,
firstRowIsColNames = FALSE,
missingValueString = “”,
delimiter = “t”, colInfo = colInfo, …))

closeAllConnections()

st2 <- system.time(rxs <- rxSummary( ~ ., data = text_ds))

print(Sys.time() – start.time)

return(list(time_01_RxTextData = st1, time_02_rxSummary = st2))

}

results_01_rxTextData <- lapply(col_info_types,
summarize_with_rxTextData,
ds = criteo_csv)

For the sake of illustration, here are the runtimes for a dataset with 50K rows.

> sapply(results_01_rxTextData,
function(x) x$time_01_RxTextData)[1:3, ] CI1_all_char CI2_all_fac CI3_some_fac CI4_with_lev user.self 0.009 0.008 0.007 0.008 sys.self 0.000 0.000 0.000 0.000 elapsed 0.009 0.008 0.007 0.008 > sapply(results_01_rxTextData, function(x) x$time_02_rxSummary)[1:3, ]

CI1_all_char CI2_all_fac CI3_some_fac CI4_with_lev

user.self        0.011       0.078        0.087        0.089
sys.self         0.001       0.001        0.001        0.001
elapsed          3.438      10.091        9.102        5.977

As we can see, the runtime for RxTextData is negligible (since all it does is point to a file and specify the column types), so all the runtime is taken up by rxSummary. It took rxSummary approximately 3.4, 10.1, 9.1 and 6 seconds respectively to run tests 1a-1d.

To run tests 2a-2d, we write a function called summarize_with_rxImport which will convert the CSV file to XDF using the rxImport function and then pass the result to rxSummary to get the summaries, just like the last case. Just like RxTextDatarxImport also has an argument called colInfo which allows us to specify the column types. The function returns two pieces of information: (1) runtime for rxImport, (2) runtime for rxSummary.

summarize_with_rxImport <- function(ds, colInfo, ...) {

start.time <- Sys.time()

st1 <- system.time(
rxImport(inData = ds,
outFile = criteo_xdf,
overwrite = TRUE,
colInfo = colInfo, …))

st2 <- system.time(rxs <- rxSummary( ~ ., data = criteo_xdf))

closeAllConnections()

print(Sys.time() – start.time)

return(list(time_01_rxImport = st1, time_02_rxSummary = st2))

}

results_02_rxImport <- lapply(col_info_types,
summarize_with_rxImport,
ds = criteo_csv)

And here are the results for the dataset with 50K rows:

> sapply(results_02_rxImport,
function(x) x$time_01_rxImport)[1:3, ] CI1_all_char CI2_all_fac CI3_some_fac CI4_with_lev user.self 0.065 0.066 0.067 0.078 sys.self 0.000 0.000 0.000 0.000 elapsed 4.548 5.607 5.214 5.615 > sapply(results_02_rxImport, function(x) x$time_02_rxSummary)[1:3, ]

CI1_all_char CI2_all_fac CI3_some_fac CI4_with_lev

user.self        0.011       0.077        0.048        0.064
sys.self         0.000       0.000        0.001        0.002
elapsed          0.478       1.007        0.616        0.667

Things look a bit different now: rxImport takes a considerable amount of time to run (most which spent on IO operations), but once the XDF object is created rxSummary can process it very quickly to get the results. It took approximately 5, 6.6, 5.8, and 6.3 seconds to run tests 2a-2d respectively.

The numbers above were reported just to give the reader a closer view of the raw results. Since a dataset of size 50K is rather small, we now run these tests for datasets of size 500K, 5M and 50M, collect the corresponding runtimes.  Finally, we summarize the results in a chart.

## Going over the results

The chart shows the runtime (broken up by time taken up by RxTextData/rxImport vs rxSummary), using different data sizes: 500K, 5M, and 50M (labeled in thousands of rows, as 500, 5000, and 50000 respectively and color-coded). The variable on the y-axis, which we call “adjusted runtime”, is the average run time for processing 50K rows of the data. To make it easier to identify the tests, the way the plots are presented in the chart correspond to the way they were presented in table shown earlier. Here are some interesting findings to come out of it:

• For all the tests in column one, all the runtime is due to rxSummary, since as we mentioned before, RxTextData merely points to the data (and tags on some metadata through colInfo) without doing any processing.
• Test 1a beats 2a, and 1d beats 2d regardless of data size. In other words, when all non-numeric columns are treated as character or when certain factors are present in the data and their levels are known a priori, it is faster to read and summarize the data directly from the original CSV file than to convert it to XDF first. However, it bears mentioning that (1) the difference in runtime is almost negligible, and (2) we are running a single analytical function (rxSummary), whereas in practice we often run multiple analytical functions off the data (such as rxSummary followed by rxLogit followed by rxPredict) which will tip the balance in favor of converting to XDF. To put it in economic terms, there is a fixed cost (runtime) associated with XDF conversion, but once this fixed cost is paid, the variable cost of running our analysis on the XDF file (as opposed to the original CSV file) is reduced considerably, and the more analysis functions we run off of the XDF file, the more the cost reduction will make up for the fixed cost.
• Test 1b loses out to 2b, and 1c loses out to 2c: i.e. when all or certain variables need to be converted to factors but factor levels are not known ahead of time, requiring an initial pass through the data to determine the levels, it is faster to convert the data to XDF first than to do deal directly with the CSV file. Moreover, for both 1b and 2b we can see that running rxSummary does not scale well with the data (the blue bar on the right in 1b and 2b rises sharply compared to the red and green), underlining what we mentioned earlier, namely that declaring non-numeric variables that are more-or-less unique as factor can significantly up the overhead.

## Concluding remarks

Let's conclude by first acknowledging that the above results should be taken with a grain of salt, as they are somewhat dependent on the kind of data at hand. However, the Criteo dataset makes a good use case because it contains a good mix of numeric and non-numeric data. It is therefore reasonable to expect the above conclusions to still hold water with data that's almost exclusively numeric or exclusively non-numeric as the case may be. We should also not be too surprised with some of the results, especially the slowdown in runtime when all non-numeric data is declared as factor (tests 1b and 2b), since it mirrors letting stringsAsFactors = TRUE when running read.table in base R, which also results in considerable slowdown for larger-sized datasets. It's also important to note that all of our results have been on a single machine, and that other considerations come into play in a distributed environment such as a Hadoop cluster or in modern database. In a later article, we will examine what the results look like when run in such an environment with even larger data sizes.