It is no surprise that Teradata Aster runs each SQL, SQL-MR, and SQL-GR command in parallel across the cluster on distributed data. But running many similar, independent jobs in parallel takes extra work in Aster: in a SQL script, each command has to wait for the previous one to finish. This makes sense when the commands form a pipeline, with the results of each job passed down to the next. But what if the jobs are independent and each produces its own results? For example, cross-validation of linear regression or other models divides into independent jobs, each working with its own partition (one of K in the case of K-fold cross-validation). These jobs could run in parallel in Aster with a little help from R. This post illustrates how to run K linear regression models in parallel in Aster as part of the K-fold cross-validation procedure.
Furthermore, the examples are concerned only with the step of K-fold cross-validation that creates K models on overlapping but different partitions of the training dataset. We will show how to construct K independent linear regression models in parallel on Aster, one for each of the K partitions of the table (not the same as table partitioning in Aster).
Data and R Packages
We will use the Dallas Open Data data set, available from here (including Aster load scripts).
To simplify the examples we will also use the R package toaster for Aster and several other packages, all available from CRAN:
Data set, Model and K Folds
which results in:
 “area” “value” “lon” “lat”
These 4 fields will make up our simple linear model to determine the value of construction using its area and location. And now the same in R terms:
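A minimal sketch of such a model as an R formula, assuming the four column names listed above; the object name `model1` is a hypothetical choice made for illustration:

```r
# Linear model in R formula notation: construction value explained by
# its area and location. The name `model1` is illustrative only.
model1 <- value ~ area + lon + lat

# A formula only records variable names; no data is touched at this point.
all.vars(model1)
```

Because the formula is just a description of the model, the same object can be reused for every fold.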
This problem is not beyond R memory limits, but our goal is to execute linear regression in Aster. We enlist toaster’s computeLm function, which returns an R lm object:
Lastly, we need to define the folds (partitions) of the table on which to build the linear regression models. Usually, this step divides the data randomly into equal partitions. Doing this with R and Aster is not especially difficult but would take us beyond the scope of the main topic. For this reason alone we propose a quick and dirty method of dividing the building permits into 12 partitions (K=12) using the month of the issue date (in SQL):
Again, do not replicate this method in a real cross-validation task; use it as a template or prototype only.
To make each fold’s complement (used to train the 12 models later) we simply exclude that month’s data, e.g. selecting the complement of fold 6 in its entirety (in SQL):
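The complement filters for all 12 folds can also be generated from R as SQL predicate strings. The sketch below assumes a hypothetical date column named `issued_date` and a `date_part` SQL function; both are assumptions and should be adjusted to the actual table and database:

```r
# Hypothetical name of the column holding the permit issue date (an assumption).
date_col <- "issued_date"

# One SQL predicate per fold: keep every month except the fold's own.
fold_complements <- sprintf("date_part('month', %s) <> %d", date_col, 1:12)

# Predicate selecting the complement of fold 6.
fold_complements[6]
```

Each string can then serve as the filter condition of the query that selects the training data for the corresponding fold.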
Computing Cross-Validation Models in Aster with R
This results in the list fit.folds containing 12 linear regression models, one for each fold.
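The shape of such a loop can be tried locally without an Aster connection. In the sketch below, base R `lm` on a synthetic data frame stands in for the Aster call; the data, the `permits` name, and the `month` column are made up for illustration:

```r
# Synthetic stand-in for the permits table (all values are made up).
set.seed(1)
n <- 240
permits <- data.frame(
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.5),
  lat   = runif(n, 32.6, 33.0),
  month = rep(1:12, each = n / 12)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 1000)

# Fit one model per fold on the fold's complement (all other months).
fit.folds <- list()
for (fold in 1:12) {
  fit.folds[[fold]] <- lm(value ~ area + lon + lat,
                          data = permits[permits$month != fold, ])
}

length(fit.folds)
```

Each iteration is independent of the others, which is what makes the loop a candidate for parallel execution.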
Next, we replace the for loop with the specialized foreach function designed for parallel execution in R. There is no parallel execution yet, but all the structure necessary for the transition to parallel processing is in place:
foreach performs the same iterations from 1 to 12 as the for loop and combines the results into a list by default.
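A local sketch of the same pattern, again with base R `lm` on synthetic data standing in for the Aster call (the data and the `permits` and `month` names are illustrative assumptions). It uses the sequential `%do%` operator from the foreach package:

```r
library(foreach)

# Synthetic stand-in for the permits table (all values are made up).
set.seed(1)
n <- 240
permits <- data.frame(
  area  = runif(n, 500, 5000),
  lon   = runif(n, -97.0, -96.5),
  lat   = runif(n, 32.6, 33.0),
  month = rep(1:12, each = n / 12)
)
permits$value <- 100 * permits$area + rnorm(n, sd = 1000)

# Sequential foreach: the value of the block is collected for each fold,
# and the results are returned as a list by default.
fit.folds <- foreach(fold = 1:12) %do% {
  lm(value ~ area + lon + lat, data = permits[permits$month != fold, ])
}

length(fit.folds)
```

Unlike the for loop, the body returns its result instead of assigning into a list, so no iteration depends on shared state.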
Parallel Computing in Aster with R