June 1, 2013
By

In my previous post (http://statcompute.wordpress.com/2013/05/25/test-drive-of-parallel-computing-with-r) on 05/25/2013, I’ve demonstrated the power of parallel computing with various R packages. However, in the real world, it is not straight-forward to utilize these powerful tools in our day-by-day computing tasks without carefully formulate the problem. In the example below, I am going to show how to use the FOREACH package doing a grid search for an optimal set of free parameters in a projection pursuit regression (PPR).

PPR is a powerful non-parametric regression model and is able to approximate any arbitrary function with a sufficiently complex setting, which consists of the smoothness tuning parameter and the number of smoothers used in the case shown below. In practice, a grid search strategy by the cross-sample validation is often employed to identify the optimal combination of these two free parameters. However, the challenge is the high computing cost occurred with the grid search. In the example below, if we’d like to try 6 values of the smoothness parameter and 10 smoothers respectively, then totally 60 combinations need to be tested in PPR. If the serial training with FOR() loops is applied to these 60 PPR, the computing time would be tediously lengthy. In this case, since the training of each 60 PPR is independent of each other, the parallel training with FOREACH() might best suit this scenario.

library(MASS)
data(Boston)
X <- I(as.matrix(Boston[-14]))
st.X <- scale(X)
Y <- I(as.matrix(Boston[14]))
boston <- data.frame(X = st.X, Y)

# DIVIDE THE WHOLE DATA INTO TWO SEPARATE SETS
set.seed(2013)
rows <- sample(1:nrow(boston), nrow(boston) - 200)
set1 <- boston[rows, ]
set2 <- boston[-rows, ]

library(doParallel)
registerDoParallel(cores = 8)
library(foreach)

# GRID SEARCH BASED ON THE MINIMUM SSE WITH PARALLEL COMPUTING
cv.sse <- foreach(b = seq(0, 10, 2), .combine = rbind) %dopar% {
foreach(n = 1:10, .combine = rbind) %dopar% {
# TRAIN A PROJECTION PURSUIT REGRESSION WITH VARIOUS SETTINGS AND TRAINING DATA
ppreg <- ppr(Y ~ X, data = set1, nterms = n, sm.method = "supsmu", bass = b)
# CALCULATE SSE WITH VALIDATION DATA
test.sse <- sum((set2$Y - predict(ppreg, set2))^2) data.frame(bass = b, nterms = n, sse = test.sse) } } # PRINT OUT THE BEST SETTING BASED ON THE GRID SEARCH print(best.setting <- cv.sse[cv.sse$sse == min(cv.sse$sse), ]) # OUTPUT WITH THE LOWEST SSE BY GRID SEARCH # # bass nterms sse # 17 2 7 2126.07 # GENERATE A HEAT MAP TO VISUALIZE THE GRID SEARCH OUTCOME library(ggplot2) bass <- factor(cv.sse$bass)
nterms <- factor(cv.sse$nterms) sse <- factor(floor(cv.sse$sse / 100) * 100)
jpeg('cv.jpeg', width = 800, height = 500, quality = 100)
qplot(x = bass, y = nterms, fill = sse, geom = 'tile')
dev.off()


From the output, it appears that SSE (Sum of Squared Errors) reaches the lowest, e.g. 2,126, with the smoothness parameter equal to 2 and the number of smoothers equal to 7. In addition, a heat map might be also used to visualize outcomes of the grid search in a more intuitive way.