Looking at the "Curse of Dimensionality" with R, foreach, and lattice

[This article was first published on Life in Code, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Here are the results of a “Curse of Dimensionality” homework assignment for Terran Lane’s Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

It’s neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to “get” how the maximum metric worked — it’s simpler than I was trying to make it, though I’m still not sure I can explain it properly.

In general, I’m a big fan of R’s lattice package, though I’m never quite happy with the legend/key settings.

The code is below. One thing that I’ve learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn’t immediately emitted to the console, and it’s annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I’m guessing here) a lot of time will be wasted with job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ##
    require(foreach)
    require(doMC)
    registerDoMC
    registerDoMC(cores=cores)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep 

For plotting, I’ve used:

xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)

To leave a comment for the author, please follow the link and comment on their blog: Life in Code.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)