Looking at the "Curse of Dimensionality" with R, foreach, and lattice

March 20, 2011
By

(This article was first published on Life in Code, and kindly contributed to R-bloggers)

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.




It's neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to "get" how the maximum metric worked -- it's simpler than I was trying to make it, though I'm still not sure I can explain it properly.

In general, I'm a big fan of R's lattice package, though I'm never quite happy with the legend/key settings.

The code is below. One thing that I've learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I'm guessing here) a lot of time will be wasted with job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ##
    require(foreach)
    require(doMC)
    registerDoMC
    registerDoMC(cores=cores)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep <= <= reps) {
                ## 2 points, one per row, randomly chosen
                tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
                ## calc dist
                mydist = as.numeric(dist(tmp, method=method))
                ## fill return dataframe
                ret[ii, ] = list(ss, .dim, .rep, mydist, method)
                ## update rep counter
                .rep = .rep +1
                ## update ret row counter
                ii = ii +1
            }
          }
        }
        return(ret)
    } ## rbind happens here
    return(finret)
}

For plotting, I've used:
xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)

To leave a comment for the author, please follow the link and comment on his blog: Life in Code.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.