**Life in Code**, and kindly contributed to R-bloggers)

Here are the results of a “Curse of Dimensionality” homework assignment for Terran Lane’s Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

It’s neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to “get” how the `maximum` metric worked — it’s simpler than I was trying to make it, though I’m still not sure I can explain it properly.

In general, I’m a big fan of R’s `lattice` package, though I’m never quite happy with the legend/key settings.

The code is below. One thing that I’ve learned from this exercise is that I prefer not to have the `foreach` call in the outermost loop. For one, error and/or progress reporting from inside a `foreach` isn’t immediately emitted to the console, and it’s annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I’m guessing here) a lot of time will be wasted with job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4, rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){ ## examine distance between 2 points in dims dimensions for different metrics ## chosen at random according to rfun (rnorm by default) ## require(foreach) require(doMC) registerDoMC registerDoMC(cores=cores) ## use multicore/foreach on the outside (each sd) finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% { ## pre-allocate, don't append/rbind! ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods)) ii = 1 for (.dim in dims) { for (method in .methods) { ## reset .rep counter .rep = 1 while (.rep <= reps) { ## 2 points, one per row, randomly chosen tmp = matrix(rfun(2*.dim, sd=ss), nrow=2) ## calc dist mydist = as.numeric(dist(tmp, method=method)) ## fill return dataframe ret[ii, ] = list(ss, .dim, .rep, mydist, method) ## update rep counter .rep = .rep +1 ## update ret row counter ii = ii +1 } } } return(ret) } ## rbind happens here return(finret) }

For plotting, I've used:

xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.', par.settings=list(background=list(col='darkgrey')), layout=c(3,1), scales=list(y=list(relation='free'), alternating=1), auto.key=list(space='bottom', columns=5, points=F, lines=T), xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)', ) xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.', par.settings=list(background=list(col='darkgrey')), layout=c(5,1), scales=list(y=list(relation='free'), alternating=1), auto.key=list(space='bottom', columns=3, points=F, lines=T), xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)', )

**leave a comment**for the author, please follow the link and comment on his blog:

**Life in Code**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...