Impact of Dimensionality on Data in Pictures


I am excited to announce that this should be my first article to also appear on r-bloggers.com 🙂

Processing data requires taking dimensionality into account, because common metrics change their behaviour in subtle ways, which affects the efficiency of algorithms and methods based on distances / similarities between data points. This has been dubbed the "curse of dimensionality". On the other hand, in some cases high dimensionality can actually help us when investigating data (the "blessing of dimensionality"). In general it is good to know what is going on, so let's have a look at what dimensionality does to data.

Hyper-Spheres and Neighbourhoods

[Figure: ratio of the unit hyper-sphere's volume to the volume of the enclosing hyper-cube, by dimension]

# full code at:
# https://github.com/joyofdata/miscellaneous/blob/master/high-dim/sphere_volume.R

library(ggplot2)
library(grid)

# Monte Carlo estimate of the ratio of the volume of the unit hyper-sphere
# to the volume of the smallest enclosing hyper-cube (edge length 2)
vol_of_sphere <- function(dim, N) {
  m <- matrix(runif(dim*N, -1, 1), ncol = dim)
  sum(sqrt(rowSums(m^2)) <= 1) / N
}

df <- data.frame(
  dim = 1:10,
  v   = sapply(1:10, function(dim) vol_of_sphere(dim, 1000000))
)

# plotting the data computed above
# [...]

Within Euclidean spaces of the familiar dimensions 1, 2 and 3 we tend to think of the neighbourhood of a point, containing close or similar points, as the interior of an interval, circle or sphere. These are all canonical concepts based on the Euclidean distance from the center to the respective point being at most some constant. In the chart above this constant is simply 1, and we compare the volume of the hyper-sphere with radius 1 to the volume of the smallest hyper-box containing it, whose edges have length 2. In one dimension the box, as well as the "hyper-sphere", is an interval of length 2, and so on. By the 10th dimension the ratio is no longer visually distinguishable from 0. And 10 dimensions is not much at all for real data sets. The effect is that points are rarely close anymore to points we would (interpreting the data) consider similar (i.e. "living in the same neighbourhood").
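
As a quick sanity check (not part of the original script), the same ratio can also be computed in closed form from the volume of the unit n-ball, which is π^(n/2) / Γ(n/2 + 1):

# Exact ratio of the unit n-ball's volume to that of the enclosing
# hyper-cube with edge length 2 -- a closed-form check of the Monte
# Carlo estimate above
exact_ratio <- function(dim) pi^(dim/2) / (gamma(dim/2 + 1) * 2^dim)

round(sapply(1:10, exact_ratio), 5)
# 1.00000 0.78540 0.52360 0.30843 0.16449 0.08075 0.03691 0.01585 0.00644 0.00249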

Linear Separability

[Figure: probability that two random point sets of N points each are (almost) linearly separable, by dimension, for N = 10, 20, 30]

# full code at:
# https://github.com/joyofdata/miscellaneous/blob/master/high-dim/random_soft_linsep.R

library(kernlab)
library(ggplot2)
library(grid)

# For a given dimension: train a linear SVM on num_samples random data sets
# and summarise how often the two classes turn out (almost) linearly separable
err_svm_for_dim <- function(num_samples, num_points_per_class, dim, C) {

  rand_svm <- function(N, dim, C) {
    # N points per class, all coordinates drawn from a standard normal
    v <- as.factor(c(rep("a", N), rep("b", N)))
    m <- matrix(rnorm(2*N*dim), ncol = dim)

    # linear (soft-margin) SVM; a training error of 0 means the sample
    # was separated by a hyperplane
    svm <- ksvm(m, v, kernel = "vanilladot", C = C, kpar = list())

    return(svm@error)
  }

  v_res <- mapply(rand_svm, rep(num_points_per_class, num_samples),
             rep(dim, num_samples), rep(C, num_samples))

  # 10% / 50% / 90% quantiles of the training error ...
  q <- quantile(v_res, c(.1, .5, .9))
  names(q) <- NULL

  # ... and the share of samples with at most 0, 1 or 2 misclassified points
  return(c(
    q01 = q[1], q05 = q[2], q09 = q[3],
    ls0 = sum(v_res == 0) / num_samples,
    ls1 = sum(v_res <= 1/(2*num_points_per_class)) / num_samples,
    ls2 = sum(v_res <= 1/num_points_per_class) / num_samples
  ))
}

res_N10 <- sapply(1:50, function(d) err_svm_for_dim(
  num_samples=1000,num_points_per_class=10,dim=d,C=10))
res_N20 <- sapply(1:50, function(d) err_svm_for_dim(
  num_samples=1000,num_points_per_class=20,dim=d,C=10))
res_N30 <- sapply(1:50, function(d) err_svm_for_dim(
  num_samples=1000,num_points_per_class=30,dim=d,C=10))

# plotting the data computed above
# [...]

This chart shows how the probability of two sets of N points each being linearly separable quickly approaches 1 as the dimension increases. Effectively this implies, for example for N = 10 (that is, 20 points), that from the 7th dimension on an observed linear separability can no longer be statistically significant at the 5% level. In general we see (I assume for mathematical reasons connected with the definition of linear separability; I leave this as an exercise to the reader*) that from a certain dimension on linear separability is guaranteed.
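
As an aside (not from the original post): Cover's function-counting theorem gives, for n points in general position in d dimensions, the fraction of all 2^n labelings that are separable by an affine hyperplane. It is not exactly the quantity simulated above (the simulation fixes one balanced labeling and randomises the points), but it shows the same sharp transition and implies that separability is guaranteed once d ≥ n − 1. A minimal sketch:

# Fraction of the 2^n dichotomies of n points in general position in R^d
# that are separable by an affine hyperplane (Cover, 1965):
# 2 * sum_{k=0}^{d} choose(n-1, k) / 2^n
frac_separable <- function(n, d) {
  2 * sum(choose(n - 1, 0:d)) / 2^n
}

# e.g. n = 20 points (N = 10 per class); equals 1 for all d >= n - 1 = 19
sapply(1:25, function(d) frac_separable(20, d))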

For technical reasons the "true" probability curves are likely to rise even sooner. I used a linear kernel SVM, which in some separable cases will settle for a discriminant that does not fully separate the classes; such a sample is then falsely counted as not linearly separable. Then again, data always contains noise and outliers, so we would mostly tend to interpret an almost separable set as actually separable.
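
One possible mitigation (a sketch added for illustration, not part of the original post) is to re-check apparently non-separable samples with a much larger C, which pushes the soft-margin SVM towards a hard-margin solution:

# Re-check a sample with a very large C before counting it as
# "not linearly separable" (illustrative only)
library(kernlab)

is_separable <- function(m, v, C = 1e6) {
  svm <- ksvm(m, v, kernel = "vanilladot", C = C, kpar = list())
  svm@error == 0
}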

Distances and Metrics

[Figure: mean, minimum and maximum distance of random points from the origin, by dimension]

# full code at:
# https://github.com/joyofdata/miscellaneous/blob/master/high-dim/rand_dist_origin.R

library(ggplot2)
library(grid)

max_dim <- 100

# mean, minimum and maximum Euclidean distance from the origin of
# N standard-normally distributed points in the given dimension
rand_dist_origin <- function(N, dim) {
  m <- matrix(rnorm(N*dim), ncol = dim)
  v <- sqrt(rowSums(m^2))
  return(c(avg = mean(v), min = min(v), max = max(v)))
}

v <- mapply(rand_dist_origin, rep(100000,99), 2:max_dim)

# setting up data frame and plotting it with ggplot2
# [...]

As the chart indicates (extrapolating the green and blue trails), the relative difference between the maximum and the minimum observed distance vanishes as the dimension increases. This means variations in distance convey less and less information about similarity, which handicaps clustering and methods like k-NN (assuming they use the Euclidean distance).

\lim_{d \to \infty} E\left[\frac{D^{(d)}_{\max} - D^{(d)}_{\min}}{D^{(d)}_{\min}}\right] = 0

It has been argued in [6], that under certain reasonable assumptions on the data distribution, the ratio of the distances of the nearest and farthest  neighbors to a given target in high dimensional space is almost 1 for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill defined, since the contrast between the distances to different data points does not exist. In such cases, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high dimensional algorithms. [Agg]
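
A quick numerical illustration of this effect, as a hypothetical follow-up to the script above (not part of the original post):

# Relative contrast (max - min) / min of the distances from the origin
# shrinks as the dimension grows
rel_contrast <- function(N, dim) {
  d <- sqrt(rowSums(matrix(rnorm(N * dim), ncol = dim)^2))
  (max(d) - min(d)) / min(d)
}

sapply(c(2, 10, 100, 1000), function(dim) rel_contrast(10000, dim))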

 

"On the Surprising Behavior of Distance Metrics in High Dimensional Space" [Agg] suggests using fractional norms to define metrics that counteract the loss of meaning of distances in high dimensions. These are basically a canonical extension of the L(k) norms, which are usually only defined for k = 1, 2, … (with k = 2 giving the Euclidean distance and k = 1 the Manhattan distance), to positive fractions below 1. For k = 1/2 this gives:

\mathrm{dist}_{1/2}(x, y) = \left(\sum_{i=1}^{d} \lvert x_i - y_i \rvert^{1/2}\right)^{2}
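
For illustration, such a fractional distance could be computed as follows (frac_dist is a hypothetical helper, not from [Agg] or the original post):

# L_k "distance" for arbitrary k > 0; k = 0.5 gives the fractional
# distance from the formula above, k = 1 Manhattan, k = 2 Euclidean
frac_dist <- function(x, y, k = 0.5) {
  sum(abs(x - y)^k)^(1/k)
}

x <- rnorm(100); y <- rnorm(100)
frac_dist(x, y, k = 0.5)
frac_dist(x, y, k = 2)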

[Agg] C. C. Aggarwal, A. Hinneburg, D. A. Keim: "On the Surprising Behavior of Distance Metrics in High Dimensional Space"

* 😉
