Gina Gruenhage has just arxived a new paper describing an algorithm we call cMDS. Here’s what it’s for: if you do any kind of data analysis you often find yourself comparing datapoints using some kind of distance metric. All’s well if you have a unique reasonable distance metric you can use, but often what you have is a family of possible distance functions, and very little idea how to choose among them. What if the patterns in the data change according to how you measure distance?
For example, you have a set of character strings, and you compare them using the Levenshtein distance, which says that the more edits you have to make to turn string A into string B, the more different they are (so that “abba” is closer to “ab ba” than to “baba”). There are different kinds of changes you can make to a string: you can add a character, remove it, or replace it with another.
Let’s say we know that character deletion happens more rarely. We’d like to take that into account in the metric by giving a higher weight to character deletions. How high should that weight be?
The problem also occurs all the time with spatial, temporal or spatiotemporal data: choosing a particular distance function often means choosing a particular spatial or temporal scale at which to compare the patterns.
If one has a distance function that comes with an almost arbitrary parameter you have to set, then presumably one would like to know if the results depend on that parameter or not.
In the plot above, we imagine that along the x axis we change the way we measure distance. The different curves are the different datapoints, and their relative position along the y axis indicates relative distance. In the first case (A), nothing happens as we change the hyperparameter. In (B), clusters are present for only a certain range. In (C), neighbourhoods relationships change, and points that start out as neighbours become far apart. In (D), points simply become undistinguishable.
All of these patterns (as well as others) can happen in real data and we suggest a way to find out about them by embedding datapoints as curves, using a variant of MDS. As one reviewer pointed out our variant is closely related to dynamical MDS algorithms for graph visualisation, see paper for details.
Here’s an example of what one can do with cMDS. We downloaded from the Gapminder website some data on EU countries. Some of the variables, like birth rate, are demographic. Some of the variables, like GDP, are economic. We can compare countries based on their demography, their economy, or some weighted version of both. Using cMDS we can look at what changes as you vary the weight of the economic variables:
In each panel the two dimensions are artificial: they are meant to optimally reflect a distance pattern (so that e.g. the Netherlands and Finland are close but Finland is relatively far Bulgaria). cMDS output shows that NLD and FLD stay close together throughout, while EST and BGR drift apart. See an interactive version of the plot here.
We’ve released a cMDS package, downloadable from GitHub.
In R, run:
library(devtools) devtools::install_github("cmdsr","ginagruenhage") library(cmdsr)