Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Multi-dimensional scaling (MDS) (Kruskal 1964) is a dimensionality reduction method used for visualising and denoising high-dimensional data. However, since MDS requires calculating the distances between all pairs of data points, it does not scale well to datasets with a large number of samples.

We released lmds v0.1.0, an implementation of Landmark MDS (LMDS) (de Silva and Tenenbaum 2004). Landmark MDS only calculates the distances between a set of landmarks and all other data points, thereby sacrificing determinism for scalability.

## Regular MDS

A single-cell transcriptomics dataset (???) is used to demonstrate (L)MDS, containing 392 profiles which measure the abundance levels of 2000 differentmolecules within individual cells. Note that while the dataset is thus only a 392×2000 matrix, LMDS is designed to scale to much higher dimensionality, as demonstrated in the last section.

Simply looking at the raw expression values as a heatmap reveals little to no information:

library(tidyverse)
set.seed(1)
dataset <- dyno::fibroblast_reprogramming_treutlein
cell_info <- data.frame(grouping = dataset$grouping) pheatmap::pheatmap( t(as.matrix(dataset$expression)),
show_colnames = FALSE,
show_rownames = FALSE,
annotation_col = cell_info
)

Applying MDS quickly reveals the underlying bifurcating topology of the dataset (from MEF to myocytes and neurons).

# compute distance matrix
dist <- dynutils::calculate_distance(dataset$expression, method = "pearson") dim(dist) ## [1] 392 392 # compute MDS dimred_mds <- cmdscale(dist) # plot points qplot(dimred_mds[,1], dimred_mds[,2], colour = dataset$grouping) +
theme_bw() +
labs(x = "Comp 1", y = "Comp 2", colour = "Group")

Regular MDS, however, requires computing all pairwise distances between data points. This dataset only contains 392 data points, but for datasets it is increasingly infeasible to apply MDS.

## Landmark MDS

Landmark MDS (LMDS) (de Silva and Tenenbaum 2004) is an extension of MDS which scales much better with respect to the number of data points in the dataset. A short while ago, we published an R package on CRAN implementing this algorithm, lmds v0.1.0.

Landmark MDS only computes the distance matrix between a set of landmarks and all other data points. MDS is then only performed on the landmarks, and all other datapoints are projected into the landmark space.

library(lmds)
# compute distances between random landmarks and all data points
dist_landmarks <- select_landmarks(
dataset$expression, distance_method = "pearson", num_landmarks = 150 ) dim(dist_landmarks) ## [1] 150 392 # perform LMDS dimred_lmds <- cmdscale_landmarks(dist_landmarks) # plot points qplot(dimred_lmds[,1], dimred_lmds[,2], colour = dataset$grouping) +
theme_bw() +
labs(x = "Comp 1", y = "Comp 2", colour = "Group")

Most frequently, these two steps can be applied together as follows:

dimred_lmds2 <- lmds(
dataset\$expression,
distance_method = "pearson",
num_landmarks = 150
)

## Execution time

In the figure below, the execution times of MDS and LMDS are compared by increasing the size of a random dataset until the execution of either algorithms exceeds 10 seconds.

## Conclusion

LMDS is a heuristic for MDS which scales linearly with respect to the number of points in the dataset. Go ahead and check out our implementation for this algorithm, available on CRAN. If you encounter any issues, feel free to let me know by creating an issue post on Github.

## References

de Silva, Vin, and Joshua B Tenenbaum. 2004. “Sparse Multidimensional Scaling Using Landmark Points.” Technical Report, Stanford University, 41.

Kruskal, J. B. 1964. “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis.” Psychometrika 29 (1): 1–27. https://doi.org/10.1007/BF02289565.