Using UMAP in R with rPython

[This article was first published on schochastics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I wrote about dimensionality reduction methods before and now, there seems to be a new rising star in that field, namely the Uniform Manifold Approximation and Projection, short UMAP. The paper can be found here, but be warned: It is really math-heavy. From the abstract:

UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.

This sounds promising, although the details are not so easy to comprehend.

There is already an implementation for python from the authors on github and I am pretty sure that there will be an R package fairly soon. But for the time being, we can use the Python version with the help of the rPython package.

#used packages
library(tidyverse)  # for data wrangling

UMAP in R with rPython

To use the Python version of UMAP in R, you first need to install it from github. The following code defines a function, which internally calls the UMAP Python function1.

#install.packages(rPython)
umap <- function(x,n_neighbors=10,min_dist=0.1,metric="euclidean"){
  x <- as.matrix(x)
  colnames(x) <- NULL
  rPython::python.exec( c( "def umap(data,n,mdist,metric):",
                  "\timport umap" ,
                  "\timport numpy",
                  "\tembedding = umap.UMAP(n_neighbors=n,min_dist=mdist,metric=metric).fit_transform(data)",
                  "\tres = embedding.tolist()",
                  "\treturn res"))

  res <- rPython::python.call( "umap", x,n_neighbors,min_dist,metric)
  do.call("rbind",res)
}

The parameters are set to what is recommended by the authors. There are many different distance metrics implemented in the Python version that can also be used in this R function. Check out the Python code for options.

Below is a quick example using the infamous iris data.

data(iris)
res <- umap(iris[,1:4])
tibble(x = res[,1],y = res[,2],species = iris$Species) %>% 
  ggplot(aes(x = x,y = y,col = species))+
  geom_point()+
  theme(legend.position = "bottom")

In my last post on dimensionality reduction methods, I used FIFA 18 player data to illustrate different methods. Of course we can also use this data with UMAP.

fifa_umap <- umap(fifa_data)

Here is what the result looks like.

tibble(x = fifa_umap[,1], y = fifa_umap[,2], Position = fifa_tbl$position2) %>% 
  ggplot(aes(x = x,y = y, col = Position))+
  geom_point()+
  theme(legend.position = "bottom")

One of the authors said in a tweeet, that inter-cluster distances are captured well by UMAP. For the FIFA player data this seems to be the case. The sausage-like point cloud transitions from defensive players to offensive players on the x axis. Midfielders are nicely embedded in between. I am pretty sure, however, that tweaking the parameters may yield even better results.


  1. The function can also be found on github as a gist.

To leave a comment for the author, please follow the link and comment on their blog: schochastics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)