# Using UMAP in R with rPython

[This article was first published on **schochastics**, and kindly contributed to R-bloggers.]

I have written about dimensionality reduction methods before, and now there seems to be a new rising star in that field: *Uniform Manifold Approximation and Projection*, or UMAP for short. The paper is available on arXiv, but be warned: it is really math-heavy. From the abstract:

> UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.

This sounds promising, although the details are not so easy to comprehend. The authors already provide a Python implementation on GitHub, and I am pretty sure that there will be an R package fairly soon. For the time being, we can use the Python version with the help of the `rPython` package.

```r
# used packages
library(tidyverse) # for data wrangling
```

# UMAP in R with rPython

To use the Python version of UMAP in R, you first need to install it from GitHub. The following code defines an R function that internally calls the Python `UMAP` function.

```r
# install.packages("rPython")
umap <- function(x, n_neighbors = 10, min_dist = 0.1, metric = "euclidean") {
  # rPython passes plain data, so strip the column names first
  x <- as.matrix(x)
  colnames(x) <- NULL
  # define a small Python helper that runs UMAP and returns nested lists
  rPython::python.exec(
    c(
      "def umap(data,n,mdist,metric):",
      "\timport umap",
      "\timport numpy",
      "\tembedding = umap.UMAP(n_neighbors=n,min_dist=mdist,metric=metric).fit_transform(data)",
      "\tres = embedding.tolist()",
      "\treturn res"
    )
  )
  # call the helper from R and bind the returned rows into a matrix
  res <- rPython::python.call("umap", x, n_neighbors, min_dist, metric)
  do.call("rbind", res)
}
```
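The final `do.call("rbind", res)` deserves a short note. Judging by that call, `python.call` hands the embedding back as a list with one numeric vector per data point, and `rbind` stacks these into a matrix. A minimal sketch with made-up numbers (not actual UMAP output), assuming exactly that list shape:

```r
# Stand-in for the value returned by python.call(): one length-2 vector per point.
# The numbers are invented for illustration only.
res <- list(c(1.2, 3.4), c(5.6, 7.8), c(9.0, 1.1))

# Stack the vectors row by row into a numeric matrix.
emb <- do.call("rbind", res)

dim(emb)  # rows are points, columns are embedding dimensions
emb[, 1]  # first embedding coordinate of every point
```

This is why the plotting code further down can simply index the result with `res[,1]` and `res[,2]`.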

The default parameters are set to what the authors recommend. The Python version implements many different distance metrics, and these can also be used in this R function via the `metric` argument. Check out the Python code for the available options.
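Trying several metrics side by side is a quick way to see how much the choice matters. The following is only a sketch: `embed_with_metrics()` is a hypothetical helper, not part of rPython or UMAP, and it assumes the embedding function accepts a `metric` argument like the `umap()` wrapper defined earlier. The names `"manhattan"` and `"cosine"` are metrics offered by the Python package.

```r
# Hypothetical helper: embed the same data once per metric.
# embed_fun is any function with the signature f(x, metric = ...),
# for example the umap() wrapper defined earlier.
embed_with_metrics <- function(x, metrics, embed_fun) {
  out <- lapply(metrics, function(m) embed_fun(x, metric = m))
  names(out) <- metrics
  out
}

# Usage with the wrapper (requires a working rPython and Python UMAP setup):
# embeddings <- embed_with_metrics(iris[, 1:4],
#                                  c("euclidean", "manhattan", "cosine"),
#                                  umap)
# str(embeddings$cosine)
```

Passing the embedding function as an argument keeps the helper independent of the Python setup, so it can be tried with a dummy function first.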

Below is a quick example using the infamous iris data.

```r
data(iris)
res <- umap(iris[, 1:4])
tibble(x = res[, 1], y = res[, 2], species = iris$Species) %>%
  ggplot(aes(x = x, y = y, col = species)) +
  geom_point() +
  theme(legend.position = "bottom")
```

In my last post on dimensionality reduction methods, I used FIFA 18 player data to illustrate different methods. Of course, we can also use this data with UMAP, reusing the `fifa_data` object from that post.

```r
fifa_umap <- umap(fifa_data)
```

Here is what the result looks like.

```r
tibble(x = fifa_umap[, 1], y = fifa_umap[, 2], Position = fifa_tbl$position2) %>%
  ggplot(aes(x = x, y = y, col = Position)) +
  geom_point() +
  theme(legend.position = "bottom")
```

One of the authors said in a tweet that inter-cluster distances are captured well by UMAP. For the FIFA player data, this seems to be the case. The sausage-like point cloud transitions from defensive players to offensive players along the x-axis, with midfielders nicely embedded in between. I am pretty sure, however, that tweaking the parameters may yield even better results.
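One way to explore the parameters is to run the embedding over a small grid of `n_neighbors` and `min_dist` values and compare the plots. The sketch below is an assumption about how one might organise this, not something from the UMAP authors: `umap_grid()` is a hypothetical helper, and the grid values in the usage comment are arbitrary starting points.

```r
# Hypothetical helper: run an embedding function for every combination of
# n_neighbors and min_dist and stack the results into one long data frame.
# embed_fun must accept (x, n_neighbors = ..., min_dist = ...) and return a
# two-column matrix, like the umap() wrapper defined earlier.
umap_grid <- function(x, n_neighbors, min_dist, embed_fun) {
  grid <- expand.grid(n_neighbors = n_neighbors, min_dist = min_dist)
  runs <- lapply(seq_len(nrow(grid)), function(i) {
    emb <- embed_fun(x,
                     n_neighbors = grid$n_neighbors[i],
                     min_dist = grid$min_dist[i])
    data.frame(
      x = emb[, 1],
      y = emb[, 2],
      row = seq_len(nrow(emb)),  # index back into the original data
      n_neighbors = grid$n_neighbors[i],
      min_dist = grid$min_dist[i]
    )
  })
  do.call("rbind", runs)
}

# Usage with the FIFA data (requires a working rPython and Python UMAP setup):
# grid_res <- umap_grid(fifa_data, c(5, 15, 50), c(0.01, 0.1, 0.5), umap)
# grid_res$Position <- fifa_tbl$position2[grid_res$row]
# ggplot(grid_res, aes(x = x, y = y, col = Position)) +
#   geom_point() +
#   facet_grid(min_dist ~ n_neighbors) +
#   theme(legend.position = "bottom")
```

Each panel of the faceted plot then shows one parameter combination, which makes it easy to see where the transition from defenders to attackers is preserved best.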
