# UMAP clustering in Python

**poissonisfish**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The aim of this short Python tutorial is to introduce the uniform manifold approximation and projection (UMAP) algorithm, using 76,533 single-cell expression profiles from the human primary motor cortex. The data are available from the Cell Types database, which is part of the Allen Brain Map platform.

The UMAP has quickly established itself as a go-to clustering tool well poised to expand our knowledge of various many things, including the human brain. I hope by the end of this tutorial you will have a broad understanding of the UMAP algorithm and how to implement it.

# Introduction

### The UMAP algorithm

Uniform manifold approximation and projection (UMAP)^{1} is a scalable and efficient dimension reduction algorithm that performs competitively among state-of-the-art methods such as t-SNE^{2}, and widely applied for unsupervised clustering.

To effectively approximate a uniformly distributed manifold in the data, this neighbour-graph technique first defines fuzzy simplicial sets using local metric spaces and then patches them together into a single topological structure. Of note, these local metric spaces are determined from the distance of each individual example to its *k*-th nearest neighbour, where the user-provided *k* balances the local-global structure representation. Next, a low-dimensional layout – the so-called embedding – is constructed from the fuzzy set cross-entropy that matches the largest edge weights from the topological structure with the shortest pairwise distances in the layout itself, and vice-versa.

In practice, to implement the UMAP users must provide the minimum distance to separate close observations in the embedding, as well as the numbers of training epochs, embedding dimensions and neighbours. Tweaking these hyperparameters may drastically change the resulting embedding, so these should be carefully tuned.

In addition to the main manuscript^{1} and this excellent walkthrough, you may find the presentation below from the first author himself just as enlightening.

Since its release, the UMAP has been widely applied to large, high-dimensional molecular data and particularly single-cell expression profiling^{3} ^{4} ^{5}, a technology that will be introduced next.

### Single-cell expression profiling

In the course of the last two decades, developments in molecular and cellular biology unlocked ever-increasing detail over developmental and metabolic processes in living organisms. Unlike DNA sequencing, RNA sequencing and quantification (RNA-Seq) captures the activity of tens of thousands of genes simultaneously and therefore more closely reflects the interplay among the various different cellular processes.

However, the earliest RNA-Seq workflows presented technical challenges that prevented their application to single cells and consequently access to cell-to-cell variability. Perhaps most critically, these workflows required large amounts of total RNA and lacked the ability to isolate and label single cells. Only recently, the advent of single-cell RNA-Seq (scRNA-Seq) solved these two fundamental problems and helped uncovering cell identity^{6} ^{7}. Among other things, we now have access to a much richer understanding of health and disease, which holds the key to discovering therapeutic targets for a range of diseases.

This widespread technology is largely based on microfluidic chips where RNA molecules of individual cells are tagged with unique barcode sequences. For a end-to-end overview of scRNA-Seq, below is a great explanation from an instrument and chip manufacturer.

For the present UMAP tutorial we will use scRNA-Seq expression profiles from the human primary motor cortex (M1). The dataset and some additional experimental information are available here – no need to download it manually!

# Let’s get started with Python

From an IPython console such as that from Spyder, we start off with importing a handful of modules. Most can be installed from the `conda-forge`

channel, while in contrast `umap`

and `datatable`

can be installed with the commands `!pip install umap-learn`

and `!pip install datatable`

, respectively. Besides the usual `numpy`

and `pandas`

dependencies for handling structured data, `datatable`

and `seaborn`

are each imported to more quickly load the scRNA-Seq and easily produce a two-dimensional embedding scatterplot, respectively.

Next, we set up a working directory of our choice and download both the large scRNA-Seq dataset and the associated metadata, the latter to derive the cell subgroup labels and distinguish the embedded expression profiles. To this end I opted for a couple of `wget`

Bash commands, but any equivalent method will also work.

# Imports | |

#!pip install datatable | |

import os, umap | |

import numpy as np | |

import pandas as pd | |

import datatable as dt | |

import seaborn as sns | |

os.chdir('PATH/TO/WDIR') | |

# Download expression matrix (matrix.csv) and metadata (metadata.csv) | |

!wget https://idk–etl–prod–download–bucket.s3.amazonaws.com/aibs_human_m1_10x/matrix.csv | |

!wget https://idk–etl–prod–download–bucket.s3.amazonaws.com/aibs_human_m1_10x/metadata.csv |

Once the downloads are complete, you will certainly notice `matrix.csv`

is approximately 7GB in size. We could try lifting this behemoth using `pd.read_csv()`

with an appropriate choice of `chunksize`

, but we will instead use the nimble `dt.fread()`

. A file preview using the Bash command `cut -d, -f-5 matrix.csv | head`

suggests that `sample_name`

aside, all columns are encoded as integer type. Therefore, to facilitate the data ingestion I passed a Bash command to exclude `sample_name`

prior to importing and apply a `dt.int32`

encoding on the filtered data. Please make sure to leverage parallelisation if possible, by setting and appropriate value of `nthreads`

when executing `dt.fread()`

.

# Check first five columns in matrix.csv | |

#!cut -d, -f-5 matrix.csv | head | |

# Import data with Bash command discarding first column | |

matrix = dt.fread(cmd='cut -d, -f2- matrix.csv', | |

header=True, sep=',', columns=dt.int32) # ~7 GB (76533, 50281) | |

# Import metadata | |

metadata = pd.read_csv('metadata.csv') |

The resulting object `matrix`

comprises a total of 76,533 expression profiles across 50,281 genes or expression features. If your RAM allows, the `to_numpy()`

and `to_pandas()`

methods will directly convert the `datatable`

to the familiar NumPy or Pandas formats, respectively. To learn more about how to manipulate `datatable`

objects check out the official documentation.

Next, for the sake of the quality and runtime of the UMAP training, we use `np.apply_along_axis()`

to identify and discard expression features that equal or exceed 50% of null expression levels. I leave any further data cleansing up to you.

# Remove expression features with > 50% zero-valued expression levels | |

is_expressed = np.apply_along_axis(lambda x: np.mean(x == 0) < .5, arr=matrix, axis=0) | |

matrix = matrix[:,is_expressed.tolist()] | |

# Log2-transform | |

matrix = np.log2(matrix.to_numpy() + 1) |

Now that 4,078 expression features were selected and log-transformed, we can proceed with fitting the UMAP and examining the resulting two-dimensional embedding. For this purpose I employed a `min_dist`

of 0.25, `n_neighbors`

of 30 and the default Euclidean `metric`

. One advantage of the Euclidean metric is that it implicitly factors in the absolute differences between every pair of expression profiles, under the expectation that similar cells match their profiles to the magnitude of expression. Finally, a fixed `random_state`

will ensure we arrive at the same embedding.

Before moving on with the UMAP embedding visualisation, it would help to extract some experimental information from the `metadata`

table in order to make sense of the embedding. One of the many interesting labelling options is cell `subclass`

, which will be used below.

# Define UMAP | |

brain_umap = umap.UMAP(random_state=999, n_neighbors=30, min_dist=.25) | |

# Fit UMAP and extract latent vars 1-2 | |

embedding = pd.DataFrame(brain_umap.fit_transform(matrix), columns = ['UMAP1','UMAP2']) | |

# Produce sns.scatterplot and pass metadata.subclasses as color | |

sns_plot = sns.scatterplot(x='UMAP1', y='UMAP2', data=embedding, | |

hue=metadata.subclass_label.to_list(), | |

alpha=.1, linewidth=0, s=1) | |

# Adjust legend | |

sns_plot.legend(loc='center left', bbox_to_anchor=(1, .5)) | |

# Save PNG | |

sns_plot.figure.savefig('umap_scatter.png', bbox_inches='tight', dpi=500) |

A grand total of 76,533 human brain cells are represented in the figure above, how cool is that? Under just a few minutes, the UMAP captured a topological representation of the single-cell expression dataset that – given no label information – neatly sorted the different cell subtypes from the human primary motor cortex. Any expert out there daring to comment whether the proximity of the clusters makes anatomical sense?

With that I hope you have a better understanding of the UMAP and how to apply it using Python. See you next time!

# References

**leave a comment**for the author, please follow the link and comment on their blog:

**poissonisfish**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.