A plot of genes on chromosomes

[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Marta Cifuentes and Wayne Crismani asked on Twitter if there is a web tool similar to the Arabidopsis Chromosome Map Tool that makes figures of genes on chromosomes for humans. This will not really be an answer to the question — not a web tool, not conveniently packaged — but I thought that would be a nice plot to make in R with ggplot2. We will use the ggrepel package to help with labelling, and get information from the Ensembl REST API with httpr and jsonlite.

The plot and the final code to generate it

Below I will go through the functions that get us there, but here is the end product. The code is on GitHub as usual.

## Some Ensembl genes to test with

ensembl_genes <- c("ENSG00000125845", ## BMP2
                   "ENSG00000181690", ## PLAG1
                   "ENSG00000177508", ## IRX3
                   "ENSG00000140718") ## FTO

chr_sizes <- get_chromosome_sizes_from_ensembl(species = "homo_sapiens")

coords <- get_coordinates_from_ensembl(ensembl_genes)

plot_genes_test <- plot_genes(coords,
                              chr_sizes)

We will use Ensembl and access the genes via Ensembl Gene IDs. Here, I’ve looked up the Ensembl Gene IDs for four genes I like (in humans).

We need to know how long human chromosomes are in order to plot them, so we have written a function for that; we also need to get coordinates for the genes, and we have written a function for that. They are both below. These functions call up the Ensembl REST API to get the data from the Ensembl database.

Finally, there is a plotting function that takes the coordinates and the chromosome sizes as input and return a ggplot2 plot.

Getting the data out of the Ensembl REST API

Now, starting from the top, we will need to define those functions to interact with the Ensembl REST API. This marvellous machine allows us to get data out of the Ensembl database over HTTP, writing our questions in the URL. It is nicely described with examples from multiple languages on the Ensembl REST API website.

An alternative to using the REST API would be to download gene locations from BioMart. This was my first thought. BioMart is more familiar to me than the REST API, and it also has the nice benefit that it is easy to download every gene and store it away for the future. However, there isn’t a nice way to get chromosome lengths from BioMart, so we would have to download them from somewhere else. This is isn’t hard, but I thought using the REST API for both tasks seemed nicer.

## Plot showing the location of a few genomes on chromosomes

library(httr)
library(jsonlite)
library(ggplot2)
library(ggrepel)
library(purrr)

## Get an endpoint from the Ensembl REST api and return parsed JSON

get_from_rest_api <- function(endpoint_string,
                              server = "https://rest.ensembl.org/") {
  rest <- GET(paste(server, endpoint_string, sep = ""),
              content_type("application/json"))
  
  stop_for_status(rest)
  
  fromJSON(toJSON(content(rest)))
}

This first function gets content by sending a request, checking whether it worked (and stopping with an error if it didn’t), and then unpacking the content into an R object.

## Get chromosomes sizes from the Ensembl REST API

get_chromosome_sizes_from_ensembl <- function(species) {

  json <- get_from_rest_api(paste("info/assembly/", species, sep = ""))

  data.frame(name = as.character(json$top_level_region$name),
             length = as.numeric(json$top_level_region$length),
             stringsAsFactors = FALSE)
}

This second function asks for the genome assembly information for a particular species with the GET info/assembly/:species endpoint, and extracts the chromosome lengths into a data frame. The first part of data gathering is done, now we just need the coordinates fort the genes of interest.

## Get coordinates from Ensembl ids

get_coordinates_from_ensembl <- function(ensembl_ids) {
 
  map_dfr(ensembl_ids,
          function(ei) {
            json <- get_from_rest_api(paste("lookup/id/", ei, sep = ""))
            
            data.frame(position = (json$start + json$end)/2,
                       chr = json$seq_region_name,
                       display_name = json$display_name,
                       stringsAsFactors = FALSE)
          })
}

This function asks for the gene information for each gene ID we’ve given it with the GET lookup/id/:id endpoint, and extracts the rough position (mean of start and end coordinate), chromosome name, and the ”display name”, which in the human case will be a gene symbol. (For genes that don’t have a gene symbol, we would need to set up this column ourselves.)

At this point, we have the data we need in two data frames. That means it’s time to make the plot.

Plotting code

We will build a plot with two layers: first the chromosomes (as a geom_linerange) and then the gene locations (as a geom_text_repel from the ggrepel package). The text layer will move the labels around so that they don’t overlap even when the genes are close to each other, and by setting the nudge_x argument we can move them to the side of the chromosomes.

Apart from that, we change the scale to set he order of chromosomes and reverse the scale of the y-axis so that chromosomes start at the top of the plot.

The function returns a ggplot2 plot object, so one can do some further customisation after the fact — but for some features one would have to re-write things inside the function.

plot_genes <- function(coordinates,
                       chromosome_sizes) {

  ## Restrict to chromosomes that are in data  
  chrs_in_data <-
    chromosome_sizes[chromosome_sizes$name %in% coordinates$chr,]
  chr_order <- order(as.numeric(chrs_in_data$name))
  
  ggplot() +
    geom_linerange(aes(x = name,
                       ymin = 1,
                       ymax = length/1e6),
                   size = 2,
                   colour = "grey",
                   data = chrs_in_data) +
    geom_text_repel(aes(x = chr,
                        y = position/1e6,
                        label = display_name),
                    nudge_x = 0.33,
                    data = coordinates) +
    scale_y_reverse() +
    ## Fix ordering of chromosomes on x-axis
    scale_x_discrete(limits = chrs_in_data$name[chr_order],
                     labels = chrs_in_data$name[chr_order]) +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    xlab("Chromosome") +
    ylab("Position (Mbp)")
  
}

Possible extensions

One feature from the Arabidopsis inspiration that was missing is the position of centromeres. We should be able to use the option ?bands=1 in the GET info/assembly/:species to get cytogenetic band information and separate p and q arms of chromosomes. This will not be universal though, and not available for most species I care about.

Except to make cartoons of gene positions, I think this might be a nice way to make plots of genome regions with very course resolution, i.e. linkage mapping results, where one could add lines to show genomic confidence intervals, for example.

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)