R vs. Perl/mySQL – an applied genomics showdown

March 8, 2013
By

(This article was first published on bRogramming, and kindly contributed to R-bloggers)

R vs. Perl/mySQL - an applied genomics showdown

Recently I was given an assignment for a class I'm taking that got me thinking about speed in R. This isn't something I'm usually concerned with, but the first time I tried to run my solution (ussing plyr's ddply() it was going to take all night to compute. I consulted the professor that taught me to use R and he said that if I could boil the computationally intensive portion of the program down to an apply() command I could take advantage of the vectorized nature of R and greatly improve my speed. I constructed my solution around this methodology and had great success. The assignment as it was given is as follows:

Match Regulatory Elements in the Genome to the Closest Gene

Genomic genius Ginny Genes has performed a gigantic genome-wide experiment to find regulatory elements in human embryonic stem cells. She wants to know what are the possible target genes of these regulatory elements. She has the locations of the regulatory elements in a bed-formatted, text, tab-delimited file with the following fields:
1) Chromosome
2) Start position in chromosome
3) End position in chromosome
4) Randomly assigned name
Your assignment is to help Ginny by finding for each regulatory element both the closest gene upstream and the closest gene downstream of the regulatory element.
The input file “RegulatoryElements.bed” contains the list of regulatory positions. You should open this file, read each line one at a time, search for the gene upstream and downstream closest based on the transcription start site, and print out the following information for each on a single line:
1) Chromosome of gene
2) Start position of gene
3) End position of gene
4) Name of gene (common name - name2 field)
5) Genomic strand (+ or -)
6) Name of corresponding regulatory element

My solution:

First I need to read the files described above into R. This is simple enough, but for added speed I use the colClasses = argument in the read.table() function. Basically this allows me to pre-specify what class each column of the file should be read in as. This saves some time because normally R decides this for you and if you're reading in a lot of data, this decision requires a lot of information and time. For example, if I have a file with a million rows of data, and all of the entries in column 1 are numbers except the last one, which is a character string, then R must check each entry in this column before eventually assigning it to be of class factor. This is time consuming. Also, R by default converts any column with even one string in it as a factor, which requires some conversion. I the example I just gave of 999,999,999 numbers and 1 string, if all 999,999,999 numbers were unique, I would get a factor with a million levels. This would take some time to create. Notice below that I avoid creating factor variables except in particularly simple to convert cases where there are only a few possible levels. Reading in things like Gene Name ascharacter columns instead of factors saves some time.
# Clear memory
rm(list = ls())

# Load in the file containing Regulatory Element data
mydat <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/RegulatoryElements.bed",
header = F, sep = "\t", colClasses = c("factor", "numeric", "numeric",
"character"))
names(mydat) <- c("chrom", "txStart", "txEnd", "name")

# Load in Genes data
genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt",
sep = "\t", header = F, colClasses = c("integer", "character", "factor",
"factor", rep("numeric", 4), "integer", "character", "character", "integer",
"character", "factor", "factor", "character"))
names(genes) <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
"cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat",
"cdsEndStat", "exonFrames")
These are pretty big data frames, particularly the genes data frame:
dim(mydat)
## [1] 129061      4
dim(genes)
## [1] 167536     16
Lets actually compare the time it takes to read the data with and without specifying my colClasses up front. This is easily done by wrapping the command you wish to time (in this case my call to read.table()) inside the function system.time(). I'll compare for just the large genes data frame.
# Time without specifying colClasses
system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt",
sep = "\t", header = F))
##    user  system elapsed 
## 27.19 0.11 27.41

# Time with colClasses specified
system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt",
sep = "\t", header = F, colClasses = c("integer", "character", "factor",
"factor", rep("numeric", 4), "integer", "character", "character", "integer",
"character", "factor", "factor", "character")))
##    user  system elapsed 
## 8.47 0.05 8.54
That's a pretty big difference!
A few more words about the system.time() function: This function doesn't just return a summary of the time ellapsed, it also actually executes the command wrapped inside. So after running either of the commands above I get a summary of the time ellapsed in my output and the object genes now exists in my workspace.
Now that I have my data loaded in, it's time to start mathcing regulatory elements in my data frame mydat to nearby genes in my huge data frame named genes.
Since it is my goal to eventually use an apply() function to accomplish this, and apply() functions are most efficient when each iteration only produces one number like an index or subscript, I will create an index variable in thegenes data frame that I will be searching through to find the nearest gene.
# Create and ID variable for the genes dataframe
genes$ID <- 1:nrow(genes)
Next, lets think about what we are actually going to do. For each regulatory element in mydat (one is described in each row), I want to find the closest gene, both upstream and downstream. For a particular regulatory element, I can immediately eliminate from my search all of the genes that are on a different chromosome. I anticipate that this will save me lots of search time by narrowing down where I'm searching. But I don't want to subset my enormousgenes dataframe for each iteration of my solution function that I will be apply()ing to mydat. So here I call on a cool function split() to break genes into seperate data frames (and deposit them into a list) one time instead of subsetting genes over and over again. To get a better idea of what split() does, let's look at a simple example.
First create an example data frame:
ex <- data.frame(chrom = rep(1:2, each = 13), txStart = letters)
ex
##    chrom txStart
## 1 1 a
## 2 1 b
## 3 1 c
## 4 1 d
## 5 1 e
## 6 1 f
## 7 1 g
## 8 1 h
## 9 1 i
## 10 1 j
## 11 1 k
## 12 1 l
## 13 1 m
## 14 2 n
## 15 2 o
## 16 2 p
## 17 2 q
## 18 2 r
## 19 2 s
## 20 2 t
## 21 2 u
## 22 2 v
## 23 2 w
## 24 2 x
## 25 2 y
## 26 2 z
Now split() it by the column chrom:
split(ex, ex$chrom)
## $`1`
## chrom txStart
## 1 1 a
## 2 1 b
## 3 1 c
## 4 1 d
## 5 1 e
## 6 1 f
## 7 1 g
## 8 1 h
## 9 1 i
## 10 1 j
## 11 1 k
## 12 1 l
## 13 1 m
##
## $`2`
## chrom txStart
## 14 2 n
## 15 2 o
## 16 2 p
## 17 2 q
## 18 2 r
## 19 2 s
## 20 2 t
## 21 2 u
## 22 2 v
## 23 2 w
## 24 2 x
## 25 2 y
## 26 2 z
This is a list of length 2. Each element in the list is a data frame. Notice the names of the list elements:
names(split(ex, ex$chrom))
## [1] "1" "2"
The list element named "1" contains all of the rows from the original data frame (ex) that had the value “1” in the chrom column.
Now lets diverge again from the programming and talk about the science of what we're trying to do. We are trying to find the gene closest to each transcription factor because that is the gene that the transcription factor most likely acts on. Transcription happens directionally. It always happens from the 5' end to the 3' end of the DNA strand. This would be left to right on the positive strand and right to left on the negative strand - both strands are numbered left to right, even the negative strand. How do I know this?
sum(genes$txEnd < genes$txStart)
## [1] 0
None of the txEnd (locant of where the gene ends) values are less than the txStart (where the gene begins) values. If the locants were numbered in the order of the way they were transcribed, all of the txStart sites on the negative strand would be less than the txEnd sites on the lagging (-) strand.
It only makes sense that a regulatory element would act on a gene downstream of it, so on the positive strand, we should look for the nearest txStart site and on the negative strand we should look for the nearest txEnd site. This is actually the transcription start site. So if we need to search different ways depending on which strand we are on, we can further subdivide the search task. Now I want to split() my genes data frame by chromosome and strand. How might I go about split()ing by 2 variables?
The easiest way would be to create a new variable that is a combination of chrom and strand, then split by that. An example:
ex$strand <- rep(c("+", "-"), 13)
ex$new_variable <- paste(ex$chrom, ex$strand)
ex
##    chrom txStart strand new_variable
## 1 1 a + 1 +
## 2 1 b - 1 -
## 3 1 c + 1 +
## 4 1 d - 1 -
## 5 1 e + 1 +
## 6 1 f - 1 -
## 7 1 g + 1 +
## 8 1 h - 1 -
## 9 1 i + 1 +
## 10 1 j - 1 -
## 11 1 k + 1 +
## 12 1 l - 1 -
## 13 1 m + 1 +
## 14 2 n - 2 -
## 15 2 o + 2 +
## 16 2 p - 2 -
## 17 2 q + 2 +
## 18 2 r - 2 -
## 19 2 s + 2 +
## 20 2 t - 2 -
## 21 2 u + 2 +
## 22 2 v - 2 -
## 23 2 w + 2 +
## 24 2 x - 2 -
## 25 2 y + 2 +
## 26 2 z - 2 -
split(ex, ex$new_variable)
## $`1 -`
## chrom txStart strand new_variable
## 2 1 b - 1 -
## 4 1 d - 1 -
## 6 1 f - 1 -
## 8 1 h - 1 -
## 10 1 j - 1 -
## 12 1 l - 1 -
##
## $`1 +`
## chrom txStart strand new_variable
## 1 1 a + 1 +
## 3 1 c + 1 +
## 5 1 e + 1 +
## 7 1 g + 1 +
## 9 1 i + 1 +
## 11 1 k + 1 +
## 13 1 m + 1 +
##
## $`2 -`
## chrom txStart strand new_variable
## 14 2 n - 2 -
## 16 2 p - 2 -
## 18 2 r - 2 -
## 20 2 t - 2 -
## 22 2 v - 2 -
## 24 2 x - 2 -
## 26 2 z - 2 -
##
## $`2 +`
## chrom txStart strand new_variable
## 15 2 o + 2 +
## 17 2 q + 2 +
## 19 2 s + 2 +
## 21 2 u + 2 +
## 23 2 w + 2 +
## 25 2 y + 2 +
Notice again the names of this list:
names(split(ex, ex$new_variable))
## [1] "1 -" "1 +" "2 -" "2 +"
This does the job nicely, but I can do this slightly more directly without cluttering up my workspace:
genes.list <- split(genes, paste(genes$chrom, genes$strand))
From this a list is created and each element of the list is a dataframe that is a subset of the genes dataframe. The first element is all those rows in genes which have chrom==1 and strand== -, named “chr1 -” and so on and so forth.
Next, I create 2 variables in mydat to pick out the appropriate element of genes.list to search through.
mydat$myvar1 <- paste(mydat$chrom, "+")
mydat$myvar2 <- paste(mydat$chrom, "-")
Now each row of mydat is associated with 2 elements of genes.list - the '+' and '-' data frames for the correct chromosome.
Now it is time to create a function to return the ID number (which is really a row number in the original genes dataframe) of the closest gene on the positive strand. I'll create a seperate function to find the closest gene on the negative strand.
In order to apply() this function later, I must reference my columns by indices, not by name. For example if I wanted to apply() a function that simply printed the chrom for each row in mydat, I would have to write print(x[1])and I could not write print(x$chrom) because x is a vector of just one row, not a data frame with names.
find.pos.indices <- function(x) {
f1 <- x[[5]]
f2 <- as.numeric(x[[2]]) # This is the value of the reg. element's txStart, converted to numeric (from integer)
myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txStart - f2)),
"ID"]
# This finds the row number within the correct subsetted dataframe that
# contains the closest gene, then returns the value of ID in that row.
}
Here lies the crux of this program. apply() functions in R take advantage of the vectorized nature of the language to do what would otherwise be done in loops in a much more efficient way. So I apply() my function over each row of the data frame named mydat and deposit the results into pos.indices, a vector of integers.
pos.indices <- apply(mydat, 1, find.pos.indices)
pos.indices is just a list of rownumbers that correspond to closest genes, so I use this to attach the variables of interest in the genes data frame to the rows of mydat. For example if pos.indices were only the number 10, R would assign the 10th value of genes$txStart to mydat$PtxStart.
mydat$PtxStart <- genes$txStart[pos.indices]
mydat$PtxEnd <- genes$txEnd[pos.indices]
mydat$Pname2 <- genes$name2[pos.indices]
mydat$Pstrand <- genes$strand[pos.indices]
Now repeat the same procedure from above, but for the negative strand.
find.neg.indices <- function(x) {
f1 <- x[[6]]
f2 <- as.numeric(x[[3]])
myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txStart - f2)),
"ID"]
}
neg.indices <- apply(mydat, 1, find.neg.indices)

mydat$NtxStart <- genes$txStart[neg.indices]
mydat$NtxEnd <- genes$txEnd[neg.indices]
mydat$Nname2 <- genes$name2[neg.indices]
mydat$Nstrand <- genes$strand[neg.indices]

# Get rid of the variables I created
mydat$myvar1 <- NULL
mydat$myvar2 <- NULL
Done! Now lets save the resulting data frame and see how long this entire process takes.
# Write the dataframe to a .csv file
write.csv(mydat, file = "< Path/Filename.csv >")
If you want to clock the time of this program, save the code as a .R file (an R script), then enter the following command:
system.time(source("<path_to_R_script>"))
On my computer, the entire process from reading in the data to writing the resulting data frame to a .csv file took 58.42 seconds. A friend of mine used Perl to interface with a mySQL database that already had the data loaded into it and his program took over 700 seconds to accomplish the same task. This was the way we were supposed to complete the assignment for class, but instead of struggling with the unfamiliar Perl syntax I used my native language ® and had great success. Many of my classmates had run times of 7 or more hours. Needless to say, that is a far cry from under a minute.
One of the ideas of the assignment was to get us to interface with a database and I did not do this, but I was kind of proud of myself that I managed to do the entire assignment in base R with no packages.

To leave a comment for the author, please follow the link and comment on his blog: bRogramming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.