BLATting the internet: the most frequent gene?

January 23, 2014
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?


First, visit the UCSC Table Browser. For the human genome hg19 build, the relevant group is “Phenotype and Literature”, the track is “Web Sequences” and the table is pubsBingBlat. Check the “genome” button under regions, enter a filename (I chose bingblat.gz, with gzip compression) and then click “get output”.

(notes: there are other “Bing” tables – but pubsBingBlat contains gene symbols, so seems easiest to work with in the first instance. The following analysis could also be done using the UCSC MySQL server.)

Open the file in R. It contains 313 510 rows. Gene symbols are in the last column, “locus”, which contains either one symbol or two separated by a comma.

bingblat <- read.table("bingblat.gz", header = T, sep = "\t",
                       stringsAsFactors = F, comment.char = "", quote = "")
dim(bingblat)
# [1] 313510     27

names(bingblat)
#  [1] "X.bin"       "chrom"       "chromStart"  "chromEnd"    "name"       
#  [6] "score"       "strand"      "thickStart"  "thickEnd"    "reserved"   
# [11] "blockCount"  "blockSizes"  "chromStarts" "tSeqTypes"   "seqIds"     
# [16] "seqRanges"   "publisher"   "pmid"        "doi"         "issn"       
# [21] "journal"     "title"       "firstAuthor" "year"        "impact"     
# [26] "classes"     "locus"

head(bingblat$locus)
[1] "WASH2P,WASH7P" "WASH7P"        "OR4F5"         "OR4F5"        
[5] "OR4F5"         "OR4F5"

If we split locus on comma, then unlist, we’ll get a vector of gene symbols and we can count them up using table. The top 10:

genes <- unlist(strsplit(bingblat$locus, ","))
genes.df <- as.data.frame(table(genes))
head(genes.df[order(genes.df$Freq, decreasing=T),], 10)

#           genes  Freq
# 5308      GAPDH 11096
# 8886   MIR4436A  4341
# 6628      IGHG1  4245
# 186        ACTB  3280
# 242       ADAM6  2466
# 7172   KIAA0125  2135
# 7704  LINC00226  2082
# 4155     ELK2AP  1714
# 15728      TP53  1398
# 6043        HBB  1380

GAPDH, by a long way. You’d guess that this is related to the frequent use of GAPDH as a “housekeeping gene” in assays for differential expression.

Here’s the full list with counts, as a public (I hope) TSV file at Dropbox.


Filed under: bioinformatics, genomics, R, statistics Tagged: blat, microsoft research, ucsc

To leave a comment for the author, please follow the link and comment on his blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.