SQL commands in R

January 22, 2013
By

(This article was first published on bRogramming, and kindly contributed to R-bloggers)


For a class I'm taking this semester on genomics we're dealing with some pretty large data and for this reason we're learning to use mySQL. I decided to be a geek and do the assignments in R as well to demonstrate the ability of R to handle pretty large data sets quickly. Here's our first bit of work in mySQL, solved in R:



BIOL 525 Lecture 3: In class work

Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

From the Downloads folder in the Resources section of the wiki:

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

I decided to download the .txt files from the class website, save them in my designated folder for this class and then infile them to R using the read.table() command. This is the simplest way of reading data into R. Note the argument sep='\t' in the read.table() command. This lets R know that the file is a tab-delimited file and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.
attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", 
sep = "\t")
mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId",
"transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId",
"havanaTranscriptId", "ccdsId", "level", "transcriptClass")
names(attributes) <- mynames

genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt",
sep = "\t")
mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
"cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat",
"cdsEndStat", "exonFrames")
names(genes) <- mynames1

Answer the following questions:

1) How many entries are in the Gencode gene table?

dim(genes)[1]
## [1] 167536

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))
## [1] 142

3) How many of these entries are for genes on the negative strand.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
strand == "-"))
## [1] 90

4) How many of these negative strand genes have more than 15 exons.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
strand == "-" & exonCount > 15))
## [1] 23

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
41642846 & strand == "-" & exonCount > 15)$name2))
## [1] 3

6) How many entries (transcripts) are there for BRCA1.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
strand == "-" & exonCount > 15 & name2 == "BRCA1"))
## [1] 15

** Bonus **

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
41642846 & strand == "-" & exonCount > 15)$name2)
unique(subset(attributes, geneName %in% my3genes)$geneStatus)
## [1] KNOWN
## Levels: KNOWN NOVEL PUTATIVE

To leave a comment for the author, please follow the link and comment on his blog: bRogramming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.