SQL commands in R

[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


For a class I’m taking this semester on genomics we’re dealing with some pretty large data and for this reason we’re learning to use mySQL. I decided to be a geek and do the assignments in R as well to demonstrate the ability of R to handle pretty large data sets quickly. Here’s our first bit of work in mySQL, solved in R:



BIOL 525 Lecture 3: In class work

Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

From the Downloads folder in the Resources section of the wiki:

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

I decided to download the .txt files from the class website, save them in my designated folder for this class and then infile them to R using the read.table() command. This is the simplest way of reading data into R. Note the argument sep='\t' in the read.table() command. This lets R know that the file is a tab-delimited file and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.
attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", 
    sep = "\t")
mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId", 
    "transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId", 
    "havanaTranscriptId", "ccdsId", "level", "transcriptClass")
names(attributes) <- mynames

genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
    sep = "\t")
mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", 
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", 
    "cdsEndStat", "exonFrames")
names(genes) <- mynames1

Answer the following questions:

1) How many entries are in the Gencode gene table?

dim(genes)[1]

## [1] 167536

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))

## [1] 142

3) How many of these entries are for genes on the negative strand.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-"))

## [1] 90

4) How many of these negative strand genes have more than 15 exons.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15))

## [1] 23

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2))

## [1] 3

6) How many entries (transcripts) are there for BRCA1.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15 & name2 == "BRCA1"))

## [1] 15

** Bonus **

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2)
unique(subset(attributes, geneName %in% my3genes)$geneStatus)

## [1] KNOWN
## Levels: KNOWN NOVEL PUTATIVE

To leave a comment for the author, please follow the link and comment on their blog: bRogramming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)