SQL commands in R

Posted on January 22, 2013 by Robert Norberg in R bloggers | 0 Comments

[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For a class I’m taking this semester on genomics we’re dealing with some pretty large data and for this reason we’re learning to use mySQL. I decided to be a geek and do the assignments in R as well to demonstrate the ability of R to handle pretty large data sets quickly. Here’s our first bit of work in mySQL, solved in R:

BIOL 525 Lecture 3: In class work

Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

From the Downloads folder in the Resources section of the wiki:

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

I decided to download the .txt files from the class website, save them in my designated folder for this class and then infile them to R using the read.table() command. This is the simplest way of reading data into R. Note the argument sep='\t' in the read.table() command. This lets R know that the file is a tab-delimited file and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.

attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", 
    sep = "\t")
mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId", 
    "transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId", 
    "havanaTranscriptId", "ccdsId", "level", "transcriptClass")
names(attributes) <- mynames

genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
    sep = "\t")
mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", 
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", 
    "cdsEndStat", "exonFrames")
names(genes) <- mynames1

Answer the following questions:

1) How many entries are in the Gencode gene table?

dim(genes)[1]

## [1] 167536

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))

## [1] 142

3) How many of these entries are for genes on the negative strand.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-"))

## [1] 90

4) How many of these negative strand genes have more than 15 exons.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15))

## [1] 23

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2))

## [1] 3

6) How many entries (transcripts) are there for BRCA1.

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15 & name2 == "BRCA1"))

## [1] 15

Bonus

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2)
unique(subset(attributes, geneName %in% my3genes)$geneStatus)

## [1] KNOWN
## Levels: KNOWN NOVEL PUTATIVE

To leave a comment for the author, please follow the link and comment on their blog: bRogramming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

SQL commands in R

BIOL 525 Lecture 3: In class work

Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

Answer the following questions:

1) How many entries are in the Gencode gene table?

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

3) How many of these entries are for genes on the negative strand.

4) How many of these negative strand genes have more than 15 exons.

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

6) How many entries (transcripts) are there for BRCA1.

Bonus

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

Related

BIOL 525 Lecture 3: In class work

Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

Answer the following questions:

1) How many entries are in the Gencode gene table?

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

3) How many of these entries are for genes on the negative strand.

4) How many of these negative strand genes have more than 15 exons.

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

6) How many entries (transcripts) are there for BRCA1.

** Bonus **

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Bonus

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)