Bug of R package ChIPpeakAnno
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I used R package ChIPpeakAnno for annotating peaks, and found that it handle the DNA strand in the wrong way. Maybe the developers were from the computer science but not biology background.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | > require(ChIPpeakAnno) > packageVersion("ChIPpeakAnno") [1] '2.10.0' > peak <- RangedData(space="chr1", IRanges(24736757, 24737528)) > data(TSS.human.GRCh37) > ap <- annotatePeakInBatch(peak, Annotation=TSS.human.GRCh37) > ap RangedData with 1 row and 9 value columns across 1 space space ranges | peak strand <factor> <IRanges> | <character> <character> 1 ENSG00000001461 1 [24736757, 24737528] | 1 + feature start_position end_position insideFeature <character> <numeric> <numeric> <character> 1 ENSG00000001461 ENSG00000001461 24742284 24799466 upstream distancetoFeature shortestDistance fromOverlappingOrNearest <numeric> <numeric> <character> 1 ENSG00000001461 -5527 4756 NearestStart |
In this example, I defined a peak ranging from chr1:24736757 to chr1:24737528 and annotated the peak using ChIPpeakAnno package.
It returns that the nearest gene is ENSG00000001461, whose gene symbol is NIPAL3.
1 2 3 4 5 | > require(org.Hs.eg.db) > gene.ChIPpeakAnno <- select(org.Hs.eg.db, key=ap$feature, keytype="ENSEMBL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL")) > gene.ChIPpeakAnno ENSEMBL ENTREZID SYMBOL 1 ENSG00000001461 57185 NIPAL3 |
When looking at the peak in Genome Browser, I found the nearest gene is STPG1.
The gene symbol, STPG1, was converted to ENTREZID for future processing.
1 2 3 4 | > gene.nearest <- select(org.Hs.eg.db, key="STPG1", keytype="SYMBOL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL")) > gene.nearest SYMBOL ENSEMBL ENTREZID 1 STPG1 ENSG00000001460 90529 |
We can query the coordination of these two genes, and compare their distances to the peak.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | > require(TxDb.Hsapiens.UCSC.hg19.knownGene) > knownGene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") > > gene.ChIPpeakAnno.info <- knownGene[[gene.ChIPpeakAnno$ENTREZID]] > gene.ChIPpeakAnno.info GRanges with 4 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr1 [24742245, 24781314] + | 618 uc010oek.2 [2] chr1 [24742245, 24799473] + | 619 uc001bjh.3 [3] chr1 [24742245, 24799473] + | 620 uc009vrc.3 [4] chr1 [24782628, 24792864] + | 621 uc001bji.3 --- seqlengths: chr1 chr2 ... chrUn_gl000249 249250621 243199373 ... 38502 |
After getting the information of the gene annotated by ChIPpeakAnno, I also found that the ranges of the gene is also slightly different from the one returned by annotatePeakInBatch function.
As the gene, NIPAL3, is encoded in + strand, the nearest distance is:
1 2 | > min(abs(start(peak) - start(gene.ChIPpeakAnno.info))) [1] 5488 |
While the gene, STPG1, is encoded in – strand, the end of the gene coordination is actually the start position of the gene and the start of the gene coordination is the end position. So the distance should be calculated by the end coordination.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | > gene.nearest.info <- knownGene[[gene.nearest$ENTREZID]] > gene.nearest.info GRanges with 6 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr1 [24683489, 24700300] - | 4706 uc010oej.2 [2] chr1 [24683489, 24740262] - | 4707 uc001bja.3 [3] chr1 [24683489, 24740262] - | 4708 uc001bjb.3 [4] chr1 [24683489, 24740262] - | 4709 uc001bjc.3 [5] chr1 [24683489, 24741587] - | 4710 uc001bjd.3 [6] chr1 [24695211, 24718169] - | 4711 uc001bjf.3 --- seqlengths: chr1 chr2 ... chrUn_gl000249 249250621 243199373 ... 38502 > min(abs(end(peak) - end(knownGene[[gene.nearest$ENTREZID]]))) [1] 2734 |
STPG1 is more close to the peak than NIPAL3. Those genes encoded in – strand can’t be well-handled by ChIPpeakAnno package, it sucks.
It’s not hard to implement a function to annotate peaks. Preparing the gene annotation and calculating the distances among genes and peaks for finding the nearest gene of the peak. That’s it.
Related Posts
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.