ChIPseeker for ChIP peak annotation

April 13, 2014
By

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

ChIPpeakAnno WAS the only one R package for ChIP peak annotation. I used it for annotating peak in my recent study.

I found it does not consider the strand information of genes. I reported the bug to the authors, but they are reluctant to change.

So I decided to develop my own package, ChIPseeker, and it's now available in Bioconductor.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
> require(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> require(ChIPseeker)
> peakfile = system.file("extdata", "sample_peaks.txt", package="ChIPseeker")
> peakfile
[1] "/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library/ChIPseeker/extdata/sample_peaks.txt"
> peakAnno <- annotatePeak(peakfile, tssRegion=c(-3000, 1000), TranscriptDb=txdb, annoDb="org.Hs.eg.db")
>> loading peak file...
>> preparing features information...
>> identifying nearest features...
>> calculating distance from peak to TSS...
>> assigning genomic annotation...
>> adding gene annotation...
Loading required package: org.Hs.eg.db
>> done...
> head(peakAnno)
GRanges with 6 ranges and 17 metadata columns:
      seqnames                 ranges strand |    length    summit      tags
         <Rle>              <IRanges>  <Rle> | <integer> <integer> <integer>
  [1]    chr10 [105137980, 105138593]      * |       614       174         7
  [2]    chr10 [ 42644416,  42645383]      * |       968       669        27
  [3]     chr6 [162188189, 162188742]      * |       554       174         8
  [4]     chr4 [  2307246,   2307622]      * |       377       188         9
  [5]    chr13 [ 51791808,  51792240]      * |       433       267         8
  [6]    chr19 [ 58731678,  58732653]      * |       976       472        15
      X.10.log10.pvalue. fold_enrichment    FDR...                   annotation
               <numeric>       <numeric> <numeric>                  <character>
  [1]              52.80           15.32      <NA>    Exon (38885 exon 3 of 11)
  [2]              77.15            9.13      0.79                   Intergenic
  [3]              59.88           17.82      <NA> Intron (27755 intron 5 of 6)
  [4]              85.25           26.62      1.03    Exon (18944 exon 5 of 10)
  [5]              65.32           15.08      0.74                   Intergenic
  [6]              55.00            8.04      <NA>                   Intergenic
       geneChr geneStart   geneEnd geneLength geneStrand      geneId
      <factor> <integer> <integer>  <integer>   <factor> <character>
  [1]    chr10 105127724 105148822      21099          +        6877
  [2]    chr10  42827314  42863493      36180          -      441666
  [3]     chr6 161551057 161695107     144051          -       56895
  [4]     chr4   2249160   2263739      14580          -       10608
  [5]    chr13  51796470  51858377      61908          +      220108
  [6]    chr19  58740070  58775008      34939          +       27300
      distanceToTSS         ENSEMBL      SYMBOL
          <integer>     <character> <character>
  [1]        -10256 ENSG00000148835        TAF5
  [2]        218110            <NA>   LOC441666
  [3]        493635 ENSG00000026652      AGPAT4
  [4]         43883 ENSG00000123933        MXD4
  [5]         -4662 ENSG00000150510     FAM124A
  [6]         -8392 ENSG00000198131      ZNF544
                                                                              GENENAME
                                                                           <character>
  [1] TAF5 RNA polymerase II, TATA box binding protein (TBP)-associated factor, 100kDa
  [2]                                                zinc finger protein 91 pseudogene
  [3]                                   1-acylglycerol-3-phosphate O-acyltransferase 4
  [4]                                                       MAX dimerization protein 4
  [5]                                             family with sequence similarity 124A
  [6]                                                          zinc finger protein 544
  ---
  seqlengths:
    chr1 chr10 chr11 chr12 chr13 chr14 ...  chr5  chr6  chr7  chr8  chr9  chrX
      NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA
>

annotatePeak function can accept peak file or GRanges object that contained peak information.
The sample peakfile shown above is a sample output from MACS. All the information contained in peakfile will be kept in the output of annotatePeak.

The annotation column annotates genomic features of the peak, that is whether peak is located in Promoter, Exon, UTR, Intron or Intergenic. If the peak is annotated by Exon or Intron, more detail information will be provided. For instance, Exon (38885 exon 3 of 11) indicates that the peak is located at the 3th exon of gene 38885 (entrezgene ID) which contain 11 exons in total.

All the names of column that contains information of nearest gene annotated will start with gene in camelCaps style.

If parameter annoDb="org.Hs.eg.db" is provided, extra columns of ENSEMBL, SYMBOL and GENENAME will be provided in the final output.

For more detail, please refer to the package vignette. Comments or suggestions are welcome.

Related Posts

To leave a comment for the author, please follow the link and comment on his blog: YGC » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.