# New features in genomation package

**Recipes, scripts and genomics**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

- Extending genomation to work with paired-end BAM files
- Accelerate functions responsible for reading genomic files
- Parallelizing data processing in ScoreMatrixList
- Arithmetic, indicator and logic operations as well as subsetting work on ScoreMatrix objects
- Improvements and new arguments in visualization functions
- Faster heatmaps
- New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix
- Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix
- Central tendencies in line plots: centralTend in plotMeta
- Smoothing central tendency: smoothfun in plotMeta
- Plotting dispersion around central lines in line plots: dispersion in plotMeta

- Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function
- Integration with Travis CI for auto-testing

Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.

Recently we added new features to genomation and here we present them on example of binding profiles of 6 transcription factors around the CTCF binding sites derived from ChIP-seq. All new functionalities are available in the latest version of genomation that can be found on its github website.

# install the package from github library(devtools) install_github("BIMSBbioinfo/genomation",build_vignettes=FALSE)

# Extending genomation to work with paired-end BAM files

Genomation can work with paired-end BAM files. Mates from reads are treated as fragments (they are stitched together).

library(genomation) genomationDataPath = system.file('extdata',package='genomationData') bam.files = list.files(genomationDataPath, full.names=TRUE, pattern='bam$') bam.files = bam.files[!grepl('Cage', bam.files)]

# Accelerate functions responsible for reading genomic files

This is achived by using *readr::read_delim* function to read genomic files instead of *read.table*. Additionally if skip=“auto” argument is provided in *readGeneric_or track.line=“auto” in other functions that read genomic files, e.g. _readBroadPeak* then UCSC header is detected (and first track).

library(GenomicRanges) ctcf.peaks = readBroadPeak(file.path(genomationDataPath, 'wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz')) ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == 'chr21'] ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)] ctcf.peaks = resize(ctcf.peaks, width=1000, fix='center')

# Parallelizing data processing in ScoreMatrixList

We use *ScoreMatrixList* function to extract coverage values of all transcription factors around ChIP-seq peaks. *ScoreMatrixList* was improved by adding new argument *cores*that indicates number of cores to be used at the same time (by using *parallel:mclapply*).

sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num=50, type='bam', cores=2) # descriptions of file that contain info. about transcription factors sampleInfo = read.table(system.file('extdata/SamplesInfo.txt', package='genomationData'),header=TRUE, sep='\t') names(sml) = sampleInfo$sampleName[match(names(sml),sampleInfo$fileName)]

# Arithmetic, indicator and logic operations as well as subsetting work on score matrices

Arithmetic, indicator and logic operations work on *ScoreMatrix*, *ScoreMatrixBin* and *ScoreMatrixList*

objects, e.i.:

Arith: “+”, “-”, “*”, “^{”,} “%%”, “%/%”, “/”

Compare: “==”, “>”, “<”, “!=”, “<=”, “>=”

Logic: “&”, “|”

sml1 = sml * 100 sml1 ## scoreMatrixlist of length:5 ## ## 1. scoreMatrix with dims: 1681 50 ## 2. scoreMatrix with dims: 1681 50 ## 3. scoreMatrix with dims: 1681 50 ## 4. scoreMatrix with dims: 1681 50 ## 5. scoreMatrix with dims: 1681 50

Subsetting:

sml[[6]] = sml[[1]] sml ## scoreMatrixlist of length:6 ## ## 1. scoreMatrix with dims: 1681 50 ## 2. scoreMatrix with dims: 1681 50 ## 3. scoreMatrix with dims: 1681 50 ## 4. scoreMatrix with dims: 1681 50 ## 5. scoreMatrix with dims: 1681 50 ## 6. scoreMatrix with dims: 1681 50 sml[[6]] <- NULL

# Improvements and new arguments in visualization functions

Due to large signal scale of rows of each element in the *ScoreMatrixList* we scale them.

sml.scaled = scaleScoreMatrixList(sml)

## Faster heatmaps

*HeatMatrix* and *multiHeatMatrix* function works faster by faster assigning colors. Heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))

## New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix

*clustfun* allow to add more clustering functions and integrate them with the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each point is allocated. Previous version of *multiHeatMatrix* could cluster rows of heatmaps using only k-means algorithm.

# k-means algorithm with 2 clusters cl1 <- function(x) kmeans(x, centers=2)$cluster multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1)

# hierarchical clustering with Ward's method for agglomeration into 2 clusters cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2) multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2) ## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

## Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix

clust.matrix argument indicates which matrices are used for clustering. It can be a numerical vector of indexes of matrices or a character vector of names of the ‘ScoreMatrix’ objects in 'ScoreMatrixList'. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1, clust.matrix = 1)

## Central tendencies in line plots: centralTend in plotMeta

We extended visualization capabilities for meta-plots. *plotMeta* function can plot not only mean, but also median as central tendency and it can be set up using *centralTend* argument. Previously user could plot only mean.

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled), xcoords=c(-500, 500), winsorize=c(0,99), centralTend="mean")

## Smoothing central tendency: smoothfun in plotMeta

We added *smoothfun* argument to smooth central tendency as well as dispersion bands around it which is shown in the next figure. Smoothfun has to be a function that returns a list that contains a vector of y coordinates (vector named '$y').

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled), xcoords=c(-500, 500), winsorize=c(0,99), centralTend="mean", smoothfun=function(x) stats::smooth.spline(x, spar=0.5))

## Plotting dispersion around central lines in line plots: dispersion in plotMeta

We added new argument *dispersion* to plotMeta that shows dispersion bands around *centralTend*. It can take one of the arguments:

- “se” shows standard error of the mean and 95 percent confidence interval for the mean
- “sd” shows standard deviation and 2*(standard deviation)
- “IQR” shows 1st and 3rd quartile and confidence interval around the median based on the median +/- 1.57 * IQR/sqrt(n) (notches)

plotMeta(mat=sml, profile.names=names(sml), xcoords=c(-500, 500), winsorize=c(0,99), centralTend="mean", smoothfun=function(x) stats::smooth.spline(x, spar=0.5), dispersion="se", lwd=4)

# Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function

We added new function *patternMatrix* that calculates k-mer and PWM occurrences over predefined equal width windows. If one pattern (character of length 1 or PWM matrix) is given then it returns ScoreMatrix, if more than one character ot list of PWM matrices then ScoreMatrixList. It finds either positions of pattern hits above a specified threshold and creates score matrix filled with 1 (presence of pattern) and 0 (its absence) or matrix with score themselves. *windows* can be a DNAStringList object or GRanges object (but then genome argument has to be provided, a BSgenome object).

#ctcf motif from the JASPAR database ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402, 291,145,49,800,903,13,528,433,11,0,3,12,0,8,733,13,482,322,181, 76,414,449,21,0,65,334,48,32,903,566,504,890,775,5,507,307,73,266, 459,187,134,36,2,91,11,324,18,3,9,341,8,71,67,17,37,396,59)), ncol=19,byrow=TRUE) rownames(ctcf.pfm) <- c("A","C","G","T") prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25) priorProbs = prior.params/sum(prior.params) postProbs = t( t(ctcf.pfm + prior.params)/(colSums(ctcf.pfm)+sum(prior.params)) ) ctcf.pwm = Biostrings::unitScale(log2(postProbs/priorProbs)) library(BSgenome.Hsapiens.UCSC.hg19) hg19 = BSgenome.Hsapiens.UCSC.hg19 p = patternMatrix(pattern=ctcf.pwm, windows=ctcf.peaks, genome=hg19, min.score=0.8)

Visualization of the patternMatrix *patternMatrix* (here as ScoreMatrix object) can be visualized using i.e. heatMatrix, heatMeta or plotMeta functions.

heatMatrix(p, xcoords=c(-500, 500), main="CTCF motif")

plotMeta(mat=p, xcoords=c(-500, 500), smoothfun=function(x) stats::lowess(x, f = 1/10), line.col="red", main="ctcf motif")

# Integration with Travis CI for auto-testing

Recently we integrated genomation with Travis CI. It allows users to see current status of the package which is updated during every change of the package. Travis automatically runs R CMD CHECK and reports it. Shields shown below are on the genomation github site:

https://github.com/BIMSBbioinfo/genomation

Status

# <br /> sessionInfo() ## R version 3.2.2 (2015-08-14) ## Platform: x86_64-apple-darwin13.4.0 (64-bit) ## Running under: OS X 10.10.5 (Yosemite) ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] stats4 parallel grid stats graphics grDevices utils ## [8] datasets methods base ## ## other attached packages: ## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3 ## [3] rtracklayer_1.28.10 Biostrings_2.36.4 ## [5] XVector_0.8.0 GenomicRanges_1.20.8 ## [7] GenomeInfoDb_1.4.3 IRanges_2.2.9 ## [9] S4Vectors_0.6.6 BiocGenerics_0.14.0 ## [11] genomation_1.1.27 BiocInstaller_1.18.5 ## [13] devtools_1.9.1 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.1 formatR_1.2.1 ## [3] futile.logger_1.4.1 plyr_1.8.3 ## [5] bitops_1.0-6 futile.options_1.0.0 ## [7] tools_3.2.2 zlibbioc_1.14.0 ## [9] digest_0.6.8 gridBase_0.4-7 ## [11] evaluate_0.8 memoise_0.2.1 ## [13] gtable_0.1.2 curl_0.9.3 ## [15] yaml_2.1.13 proto_0.3-10 ## [17] httr_1.0.0 stringr_1.0.0 ## [19] knitr_1.11 data.table_1.9.6 ## [21] impute_1.42.0 R6_2.1.1 ## [23] plotrix_3.5-12 XML_3.98-1.3 ## [25] BiocParallel_1.2.22 seqPattern_1.0.1 ## [27] rmarkdown_0.8.1 readr_0.1.1 ## [29] reshape2_1.4.1 ggplot2_1.0.1 ## [31] lambda.r_1.1.7 magrittr_1.5 ## [33] matrixStats_0.14.2 MASS_7.3-44 ## [35] scales_0.3.0 Rsamtools_1.20.5 ## [37] htmltools_0.2.6 GenomicAlignments_1.4.2 ## [39] colorspace_1.2-6 KernSmooth_2.23-15 ## [41] stringi_0.5-5 munsell_0.4.2 ## [43] RCurl_1.95-4.7 chron_2.3-47 ## [45] markdown_0.7.7

**leave a comment**for the author, please follow the link and comment on their blog:

**Recipes, scripts and genomics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.