(This article was first published on

My goal is to develop a means of detecting chromosome bias from a human BAM file.**Jermdemo Raised to the Law**, and kindly contributed to R-bloggers)Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.

BSgenome might sound like horsecrap, but each

**B**io

**s**trings-based

**genome**data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.

BSgenome at Bioconductor

Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.

Here are the BSgenomes available today:

> available.genomes(type=getOption("pkgType"))Select and load hg19

BioC_mirror = http://www.bioconductor.org

Change using chooseBioCmirror().

[1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"

[3] "BSgenome.Athaliana.TAIR.01222004" "BSgenome.Athaliana.TAIR.04232008"

[5] "BSgenome.Btaurus.UCSC.bosTau3" "BSgenome.Btaurus.UCSC.bosTau4"

[7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"

[9] "BSgenome.Cfamiliaris.UCSC.canFam2" "BSgenome.Dmelanogaster.UCSC.dm2"

[11] "BSgenome.Dmelanogaster.UCSC.dm3" "BSgenome.Drerio.UCSC.danRer5"

[13] "BSgenome.Drerio.UCSC.danRer6" "BSgenome.Ecoli.NCBI.20080805"

[15] "BSgenome.Ggallus.UCSC.galGal3" "BSgenome.Hsapiens.UCSC.hg17"

[17] "BSgenome.Hsapiens.UCSC.hg18" "BSgenome.Hsapiens.UCSC.hg19"

[19] "BSgenome.Mmusculus.UCSC.mm8" "BSgenome.Mmusculus.UCSC.mm9"

[21] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Rnorvegicus.UCSC.rn4"

[23] "BSgenome.Scerevisiae.UCSC.sacCer1" "BSgenome.Scerevisiae.UCSC.sacCer2"

biocLite("BSgenome.Hsapiens.UCSC.hg19")

library("BSgenome.Hsapiens.UCSC.hg19")

When we get an alignment file one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mapability. And yet I think this is still a worthwhile exercise and should be part of any ngs sequencing pipeline.

What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequencable regions such as centromeres, scaffold gap delinations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).

Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the IRanges package for more information).

> masks(Hsapiens)Now I want to create a data frame of with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think plyr, specifically ldply (

Error in function (classes, fdef, mtable) :

unable to find an inherited method for function masks, for signature "BSgenome"

#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects

> masks(Hsapiens$chrY)

MaskCollection of length 4 and width 59373566

masks:

maskedwidth maskedratio active names desc

1 33720000 0.567929506 TRUE AGAPS assembly gaps

2 0 0.000000000 TRUE AMB intra-contig ambiguities (empty)

3 16024357 0.269890426 FALSE RM RepeatMasker

4 587815 0.009900281 FALSE TRF Tandem Repeats Finder [period<=12]

all masks together:

maskedwidth maskedratio

49783032 0.8384713

all active masks together:

maskedwidth maskedratio

33720000 0.5679295

#I think the maskedwidth should reveal sum of actively masked nucleotides

> maskedwidth(Hsapiens$chrY)

[1] 33720000

#can we mess with the masks?

> active(masks(Hsapiens$chrY))["RM"]<-TRUE

Error in `$<-`(`*tmp*`, "chrY", value = < S4 object of class "MaskedDNAString">) :

no method for assigning subsets of this S4 class

#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class

> chrY<-Hsapiens$chrY

> active(masks(chrY))["RM"]<-TRUE

> maskedwidth(chrY)

[1] 49744357

# ok maskedwidth is working as I figured, but i need unmasked width

> unmaskedWidth<-function(chr){length(chr)-maskedwidth(chr)}

> unmaskedWidth(Hsapiens$chrY)

[1] 25653566

#how can I iterate over something with a $ operator? let's try [[]]

> unmaskedWidth(Hsapiens[["chrY"]])

[1] 25653566

**l**ist to

**d**ata frame).

# let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr

> maskedSizes<-ldply(.data=seqnames(Hsapiens)[1:24],

.fun=function(x){

data.frame(chr=x,seqlength=length(Hsapiens[[x]]),

unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))},

.progress="text",

.parallel=TRUE)

> maskedSizes

chr seqlength unmaskedWidth

1 chr1 249250621 225280621

2 chr2 243199373 238204518

3 chr3 198022430 194797135

4 chr4 191154276 187661676

5 chr5 180915260 177695260

6 chr6 171115067 167395066

7 chr7 159138663 155353663

8 chr8 146364022 142888922

9 chr9 141213431 120143431

10 chr10 135534747 131314738

11 chr11 135006516 131129516

12 chr12 133851895 130481393

13 chr13 115169878 95589878

14 chr14 107349540 88289540

15 chr15 102531392 81694766

16 chr16 90354753 78884753

17 chr17 81195210 77795210

18 chr18 78077248 74657229

19 chr19 59128983 55808983

20 chr20 63025520 59505520

21 chr21 48129895 35106642

22 chr22 51304566 34894545

23 chrX 155270560 151100560

24 chrY 59373566 25653566

Load the BAM file and get read counts in a data frame.

#other methods include scanBam and readAligned

bamFile<-readBamGappedAlignments("myIndexedSortedBamFile.bam")

> levels(rname(bamFile))

[1] "1" "2" "3" "4" "5"

[6] "6" "7" "8" "9" "10"

[11] "11" "12" "13" "14" "15"

[16] "16" "17" "18" "19" "20"

[21] "21" "22" "X" "Y" "MT"

[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"

[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"

[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"

[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"

[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"

[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"

[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"

[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"

[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"

[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"

[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"

[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"

#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest

levels(rname(bamFile))[1:25]<-paste('chr',c(1:22,'X','Y','M'),sep='')

#run length encoded read counts per chromosome

readRle<-rname(bamFile)

#get a data frame with chromosome and read counts

allReadsDf<-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])})

> head(allReadsDf)

chr reads

1 chr1 3616909

2 chr2 3642052

3 chr3 2843019

4 chr4 2636141

5 chr5 2590352

6 chr6 2497123

Merge the read counts with unmasked chromosome lengths and plot their relationship.

chrSizesReads<-merge(maskedSizes,readCounts,sort=FALSE)There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size.

library(ggplot2)

p<-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) +

geom_point() +

geom_text(vjust=2,size=3) +

stat_smooth(method="lm", se=TRUE,level=0.95) +

ylab("Reads aligned") +

xlab("Unmasked chromosome size") +

opts(title = "Reads vs Chromosome Size")

print(p)

> mylm<-lm(reads~unmaskedWidth,data=chrSizesReads)The low p-value (that chr size has no influence) and R-squared (predictive value of the linear model) suggest this model is sound.

> mysummary<-summary(mylm)

> mysummary

Call:

lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)

Residuals:

Min 1Q Median 3Q Max

-271816 -108122 -43984 42826 676284

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.774e+05 9.505e+04 1.866 0.0754 .

unmaskedWidth 1.455e-02 7.145e-04 20.365 9.12e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 206600 on 22 degrees of freedom

Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473

F-statistic: 414.8 on 1 and 22 DF, p-value: 9.123e-16

The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.

Chromosome bias refers to uneven read alignment distribution across various chromosomes. We can expect some chromosome bias in treatment sets because of the inherient nature any experimental conditions - recovered fragments will not be evenly distributed among chromosomes because regions of affect are not evenly distributed. Other possible factors of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning the against an incorrect set of sex chromsomes. Aligners will typically randomly, evenly, assign discrete positions to reads which map ambiguously to multiple locations.

> p<-qplot(chrSizesReads$chr,rstandard(mylm))+

aes(label=chrSizesReads$chr)+

geom_text(vjust=2,size=3)+

xlab("Chromosome")+

ylab("Std Residual from lm (reads)")+

geom_abline(slope=0,intercept=0)+

opts(axis.text.x = theme_text(angle=45,hjust=1))+

opts(title = "Linear Regression Residuals")

> print(p)

Fortunately, there is no clear pattern to these residual values, which would indicate some model problems, but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic.

#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code

>lundcrit<-function(a, n, q) {

F<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)

crit<-((n-q)*F/(n-q-1+F))^0.5

crit

}

> n<-nrow(chrSizesReads)

> q<-length(mylm$coefficients)

> crit<-lundcrit(0.05,n,q)

> chrSizesReads[which(rstandard(mylm)>crit),"chr"]

[1] chrX

Happy holidays!

To

**leave a comment**for the author, please follow the link and comment on his blog:**Jermdemo Raised to the Law**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...