Visualize coverage for targeted NGS (exome) experiments

March 20, 2014

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

I’m calling variants from exome sequencing data and I need to evaluate the efficiency of the capture and the coverage along the target regions.

This sounds like a great use case for bedtools, your swiss-army knife for genomic arithmetic and interval manipulation. I’m lucky enough to be able to walk down the hall and bug Aaron Quinlan, creator of bedtools, whenever I have a “how do I do X with bedtools?” question (which happens frequently).

As open-source bioinformatics software documentation goes, bedtools’ documentation is top-notch. In addition, Aaron recently pointed out a work-in-progress bedtools cookbook that he’s putting together, giving code examples for both typical and clever uses of bedtools.

Getting back to my exome data, one way to visualize this is to plot the cumulative distribution describing the fraction of targeted bases that were covered by >10 reads, >20 reads, >80 reads, etc. For example, covering 90% of the target region at 20X coverage may be one metric to assess your ability to reliably detect heterozygotes. Luckily for me, there’s a bedtools protocol for that.

The basic idea is that for each sample, you’re using bedtools coverage to read in both a bam file containing your read alignments and a bed file containing your target capture regions (for example, you can download NimbleGen’s V3 exome capture regions here). The -hist option outputs a histogram of coverage for each feature in the BED file as well as a summary histogram for all of the features in the BED file. Below I’m using GNU parallel to run this on all 6 of my samples, piping the output of bedtools to grep out only the lines starting with “all.”

Now that I have text files with coverage histograms for all the regions in the capture target, I can now plot this using R.

You can see that sample #2 had some problems. Relative to the rest of the samples you see it take a quick nose-dive on the left side of the plot relative to the others. Running this through Picard showed that 86% of the reads from this sample were duplicates.

Thanks, Aaron, for the tips.

Bedtools protocols:

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training



CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)