Generate a single contig, hybrid assembly of E coli using MiSeq and MinION data

[This article was first published on R – Opiniomics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If you have MiSeq and ONT MinION data, it is now very easy to combine them into a single-contig, high quality assembly.  Here I will start with a base Ubuntu 14.04 LTS distribution and take you through every step, including software installation.  I am starting with an Amazon EC2 m4.xlarge AMI with 4 vCPUs and 16Gb RAM.  If you want to repeat the work below, you don’t need to use EC2, any computer running Ubuntu and with enough disk/RAM should work.

For MinION data, we will use Nick’s SQK-MAP-006 data at the EBI.  Using a simple trick, and because I know the internal structure of the tar archive, we can download only the base-called data:

curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA540/ERA540530/oxfordnanopore_native/MAP006-1.tar | tar -xf - MAP006-1/MAP006-1_downloads &

We can also get some MiSeq data from ENA from the same strain, E coli K-12 MG1655:

wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_2.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR262/005/SRR2627175/SRR2627175_1.fastq.gz &

Now, while that is downloading, we need to install some software on the server:

sudo apt-get update
sudo apt-get install r-base-core git samtools

This installs R which we will use to extract the MinION data (using poRe), git and samtools.  Just reply “Y” to any questions.

We now need to install the poRe dependencies in R, which is very easy:

R
source("http://www.bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages(c("shiny","bit64","data.table","svDialogs"))
q()

R may ask if you want to install into a local library, just say Y and accept defaults.  We need to download poRe from sourecforge and we are using version 0.16

Once downloaded, and back at the Linux command line:

R CMD INSTALL poRe_0.16.tar.gz

The fastq extraction scripts for poRe are in github, so let’s go get those:

git clone https://github.com/mw55309/poRe_scripts.git

We will assemble using SPAdes, so let’s go get that:

wget http://spades.bioinf.spbau.ru/release3.6.2/SPAdes-3.6.2-Linux.tar.gz
gunzip < SPAdes-3.6.2-Linux.tar.gz | tar xvf -

Now, we are ready to go.  First off, let’s extract the 2D sequence data as FASTQ from the MinION data.  Nick’s SQK-MAP-006 data are in the old FAST5 format so we use the script in “old_format”:

./poRe_scripts/old_format/extract2D MAP006-1/MAP006-1_downloads/pass/ > minion.pass.2D.fastq &

Finally we can use SPAdes to assemble the data.  SPAdes is a swiss-army knife of genome assembly tools, and by default includes read correction.  This takes up lots of RAM, so we are going to skip it.  We will also only use 3 kmers to save time:

./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler 
				   -t 4 -k 21,51,71 
				   -1 SRR2627175_1.fastq.gz 
				   -2 SRR2627175_2.fastq.gz 
				   --nanopore minion.pass.2D.fastq 
				   -o SPAdes_hybrid & 

Use samtools to extract the top contig:

head -n 1 SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta
samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa

Finally, a quick comparison to the reference:

sudo apt-get install mummer 
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=txt" > NC_000913.3.fa
nucmer NC_000913.3.fa single_contig.fa
mummerplot -png out.delta
display out.png &

ecoli_mummer

And that is a single global alignment between reference and query ?

Simple huh?

To leave a comment for the author, please follow the link and comment on their blog: R – Opiniomics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)