In sequencing circles the N50 length is a useful heuristic for judging the quality of an assembly. Here is my definition of N50 length, which you may or may not find intuitive:
N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs
For example’s sake imagine an assembler has created contigs of the following length (in descending order):
91 77 70 69 62 56 45 29 16 4
The sum of these is 519bp, so the sum of all contigs equal to or greater than N50 must be equal to or greater than 519/2 or 259.5
We can see by brute force that
91+77+70+69=307 (that’ll do)
so the N50 for this assembly is 69bp
Another way to look at this:
at least half the nucleotides in this assembly belong to contigs of size 69bp or longer.
N50 vs N50 length
Technically N50, as opposed to N50 length, refers to the ordinal of that last contig that pushes it over the brink – in this example 4 (since 69bp is the 4th largest contig). Unfortunately, a higher N50 implies the opposite of a longer N50 length. Some papers refer to N50 length as L50, while most have simply followed the lazy convention of dropping “length” off of “N50 length”. I think it is important to include units with your N50 to minimize confusion.
Contig N50 vs Scaffold N50
Another distinction is often made between contig N50 and scaffold N50. Contigs are “contiguous segments”, while scaffolds (aka supercontigs) consist of contigs separated by gaps. Scaffolds are constructed using paired-end information at the read level and, in major sequencing projects, paired BAC ends. Because the scaffolds sequences are filled with varying quantities of empty N’s, the scaffold N50 should not solely be used as a comparative score of assembly quality.
Velvet, when used with sane expCov settings, is very conservative with regard to scaffolding – so much that the contigs.fa N50 can be virtually considered a contig N50, as opposed to a scaffold N50. More aggressive programs, such as SOAPdenovo, produce separate contig and scaffold files.
Velvet N50 from stats.txt
Velvet is a popular assembler for short sequences that uses DeBruijn graphs and Eulerian graph theory instead of a repetitive align-consensus-align approach. Although it returns an N50 in the course of assembling, I wanted to derive it from the contigs themselves. These contigs are summarized in a table called stats.txt
Using R and its cumulative summation function we can easily compute N50.
Strategy: Order the contigs by decreasing size and find the first value for which the cumulative summation is at least half the total sum.
Thanks to Barry Rowlingson for providing this solution
Beware the kmer
Note: The length in the stats.txt file is given as length=lgth+kmer-1, where kmer is the kmer value chosen for that assembly. The N50 length given in the Log file also appears to be in kmers. You cannot convert an N50 in kmers to bp by adding kmer-1. The math doesn’t work like that – you need to convert each contig to bp before recalculating N50.
Finally, you can calculate N50 from sequences in the contigs.fa, but this file only contains contigs longer than 2-kmers by default. The contigs.fa bp-N50 will sometimes approximate the kmer-N50 in the Log file, but that is not a rule to depend on.