As discussed in last post, sequences can be assembled from reference data with or without the use of a reference genome. Let us take a look at de novo assembly in detail and talk about approaches to de novo assembly.
In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm need to compare every read with every other read (an operation that has a complexity of O(n2) but can be reduced to O(n log(n)). Referring to the comparison drawn to shredded books in the last post: while for mapping assemblies one would have a very similar book as template (perhaps with the names of the main characters and a few locations changed), the de-novo assemblies are more hardcore in a sense as one would not know beforehand whether this would become a science book, or a novel, or a catalogue etc.
Two approaches to de novo assembly found in the literature:-
- de bruijn graph based assembly
- overlap based string graph assembly
de bruijn graph based assembly
Most popular approach to assembly from short reads generated by next generation sequencing technologies.
Some of the assemblers that use this approach include velvet, abyss, etc. In a de Bruijn graph each node N represents a series of overlapping k-mers. Adjacent k-mers overlap by k-1 nucleotides. The marginal information contained by a k-mer is its last nucleotide. The sequence of those final nucleotides is called the sequence of the node.
Since the reads can come from either strand during sequencing experiments. Each node N also has attached to it its twin node that represents its reverse complement k-mer, hence the overlaps between reads from opposite strand are also taken into account. Two nodes N1 and N2 are connected by a directed edge from N1 to N2 iff the last k-mer of node N1 overlaps with the first k-mer of its destination node N2. Due to symmetry, if an arc goes from N1 to N2, we also have an arc from reverse complement of N2 to that of N1.
The idea behind de Bruijn graph based assembly is breaking down every read into its constituent k-mers. A read of length l would contain l-k+1 k-mers and any two successive k-mers overlap by k-1 nucleotides. Thus, there is an edge between any two successive k-mers of a read. For the construction of the graph the reads are first hashed into k-mers. The value of k is a user input. Its value is limited on the upper side by the length of the reads being hashed. Small values of k increase the connectivity of the graph since when the value of k is small the chance that two k-mers overlap by an amount k-1 increases, but this also increases the number of ambiguous repeats in the graph. The value of k, therefore is central to a balance between sensitivity and specificity.
Once the hashtable is built a database for reads that records which of the original k-mers of the read are overlapped by subsequent reads. The ordered set of original k-mers of the read is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers , a node is created. This procedure results in the construction of de Bruijn graph. The reads are traced through the graph to determine contiguous stretches of nucleotides to form contigs.
But before listing contigs, simplifications are made. Long chains of unambiguous path of nodes are collapsed into single nodes. This is followed by error removal process which can arise both due to sequencing process or are due to the biological sample, e.g. polymorphisms. The removal of the later is the post assembly task. The simplest approach to error removal is to use the difference between the expected coverage of genuine sequences and that of random errors. This removes all low coverage nodes along with their corresponding arcs. Other graph cleanup methods include removal of topological features like tips, bulges and bubbles that occur due to erroneous data.
Overlap based string graph assembly
Least popular approach because of high computational complexity. Requires number of comparisons quadratic in the number of reads i.e. all pairs comparison. Typical data set sizes are in order of millions of reads. Comparing all to all would require an order of million squared comparisons! whereas no such comparison was required in the de Bruijn graph based approach.
The first generation of NGS assembly tools used greedy algorithms. The greedy algorithms apply one basic operation: given any read or contig, add one more read contig with highest scoring for e.g. maximum overlap. Thus the contigs grow by greedy extension that keeps on adding reads to its end found by the highest scoring overlap at each step. Greedy approach can get stuck at a local maxima if the contig at hand takes on reads that would have helped other contigs grow even longer.
Greedy algorithms need mechanisms to avoid incorporating false-positives, a problem common to all assemblers. False positives arise due to the presence of repetitive sequence. Overlaps induced by repetitive sequence may score higher than overlap induced by common position of origin. This can result in chimeric contigs since unrelated sequences may be joined to the either side of the repeat. Paired-end reads are used to resolve ambiguites arising due to repeats.
Overlap based assemblers use an overlap graph. The assembly includes three major steps,
- Finding all pairs overlaps through pairwise read comparisons. The seed and extend heuristic is used for efficiency. K-mer content across all reads is precomputed and all those reads that share k-mers are selected as overlap candidates.
- Construction and manipulation of an overlap graph leads to an approximate read layout.
- Multiple sequence alignment (MSA) can then determines the layout and the consensus sequence. But there is no efficient method to compute optimal MSA, therefore, progressive pairwise alignments are used.
References:
- Zerbino, Short read de novo assembly using de Buijn graphs
- Miller, Assembly algorithms for next-generation sequencing data