"The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. The book may have many repeated paragraphs, and some shreds may be modified to have typos. Excerpts from another book may be added in, and some shreds may be completely unrecognizable." -Wikipedia
As we move towards analyzing the genome of more complex organisms, the assembly programs needed to increasingly employ more and more sophisticated strategies to handle:
- terabytes of sequencing data which need processing on computing clusters;
- identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;
- and errors in the fragments from the sequencing instruments, which can confound assembly.
De-novo vs. mapping assembly
In sequence assembly, two different types can be distinguished:
- de-novo: assembling reads together so that they form a new, previously unknown sequence
- mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence
Both types of assembly methods have their own merits and utilities. Mapping based assembly can be useful when one needs to assemble a genome sequence of a specific organism of which a sequence has previously been sequenced or we have a sequence of a closely related specie. Then the usual approach is to use the sequence from the related specie as a guided reference in order to assemble the sequence for the target organism. One case where this could be useful is to improve the sequence itself and secondly construct a previously unknown sequence of a closely related specie.
It turns out that we may not have a reference sequence available every time. For the model organisms which have been studied in detail we have a reference sequence but for non-model organisms this is not true. Secondly, since we are using a reference sequence to guide the sequence assembly of reads from the genetic material from another organism, it would introduce bias in how the sequence is assembled. Thus, one would probably want to carry out a de novo assembly even if we have a reference sequence available. Moreover, whereas de novo assembly of genome is about reconstructing a single long string of original genome from Next Generation Sequencing reads, transcriptome assembly requires recovering all isoforms found in the collection. Since a biologist may be interested in discovering previously unknown transcripts and other biologically interesting features, de novo assembly is probably the only option.
de novo assembly is hard since there is no reference sequence to guide the layout for reads. Mathematically, de novo assembly has been shown to belong to a class of NP hard problems. NP hard problems are problems that do not have a polynomial time optimal solution.But one can use good heuristics to get close to the optimal solution.