- Yann Surget-Groba and Juan I. Montoya-Burgos, Optimization of de novo transcriptome assembly from next-generation sequencing data.
Genomics and Bioinformatics
Friday, December 10, 2010
Transcriptome assembly optimization
Monday, December 6, 2010
Approaches to de novo assembly
- de bruijn graph based assembly
- overlap based string graph assembly
- Finding all pairs overlaps through pairwise read comparisons. The seed and extend heuristic is used for efficiency. K-mer content across all reads is precomputed and all those reads that share k-mers are selected as overlap candidates.
- Construction and manipulation of an overlap graph leads to an approximate read layout.
- Multiple sequence alignment (MSA) can then determines the layout and the consensus sequence. But there is no efficient method to compute optimal MSA, therefore, progressive pairwise alignments are used.
- Zerbino, Short read de novo assembly using de Buijn graphs
- Miller, Assembly algorithms for next-generation sequencing data
Wednesday, November 24, 2010
Sequence Assembly
"The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. The book may have many repeated paragraphs, and some shreds may be modified to have typos. Excerpts from another book may be added in, and some shreds may be completely unrecognizable." -Wikipedia
As we move towards analyzing the genome of more complex organisms, the assembly programs needed to increasingly employ more and more sophisticated strategies to handle:
- terabytes of sequencing data which need processing on computing clusters;
- identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;
- and errors in the fragments from the sequencing instruments, which can confound assembly.
De-novo vs. mapping assembly
In sequence assembly, two different types can be distinguished:
- de-novo: assembling reads together so that they form a new, previously unknown sequence
- mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence
Both types of assembly methods have their own merits and utilities. Mapping based assembly can be useful when one needs to assemble a genome sequence of a specific organism of which a sequence has previously been sequenced or we have a sequence of a closely related specie. Then the usual approach is to use the sequence from the related specie as a guided reference in order to assemble the sequence for the target organism. One case where this could be useful is to improve the sequence itself and secondly construct a previously unknown sequence of a closely related specie.
It turns out that we may not have a reference sequence available every time. For the model organisms which have been studied in detail we have a reference sequence but for non-model organisms this is not true. Secondly, since we are using a reference sequence to guide the sequence assembly of reads from the genetic material from another organism, it would introduce bias in how the sequence is assembled. Thus, one would probably want to carry out a de novo assembly even if we have a reference sequence available. Moreover, whereas de novo assembly of genome is about reconstructing a single long string of original genome from Next Generation Sequencing reads, transcriptome assembly requires recovering all isoforms found in the collection. Since a biologist may be interested in discovering previously unknown transcripts and other biologically interesting features, de novo assembly is probably the only option.
de novo assembly is hard since there is no reference sequence to guide the layout for reads. Mathematically, de novo assembly has been shown to belong to a class of NP hard problems. NP hard problems are problems that do not have a polynomial time optimal solution.But one can use good heuristics to get close to the optimal solution.
Monday, August 30, 2010
Sequence alignment
- To look back billions of years ago.
- To test the evolutionary hypothesis, "whether sequences share a common ancestor"
- Structure modeling (Homology modeling): Identify one more known protein structures likely to resemble the structure of the query sequence through comparative analysis.
Sequences that are usually aligned
- Nucleic acid sequences/DNA [4 bases - A,C,G,T]
- Protein amino acid sequences [20 Amino acids]
Techniques used to align them
- Identity - dot plots
- Scoring matrices: This also has applications in,
- Models for evolution i.e. How sequences evolved?
- Dynamic Programming algorithms
- Evaluating the significance of pairwise alignment (?)
Scoring or Comparison Matrices
In order to align the sequences the algorithm has to know what it is looking for and how it can evaluate the worth of what it finds using 'comparison matrices'. Comparison matrices define a score for every possible match possibility - a measure to quantify how well the computational alignment is doing in the context of 'sequence alignment'.
Changes can occur in the sequence by way of:
- Substitution
- Insertion
- Deletions
- Transposition - Due to DNA breakage and re-joining
- Inversion
- Recombination
- Duplications
We can always determine an optimal alignment between any two sequences. The importance lies in determining the significance of this alignment. And, for that we need to look at the structural similarities. e.g. homology modeling in protein.
Two important types of alignment
- Local - Alignment between s1 and s2 (end to end)
- Global
- Alignment between substring of s1 and s2
- Looks for local regions of similarity
- Needleman Wunch algorithm (guarantees optimal alignment)
Algorithms for both local and global alignments use Dynamic Programming techniques. Dynamic programming techniques divides the problem into identical smaller sub-problems. The best solution to one sub problem is independent from the best solution to the other sub-problem.
Selecting scoring matrices and parameters
Selecting scoring matrices and parameters is one problem
Adjust scoring for gaps
- Penalty for gaps (insertions and deletions)
- There is no theory for penalizing gaps
- Based on trial and error.
- Different scoring matrices differ in gap penalty as well.
Evaluating statistical significance
Based on Homology modeling?
Dot Plots
- Diagonals easily allow to visualize the regions of similarity.
- Visually conveys a lot of information. e.g. regions of identity, duplication
- Really high noise in case of nucleotides. just 4 bases
- To reduce the noise: Look for identity over a window (window size and minimum score in that window).
Transitions Vs Transversions
Do all nucleotide changes occur with equal frequency? There are twice as many transversions than transitions. (see diagram)

Figure 1. Transition Vs Transversion mutations (source: Steve M. Carr)
Frequency of transitions is much higher than that of transitions in nature. Thus, transitions and transversions are penalized differently.
Are all positions equally likely to change?
Including indels
In this case we don't know whether there was a transversion or transition.
Terminal gaps are treated differently from internal gaps.
Note: Optimal global and local alignments can be computed in O(|s1|.|s2|) run-time and O(|s1| + |s2|) space complexity.
Pairwise Sequence alignment
- Optimal, polynomial time algorithms known.
Multiple Sequence alignment
- NP hard. Approximate algorithms based on pairwise sequence alignment
References and useful links
1. Dynamic programming and sequence alignment
2. Pairwise sequence alignment - Its all about us!
3. Homology modeling
4. Class notes and lectures. - Prof. Jung Choi, BIOL 8803