Genomics and Bioinformatics

Friday, December 10, 2010

Transcriptome assembly optimization

De novo transcriptome assembly is a challenging task. Though the primary approach for assembling transcriptome from its reads generated by next generation sequencing experiments (RNA-seq) is same as that adopted for de novo genome assembly. Trivially one would look for overlapping reads in order to extend short reads into longer contigs with the aim of reconstructing the original transcriptome.

The problems posed by transcriptome assembly are unique to its own. Transcriptome assembly is made challenging by the presence of alternatively spliced transcripts that are present in the pool the RNA-seq experiment was carried out on. Moreover, the coverage is not uniform either, as is assumed in de novo genome assembly. Transcriptome coverage is highly variable, depending on the gene expression level. Also, there is a problem of repeats that impedes the elongation of contigs as in genome assembly.

Assembly algorithms based on de Bruijn graph based approach require a user input for the value k-mer length (the parameter k) which defines the overlap length (k-1 overlap). The best value of k depends on the sequencing depth, the read error rate, and the complexity of the genome/transcriptome to be assembled. Using a higher value of k-mer length results in a more contiguous assembly of highly expressed transcripts while a lower value of k-mer length enables better assembly of poorly expressed transcripts. This suggests that a assembly that takes into account various k-mer lengths is desirable. With de bruijn graph based assembler, each assembly is based on a fixed k-mer length so multiple runs with different k-values is required. On the other hand, an overlap string graph based approach which takes into account overlaps of various lengths should be able to recover both highly expressed and poorly expressed transcripts in a single run.

Assembly based on multiple k values has been recently carried out by Yann Surget-Goba and Juan I. Montoya-Burgos and is described in their publication Optimization of de novo transcriptome assembly from next-generation sequencing data. They describe multiple-k based additive assembly to take advantage of assembling properties of different k-mer lengths. Additive multiple-k assembly pools the contigs obtained from different k-mer lengths. The performance is shown to be much better. The number of contigs >100 bp and total length are both doubled as compared to the single-k Velvet assembly. Interestingly, this marked increase

is accompanied by a higher N50 (median length-weighted contig length) indicating a substantial improvement in contiguity.

References:

Yann Surget-Groba and Juan I. Montoya-Burgos, Optimization of de novo transcriptome assembly from next-generation sequencing data.

Monday, December 6, 2010

Approaches to de novo assembly

As discussed in last post, sequences can be assembled from reference data with or without the use of a reference genome. Let us take a look at de novo assembly in detail and talk about approaches to de novo assembly.

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm need to compare every read with every other read (an operation that has a complexity of O(n²) but can be reduced to O(n log(n)). Referring to the comparison drawn to shredded books in the last post: while for mapping assemblies one would have a very similar book as template (perhaps with the names of the main characters and a few locations changed), the de-novo assemblies are more hardcore in a sense as one would not know beforehand whether this would become a science book, or a novel, or a catalogue etc.

Two approaches to de novo assembly found in the literature:-

de bruijn graph based assembly
overlap based string graph assembly

de bruijn graph based assembly

Most popular approach to assembly from short reads generated by next generation sequencing technologies.

Some of the assemblers that use this approach include velvet, abyss, etc. In a de Bruijn graph each node N represents a series of overlapping k-mers. Adjacent k-mers overlap by k-1 nucleotides. The marginal information contained by a k-mer is its last nucleotide. The sequence of those final nucleotides is called the sequence of the node.

Since the reads can come from either strand during sequencing experiments. Each node N also has attached to it its twin node that represents its reverse complement k-mer, hence the overlaps between reads from opposite strand are also taken into account. Two nodes N1 and N2 are connected by a directed edge from N1 to N2 iff the last k-mer of node N1 overlaps with the first k-mer of its destination node N2. Due to symmetry, if an arc goes from N1 to N2, we also have an arc from reverse complement of N2 to that of N1.

The idea behind de Bruijn graph based assembly is breaking down every read into its constituent k-mers. A read of length l would contain l-k+1 k-mers and any two successive k-mers overlap by k-1 nucleotides. Thus, there is an edge between any two successive k-mers of a read. For the construction of the graph the reads are first hashed into k-mers. The value of k is a user input. Its value is limited on the upper side by the length of the reads being hashed. Small values of k increase the connectivity of the graph since when the value of k is small the chance that two k-mers overlap by an amount k-1 increases, but this also increases the number of ambiguous repeats in the graph. The value of k, therefore is central to a balance between sensitivity and specificity.

Once the hashtable is built a database for reads that records which of the original k-mers of the read are overlapped by subsequent reads. The ordered set of original k-mers of the read is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers , a node is created. This procedure results in the construction of de Bruijn graph. The reads are traced through the graph to determine contiguous stretches of nucleotides to form contigs.

But before listing contigs, simplifications are made. Long chains of unambiguous path of nodes are collapsed into single nodes. This is followed by error removal process which can arise both due to sequencing process or are due to the biological sample, e.g. polymorphisms. The removal of the later is the post assembly task. The simplest approach to error removal is to use the difference between the expected coverage of genuine sequences and that of random errors. This removes all low coverage nodes along with their corresponding arcs. Other graph cleanup methods include removal of topological features like tips, bulges and bubbles that occur due to erroneous data.

Overlap based string graph assembly

Least popular approach because of high computational complexity. Requires number of comparisons quadratic in the number of reads i.e. all pairs comparison. Typical data set sizes are in order of millions of reads. Comparing all to all would require an order of million squared comparisons! whereas no such comparison was required in the de Bruijn graph based approach.

The first generation of NGS assembly tools used greedy algorithms. The greedy algorithms apply one basic operation: given any read or contig, add one more read contig with highest scoring for e.g. maximum overlap. Thus the contigs grow by greedy extension that keeps on adding reads to its end found by the highest scoring overlap at each step. Greedy approach can get stuck at a local maxima if the contig at hand takes on reads that would have helped other contigs grow even longer.

Greedy algorithms need mechanisms to avoid incorporating false-positives, a problem common to all assemblers. False positives arise due to the presence of repetitive sequence. Overlaps induced by repetitive sequence may score higher than overlap induced by common position of origin. This can result in chimeric contigs since unrelated sequences may be joined to the either side of the repeat. Paired-end reads are used to resolve ambiguites arising due to repeats.

Overlap based assemblers use an overlap graph. The assembly includes three major steps,

Finding all pairs overlaps through pairwise read comparisons. The seed and extend heuristic is used for efficiency. K-mer content across all reads is precomputed and all those reads that share k-mers are selected as overlap candidates.
Construction and manipulation of an overlap graph leads to an approximate read layout.
Multiple sequence alignment (MSA) can then determines the layout and the consensus sequence. But there is no efficient method to compute optimal MSA, therefore, progressive pairwise alignments are used.

References:

Zerbino, Short read de novo assembly using de Buijn graphs
Miller, Assembly algorithms for next-generation sequencing data

Wednesday, November 24, 2010

Sequence Assembly

Sequence assembly refers to aligning and merging of shorter fragments generated by experiments such as ones conducted using Next Generation Sequencing technologies in order to reconstruct the original sequence.

This is needed because no current technology can read the whole genome in one go but rather small fragments thereof. These small fragments called reads are in the range of 35-200 bp long depending on the technology used.

"The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. The book may have many repeated paragraphs, and some shreds may be modified to have typos. Excerpts from another book may be added in, and some shreds may be completely unrecognizable." -Wikipedia

As we move towards analyzing the genome of more complex organisms, the assembly programs needed to increasingly employ more and more sophisticated strategies to handle:

terabytes of sequencing data which need processing on computing clusters;
identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;
and errors in the fragments from the sequencing instruments, which can confound assembly.

De-novo vs. mapping assembly

In sequence assembly, two different types can be distinguished:

de-novo: assembling reads together so that they form a new, previously unknown sequence
mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence

Both types of assembly methods have their own merits and utilities. Mapping based assembly can be useful when one needs to assemble a genome sequence of a specific organism of which a sequence has previously been sequenced or we have a sequence of a closely related specie. Then the usual approach is to use the sequence from the related specie as a guided reference in order to assemble the sequence for the target organism. One case where this could be useful is to improve the sequence itself and secondly construct a previously unknown sequence of a closely related specie.

It turns out that we may not have a reference sequence available every time. For the model organisms which have been studied in detail we have a reference sequence but for non-model organisms this is not true. Secondly, since we are using a reference sequence to guide the sequence assembly of reads from the genetic material from another organism, it would introduce bias in how the sequence is assembled. Thus, one would probably want to carry out a de novo assembly even if we have a reference sequence available. Moreover, whereas de novo assembly of genome is about reconstructing a single long string of original genome from Next Generation Sequencing reads, transcriptome assembly requires recovering all isoforms found in the collection. Since a biologist may be interested in discovering previously unknown transcripts and other biologically interesting features, de novo assembly is probably the only option.

de novo assembly is hard since there is no reference sequence to guide the layout for reads. Mathematically, de novo assembly has been shown to belong to a class of NP hard problems. NP hard problems are problems that do not have a polynomial time optimal solution.But one can use good heuristics to get close to the optimal solution.

Reference: http://en.wikipedia.org/wiki/Sequence_assembly

Monday, August 30, 2010

Sequence alignment

Pairwise sequence alignment is one of the most fundamental tools of bio-informatics. It allows us
- To look back billions of years ago.
- To test the evolutionary hypothesis, "whether sequences share a common ancestor"
- Structure modeling (Homology modeling): Identify one more known protein structures likely to resemble the structure of the query sequence through comparative analysis.

Sequences that are usually aligned
- Nucleic acid sequences/DNA [4 bases - A,C,G,T]
- Protein amino acid sequences [20 Amino acids]

Techniques used to align them
- Identity - dot plots
- Scoring matrices: This also has applications in,
- Models for evolution i.e. How sequences evolved?
- Dynamic Programming algorithms
- Evaluating the significance of pairwise alignment (?)

Scoring or Comparison Matrices
In order to align the sequences the algorithm has to know what it is looking for and how it can evaluate the worth of what it finds using 'comparison matrices'. Comparison matrices define a score for every possible match possibility - a measure to quantify how well the computational alignment is doing in the context of 'sequence alignment'.

Changes can occur in the sequence by way of:
- Substitution
- Insertion
- Deletions
- Transposition - Due to DNA breakage and re-joining
- Inversion
- Recombination
- Duplications

We can always determine an optimal alignment between any two sequences. The importance lies in determining the significance of this alignment. And, for that we need to look at the structural similarities. e.g. homology modeling in protein.

Two important types of alignment
- Local - Alignment between s1 and s2 (end to end)
- Global
- Alignment between substring of s1 and s2
- Looks for local regions of similarity
- Needleman Wunch algorithm (guarantees optimal alignment)

Algorithms for both local and global alignments use Dynamic Programming techniques. Dynamic programming techniques divides the problem into identical smaller sub-problems. The best solution to one sub problem is independent from the best solution to the other sub-problem.

Selecting scoring matrices and parameters
Selecting scoring matrices and parameters is one problem

Adjust scoring for gaps
- Penalty for gaps (insertions and deletions)
- There is no theory for penalizing gaps
- Based on trial and error.
- Different scoring matrices differ in gap penalty as well.

Evaluating statistical significance
Based on Homology modeling?

Dot Plots
- Diagonals easily allow to visualize the regions of similarity.
- Visually conveys a lot of information. e.g. regions of identity, duplication
- Really high noise in case of nucleotides. just 4 bases
- To reduce the noise: Look for identity over a window (window size and minimum score in that window).

Transitions Vs Transversions
Do all nucleotide changes occur with equal frequency? There are twice as many transversions than transitions. (see diagram)

Figure 1. Transition Vs Transversion mutations (source: Steve M. Carr)
Frequency of transitions is much higher than that of transitions in nature. Thus, transitions and transversions are penalized differently.

Are all positions equally likely to change?

Including indels
In this case we don't know whether there was a transversion or transition.
Terminal gaps are treated differently from internal gaps.

Note: Optimal global and local alignments can be computed in O(|s1|.|s2|) run-time and O(|s1| + |s2|) space complexity.

Pairwise Sequence alignment
- Optimal, polynomial time algorithms known.

Multiple Sequence alignment
- NP hard. Approximate algorithms based on pairwise sequence alignment

References and useful links
1. Dynamic programming and sequence alignment
2. Pairwise sequence alignment - Its all about us!
3. Homology modeling
4. Class notes and lectures. - Prof. Jung Choi, BIOL 8803

Friday, August 27, 2010

An introduction to Genomics

http://blog.ted.com/2009/01/22/an_introduction/