Genomics and Bioinformatics: December 2010

De novo transcriptome assembly is a challenging task. Though the primary approach for assembling transcriptome from its reads generated by next generation sequencing experiments (RNA-seq) is same as that adopted for de novo genome assembly. Trivially one would look for overlapping reads in order to extend short reads into longer contigs with the aim of reconstructing the original transcriptome.

The problems posed by transcriptome assembly are unique to its own. Transcriptome assembly is made challenging by the presence of alternatively spliced transcripts that are present in the pool the RNA-seq experiment was carried out on. Moreover, the coverage is not uniform either, as is assumed in de novo genome assembly. Transcriptome coverage is highly variable, depending on the gene expression level. Also, there is a problem of repeats that impedes the elongation of contigs as in genome assembly.

Assembly algorithms based on de Bruijn graph based approach require a user input for the value k-mer length (the parameter k) which defines the overlap length (k-1 overlap). The best value of k depends on the sequencing depth, the read error rate, and the complexity of the genome/transcriptome to be assembled. Using a higher value of k-mer length results in a more contiguous assembly of highly expressed transcripts while a lower value of k-mer length enables better assembly of poorly expressed transcripts. This suggests that a assembly that takes into account various k-mer lengths is desirable. With de bruijn graph based assembler, each assembly is based on a fixed k-mer length so multiple runs with different k-values is required. On the other hand, an overlap string graph based approach which takes into account overlaps of various lengths should be able to recover both highly expressed and poorly expressed transcripts in a single run.

Assembly based on multiple k values has been recently carried out by Yann Surget-Goba and Juan I. Montoya-Burgos and is described in their publication Optimization of de novo transcriptome assembly from next-generation sequencing data. They describe multiple-k based additive assembly to take advantage of assembling properties of different k-mer lengths. Additive multiple-k assembly pools the contigs obtained from different k-mer lengths. The performance is shown to be much better. The number of contigs >100 bp and total length are both doubled as compared to the single-k Velvet assembly. Interestingly, this marked increase

is accompanied by a higher N50 (median length-weighted contig length) indicating a substantial improvement in contiguity.

References:

Yann Surget-Groba and Juan I. Montoya-Burgos, Optimization of de novo transcriptome assembly from next-generation sequencing data.

As discussed in last post, sequences can be assembled from reference data with or without the use of a reference genome. Let us take a look at de novo assembly in detail and talk about approaches to de novo assembly.

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm need to compare every read with every other read (an operation that has a complexity of O(n²) but can be reduced to O(n log(n)). Referring to the comparison drawn to shredded books in the last post: while for mapping assemblies one would have a very similar book as template (perhaps with the names of the main characters and a few locations changed), the de-novo assemblies are more hardcore in a sense as one would not know beforehand whether this would become a science book, or a novel, or a catalogue etc.

Two approaches to de novo assembly found in the literature:-

de bruijn graph based assembly
overlap based string graph assembly

de bruijn graph based assembly

Most popular approach to assembly from short reads generated by next generation sequencing technologies.

Some of the assemblers that use this approach include velvet, abyss, etc. In a de Bruijn graph each node N represents a series of overlapping k-mers. Adjacent k-mers overlap by k-1 nucleotides. The marginal information contained by a k-mer is its last nucleotide. The sequence of those final nucleotides is called the sequence of the node.

Since the reads can come from either strand during sequencing experiments. Each node N also has attached to it its twin node that represents its reverse complement k-mer, hence the overlaps between reads from opposite strand are also taken into account. Two nodes N1 and N2 are connected by a directed edge from N1 to N2 iff the last k-mer of node N1 overlaps with the first k-mer of its destination node N2. Due to symmetry, if an arc goes from N1 to N2, we also have an arc from reverse complement of N2 to that of N1.

The idea behind de Bruijn graph based assembly is breaking down every read into its constituent k-mers. A read of length l would contain l-k+1 k-mers and any two successive k-mers overlap by k-1 nucleotides. Thus, there is an edge between any two successive k-mers of a read. For the construction of the graph the reads are first hashed into k-mers. The value of k is a user input. Its value is limited on the upper side by the length of the reads being hashed. Small values of k increase the connectivity of the graph since when the value of k is small the chance that two k-mers overlap by an amount k-1 increases, but this also increases the number of ambiguous repeats in the graph. The value of k, therefore is central to a balance between sensitivity and specificity.

Once the hashtable is built a database for reads that records which of the original k-mers of the read are overlapped by subsequent reads. The ordered set of original k-mers of the read is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers , a node is created. This procedure results in the construction of de Bruijn graph. The reads are traced through the graph to determine contiguous stretches of nucleotides to form contigs.

But before listing contigs, simplifications are made. Long chains of unambiguous path of nodes are collapsed into single nodes. This is followed by error removal process which can arise both due to sequencing process or are due to the biological sample, e.g. polymorphisms. The removal of the later is the post assembly task. The simplest approach to error removal is to use the difference between the expected coverage of genuine sequences and that of random errors. This removes all low coverage nodes along with their corresponding arcs. Other graph cleanup methods include removal of topological features like tips, bulges and bubbles that occur due to erroneous data.

Overlap based string graph assembly

Least popular approach because of high computational complexity. Requires number of comparisons quadratic in the number of reads i.e. all pairs comparison. Typical data set sizes are in order of millions of reads. Comparing all to all would require an order of million squared comparisons! whereas no such comparison was required in the de Bruijn graph based approach.

The first generation of NGS assembly tools used greedy algorithms. The greedy algorithms apply one basic operation: given any read or contig, add one more read contig with highest scoring for e.g. maximum overlap. Thus the contigs grow by greedy extension that keeps on adding reads to its end found by the highest scoring overlap at each step. Greedy approach can get stuck at a local maxima if the contig at hand takes on reads that would have helped other contigs grow even longer.

Greedy algorithms need mechanisms to avoid incorporating false-positives, a problem common to all assemblers. False positives arise due to the presence of repetitive sequence. Overlaps induced by repetitive sequence may score higher than overlap induced by common position of origin. This can result in chimeric contigs since unrelated sequences may be joined to the either side of the repeat. Paired-end reads are used to resolve ambiguites arising due to repeats.

Overlap based assemblers use an overlap graph. The assembly includes three major steps,

Finding all pairs overlaps through pairwise read comparisons. The seed and extend heuristic is used for efficiency. K-mer content across all reads is precomputed and all those reads that share k-mers are selected as overlap candidates.
Construction and manipulation of an overlap graph leads to an approximate read layout.
Multiple sequence alignment (MSA) can then determines the layout and the consensus sequence. But there is no efficient method to compute optimal MSA, therefore, progressive pairwise alignments are used.

References:

Zerbino, Short read de novo assembly using de Buijn graphs
Miller, Assembly algorithms for next-generation sequencing data

Genomics and Bioinformatics

Friday, December 10, 2010

Transcriptome assembly optimization

Monday, December 6, 2010

Approaches to de novo assembly