Genomics and Bioinformatics: Transcriptome assembly optimization

De novo transcriptome assembly is a challenging task, even though the basic approach to assembling a transcriptome from the reads generated by next-generation sequencing experiments (RNA-seq) is the same as that adopted for de novo genome assembly. In its simplest form, one looks for overlapping reads in order to extend short reads into longer contigs, with the aim of reconstructing the original transcriptome.

Transcriptome assembly nevertheless poses problems of its own. It is made challenging by the alternatively spliced transcripts present in the RNA pool on which the RNA-seq experiment was carried out. Moreover, coverage is not uniform, as is assumed in de novo genome assembly: transcriptome coverage is highly variable and depends on the gene expression level. Finally, as in genome assembly, repeats impede the elongation of contigs.

Assembly algorithms based on the de Bruijn graph approach require the user to supply the k-mer length (the parameter k), which defines the overlap length (consecutive k-mers overlap by k-1 bases). The best value of k depends on the sequencing depth, the read error rate, and the complexity of the genome or transcriptome to be assembled. A higher k-mer length results in a more contiguous assembly of highly expressed transcripts, while a lower k-mer length enables better assembly of poorly expressed transcripts. This suggests that an assembly that takes several k-mer lengths into account is desirable. With a de Bruijn graph based assembler, each assembly uses a fixed k-mer length, so multiple runs with different k values are required. An overlap (string) graph based approach, on the other hand, takes overlaps of various lengths into account and should therefore be able to recover both highly expressed and poorly expressed transcripts in a single run.
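To make the role of k concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads for a given k. It is only an illustration, not the data structure of any particular assembler: the function names (`kmers`, `de_bruijn_edges`, `count_branching_nodes`) and the toy reads are made up for this post, and the sketch ignores reverse complements, sequencing errors and coverage.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it.

    Every k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


if __name__ == "__main__":
    # Hypothetical toy reads, used only to show how the graph changes with k.
    reads = ["ATGGCGTGCA", "GCGTGCAATG", "TGCAATGGCC"]
    for k in (4, 6, 8):
        graph = de_bruijn_edges(reads, k)
        print(k, len(graph), count_branching_nodes(graph))
```

Running the loop with several values of k shows the trade-off discussed above: a small k tends to give more branching nodes, while a large k needs longer exact overlaps between reads, which poorly covered transcripts may not provide.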

Assembly based on multiple k values has recently been carried out by Yann Surget-Groba and Juan I. Montoya-Burgos and is described in their publication Optimization of de novo transcriptome assembly from next-generation sequencing data. They describe an additive multiple-k assembly that takes advantage of the assembling properties of different k-mer lengths by pooling the contigs obtained with each k. The reported performance is much better: the number of contigs >100 bp and their total length are both doubled compared with the single-k Velvet assembly. Interestingly, this marked increase is accompanied by a higher N50 (the length-weighted median contig length), indicating a substantial improvement in contiguity.
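The two quantities mentioned above, the pooled contig set and N50, can be sketched in a few lines. This is a rough illustration under simplifying assumptions, not the authors' pipeline: `pool_contigs` and `n50` are hypothetical helper names, the per-k contigs are assumed to be already loaded as plain strings keyed by k, and redundancy is handled only by dropping exact duplicates.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def pool_contigs(assemblies, min_length=100):
    """Pool contigs from several single-k assemblies.

    `assemblies` maps a k value to a list of contig sequences.
    Contigs of min_length bp or shorter and exact duplicates are dropped
    (a simplification chosen for this sketch).
    """
    seen = set()
    pooled = []
    for k in sorted(assemblies):
        for contig in assemblies[k]:
            if len(contig) > min_length and contig not in seen:
                seen.add(contig)
                pooled.append(contig)
    return pooled


if __name__ == "__main__":
    # Made-up contigs standing in for the output of two single-k runs.
    assemblies = {
        21: ["A" * 150, "C" * 50],
        31: ["A" * 150, "G" * 120],
    }
    pooled = pool_contigs(assemblies)
    print(len(pooled), n50([len(c) for c in pooled]))
```

Comparing the contig count, the total length and `n50` of the pooled set against each single-k assembly gives the kind of comparison described above.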

References:

Yann Surget-Groba and Juan I. Montoya-Burgos, Optimization of de novo transcriptome assembly from next-generation sequencing data.

Friday, December 10, 2010
