Readjoiner: a fast and memory efficient string graph-based sequence assembler

BMC Bioinformatics - Tập 13 - Trang 1-19 - 2012
Giorgio Gonnella1, Stefan Kurtz1
1Center for Bioinformatics, University of Hamburg, Hamburg, Germany

Tóm tắt

Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads. Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only. Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner .

Tài liệu tham khảo

Myers EW: Toward simplifying and accurately formulating fragment assembly. J Comput Biol. 1995, 2: 275-290. Illumina Sequencing - Performance and Specifications for HiSeq 2000. [http://www.illumina.com/systems/hiseq_2000.ilmn] McPherson JD: Next-generation gap. Nature Methods. 2009, 6 (11 Suppl): S2-S5. [http://dx.doi.org/10.1038/nmeth.f.268] Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A. 2001, 98: 9748-9753. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res. 2008, 18 (2): 324-330. [http://dx.doi.org/10.1101/gr.7088808] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19: 1117-1123. Myers EW: The fragment assembly string graph. Bioinformatics. 2005, 21 (Suppl 2): 79-85. Hernandez D, FranÇois P, Farinelli L, Osterås M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18 (5): 802-809. [http://dx.doi.org/10.1101/gr.072033.107] Manber U, Myers G: Suffix Arrays: A New Method for On-Line String Searches. SIAM J Comput. 1993, 22 (5): 935-948. Simpson JT, Durbin R: Efficient construction of an assembly string graph using the FM-index. Bioinformatics. 2010, 26 (12): i367-373. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367] Simpson JT, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Research. 2011, [http://genome.cshlp.org/content/early/2011/12/07/gr.126953.111.abstract] Dinh H, Rajasekaran S: A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly. Bioinformatics. 2011, 27 (14): 1901-1907. [http://dx.doi.org/10.1093/bioinformatics/btr321] Kärkkäinen J, Rantala T: Engineering Radix Sort for Strings. String Processing and Information Retrieval, Volume 5280 of Lecture Notes in Computer Science. Edited by: Amir A, Turpin A, Moffat A. 2009, Springer Berlin/Heidelberg, 3-14. [http://dx.doi.org/10.1007/978-3-540-89097-3_3] Cormen T, Leiserson C, Rivest R: Introduction to Algorithms. 1990, Cambridge, MA: MIT Press Steinbiss S, Kurtz S: A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012, 9 (2): 345-357. Bentley J, McIllroy M: Engineering a Sort Function. Software Pract Ex. 1993, 23 (11): 1249-1265. Abouelhoda M, Kurtz S, Ohlebusch E: Replacing Suffix Trees with Enhanced Suffix Arrays. J Discrete Algorithm. 2004, 2: 53-86. Fredkin E: Trie Memory. Commun ACM. 1960, 3 (9): 490-499. Ferragina P, Grossi R: The string B-tree: a new data structure for string search in external memory and its applications. J ACM. 1999, 46 (2): 236-280. GenomeTools - The versatile open source genome analysis software. [http://genometools.org] Bowden T, Bauer B, Nerin J, Feng S, Seibold S: The/proc filesystem. http://www.kernel.org/doc/Documentation/filesystems/proc.txt2009, Edena: very short reads assembler. [http://www.genomic.ch/edena.php] jts/sga - Github. [https://github.com/jts/sga] Software distribution of LEAP. [http://www.engr.uconn.edu/htd06001/assembler/leap.zip] Barthelson R, McFarlin AJ, Rounsley SD, Young S: Plantagora: Modeling Whole Genome Sequencing and Assembly of Plant Genomes. PLoS ONE. 2011, 6 (12): e28436-[http://dx.doi.org/10.1371/journal.pone.0028436] Kurtz S, Phillippy A, Delcher A, Smoot M, Shumway M, Antonescu C, Salzberg S: Versatile and open software for comparing large genomes. Genome Biology. 2004, 5 (2): R12-[http://genomebiology.com/2004/5/2/R12] Earl DA, Bradnam K, St John J, Darling A, Lin D, Faas J, Yu HOK, Vince B, Zerbino DR, Diekhans M, Nguyen N, Nuwantha P, Sung AWK, Ning Z, Haimel M, Simpson JT, Fronseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelly DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, L Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green R, Haussler D, Korf I, Paten B: Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research. 2011, http://genome.cshlp.org/content/early/2011/09/16/gr.126599.1 11.abstract Gusfield D, Landau GM, Schieber B: An efficient algorithm for the All Pairs Suffix-Prefix Problem. Inf Process Lett. 1992, 41 (4): 181-185. Ohlebusch E, Gog S: Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf Process Lett. 2010, 110 (3): 123-128. Kelley D, Schatz M, Salzberg S: Quake: quality-aware detection and correction of sequencing errors. Genome Biology. 2010, 11: 1-13. [http://dx.doi.org/10.1186/gb-2010-11-11-r116].[10.1186/gb-2010-11-11-r116]].[10.1186/gb-2010-11-11-r116 Pop M, Kosack DS, Salzberg SL: Hierarchical Scaffolding With Bambus. Genome Research. 2004, 14: 149-159. [http://genome.cshlp.org/content/14/1/149.abstract] Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2010, [http://bioinformatics.oxfordjournals.org/content/early/2010/12/12/bioinformatics.btq683.abstract] Donmez N, Brudno M: Hapsembler: An Assembler for Highly Polymorphic Genomes. Research in Computational Molecular Biology, Volume 6577 of Lecture Notes in Computer Science. Edited by: Bafna V, Sahinalp S. 2011, Springer Berlin/Heidelberg, 38-52. Chikhi R, Lavenier D: Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph. Algorithms in Bioinformatics - 11th International Workshop, WABI 2011, Saarbrücken, Germany, September 5-7, 2011. Proceedings. Lecture Notes in Computer Science 6833 Springer. Edited by: Teresa M, Przytycka TM, Marie-France S. 2011, ISBN 978-3-642-23037-0