Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory

BMC Bioinformatics - Tập 13 - Trang 1-18 - 2012
Mark J Chaisson1, Glenn Tesler2
1Department Secondary Analysis, Pacific Biosciences, Menlo Park, USA
2Department of Mathematics, University of California San Diego, La Jolla, USA

Tóm tắt

Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing. We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective. The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.

Tài liệu tham khảo

Smith T, Waterman M: Identification of Common Molecular Subsequences. J Mol Biol 1981, 147: 194–197. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203–214. 10.1089/10665270050081478 Kent W: BLAT–the BLAST-like alignment tool. Genome Res 2002, 12: 656–664. Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10: R25. 10.1186/gb-2009-10-3-r25 Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25: 1754–1760. 10.1093/bioinformatics/btp324 Li R, Yu C, Li Y, Lam T, Yiu S, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25: 1966–1967. 10.1093/bioinformatics/btp336 Ferragina P, Manzini G: Opportunistic data structures with applications. Proc. of the 41st IEEE Symp on Found of Comput Sci 2000, 390–398. Rasko D, Webster D, Sahl J, Bashir A, Boisen N, Scheutz F, Paxinos E, Sebra R, Chin C, Iliopoulos D, Klammer A, Peluso P, Lee L, Kislyuk A, Bullard J, Kasarskis A, Wang S, Eid J, Rank D, Redman J, Steyert S, Frimodt-Mller J, Struve C, Petersen A, Krogfelt K, Nataro J, Schadt E, Waldor M: Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. New England J Med 2011, 365: 709–717. 10.1056/NEJMoa1106920 Brudno M, Poliakov A, Salamov A, Cooper G, Sidow A, Rubin E, Solovyev V, Batzoglou S, Dubchak I: Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Res 2004, 14: 685–692. 10.1101/gr.2067704 Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D Miller: Human-Mouse Alignments with BLASTZ. Genome Res 2003, 13: 103–107. 10.1101/gr.809403 Kurtz S, Phillippy A, Delcher A, Smoot M, Shumway M, Antonescu C, Salzberg S: Versatile and open software for comparing large genomes. Genome Biol 2004, 5: R122-R129. Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. Lipman D, Pearson W: Rapid and sensitive protein similarity searches. Science 1985, 4693: 1435–1441. Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13: 97–102. 10.1101/gr.789803 Darling A, Mau B, Blatter F, Perna N: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14: 1394–1403. 10.1101/gr.2289704 Brudno M, Do C, Cooper G, Kim M, Davydov E, Green E, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13: 721–731. 10.1101/gr.926603 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Nat Acad Sci 2003, 100: 11484–11489. 10.1073/pnas.1932072100 Li H, Durbin R: Fast and accurate long-read alignment with Burrows Wheeler transform. Bioinformatics 2010, 26: 589–595. 10.1093/bioinformatics/btp698 Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24: 713–714. 10.1093/bioinformatics/btn025 Li H, Ruan J, R D: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18: 1851–1858. 10.1101/gr.078212.108 Rumble S, Lacroute P, Dalca A, Fiume M, Sidow A, et al.: SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol 2009, 5: e1000386. 10.1371/journal.pcbi.1000386 Durbin R, Eddy S, Krobh A, Mitchison G: Biol Sequence Anal: Probabilistic Models of Proteins and Nucleic Acids. The Edinburgh Building, Cambridge, CB2 2RU. United Kingdom: Cambridge University Press; 1998. Needleman S, Wunsch C: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4 Eid J, Fehr A, Korlach J, Turner S, et al.: Real-time DNA sequencing from single polymerase molecules. Science 2009, 323: 133–138. 10.1126/science.1162986 Clarke J, Wu H, Jayasinghe L, Patel A, Reid S, Bayley H: Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 2009, 4: 265–270. 10.1038/nnano.2009.12 Cherf G, Lieberman K, Rashid H, Lam C, Karplus K, Akeson M: Automated forward and reverse ratcheting of DNA in a nanopore at 5-Å precision. Nat Nanotechnol 2012, 14: 344–348. Garaj S, Hubbard W, Reina A, Kong J, Branton D, Golovchenko J: Graphene as a subnanometre trans-electrode membrane. Nature 2010, 467: 190–193. 10.1038/nature09379 Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 17: 3389–3402. Weese D, Emde A, Rausch T, Döring A, Reinert K: RazerS-fast read mapping with sensitivity control. Genome Res 2009, 19: 1646–1654. 10.1101/gr.088823.108 Bao Z, Eddy S: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 2002, 12: 1269–1276. 10.1101/gr.88502 Flusberg B, Webster D, Lee J, Travers K, Olivares E, Clark T, Korlach J, SW T: Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 2010, 7: 461–465. 10.1038/nmeth.1459 Alkan C, Kidd J, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman J, Baker C, Malig M, Mutlu O, Sahinalp S, Gibbs R, Eichler E: Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing. Nat Genet 2009, 41: 1061–1067. 10.1038/ng.437 Myers E, Manber U: Suffix arrays: A new method for on-line string searches. SIAM J Comput 1993, 22: 935–948. 10.1137/0222058 Abouelhoda M, Ohlebusch E: A Local Chaining Algorithm and its Applications in Comparitive Genomics. Lecture Notes in Comput Sci 2812, 2003: 1–16. Eppstein D, Galil Z, Giancarlo R, Italiano G: Sparse Dynamic Programming I: Linear cost functions. J Assoc Comput Machinery 1992, 39: 519–545. 10.1145/146637.146650