libgapmis: extending short-read alignments

BMC Bioinformatics - Tập 14 - Trang 1-14 - 2013
Nikolaos Alachiotis1, Simon Berger1, Tomáš Flouri1, Solon P Pissis1, Alexandros Stamatakis1
1Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

Tóm tắt

A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggest that aligning (parts of) those reads with a single gap is in fact desirable. In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment. We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis .

Tài liệu tham khảo

Levenshtein VI: Binary codes capable of correcting deletions, insertions, and reversals. Tech Rep 8. 1966, Soviet Physics Doklady Wagner RA, Fischer MJ: The String-to-String Correction Problem. Journal of the ACM. 1974, 21: 168-173. 10.1145/321796.321811. Sellers PH: On the Theory and Computation of Evolutionary Distances. SIAM Journal on Applied Mathematics. 1974, 26 (4): 787-793. 10.1137/0126070. Heckel P: A technique for isolating differences between files. Communications of the ACM. 1978, 21 (4): 264-268. 10.1145/359460.359467. Peterson JL: Computer programs for detecting and correcting spelling errors. Communications of the ACM. 1980, 23 (12): 676-687. 10.1145/359038.359041. Cambouropoulos E, Crochemore M, Iliopoulos CS, Mouchard L, Pinzon YJ: Algorithms for Computing Approximate Repetitions in Musical Sequences. International Journal of Computational Mathematics. 2000, 79 (11): 1135-1148. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: GapMis: a tool for pairwise sequence alignment with a single gap. Recent Pat DNA Gene Seq. 2013, 7: 84-95. 10.2174/1872215611307020002. Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. 1997, USA: Cambridge University Press Simpson MA, Irving MD, Asilmaz E, Gray MJ, Dafou D, Elmslie FV, Mansour S, Holder SE, Brain CE, Burton BK, Kim KH, Pauli RM, Aftimos S, Stewart H, Kim CA, Holder-Espinasse M, Robertson SP, Drake WM, Trembath RC: Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nature Genetics. 2011, 43 (4): 303-305. 10.1038/ng.779. Balasubramanian S, Klenerman D, Barnes C, Osborne M: 2007, Patent US20077232656 Ju J, Li Z, Edwards J, Itagaki Y: 2007, Patent EP1790736 Rothberg J, Bader J, Dewell S, McDade K, Simpson J, Berka J, Colangelo C: Founding patent of 454 Life Sciences. 2007, Patent US20077211390 Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009, 10 (3): R25+-10.1186/gb-2009-10-3-r25. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (16): 1966-1967. Frousios K, Iliopoulos CS, Mouchard L, Pissis SP, Tischler G: REAL: an efficient REad ALigner for next generation sequencing reads. Proceedings of the first ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). Edited by: Zhang A, Borodovsky M, Özsoyoglu G, Mikler AR. 2010, USA: ACM, 154-159. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4. Waterman MS, Smith TF: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5. Alachiotis N, Berger S, Flouri T, Pissis SP, Stamatakis A: libgapmis: an ultrafast library for short-read single-gap alignment. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 688-695. 10.1109/BIBMW.2012.6470221. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250. Ostergaard P, Simpson MA, Brice G, Mansour S, Connell FC, Onoufriadis A, Child AH, Hwang J, Kalidas K, Mortimer PS, Trembath R, Jeffery S: Rapid identification of mutations in GJC2 in primary lymphoedema using whole exome sequencing combined with linkage analysis with delineation of the phenotype. J Med Genet. 2011, 48 (4): 251-255. 10.1136/jmg.2010.085563. Minosche AE, Dohm JC, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biology. 2011, 12: R112-10.1186/gb-2011-12-11-r112. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: Approximate string-matching with a single gap for sequence alignment. Proceedings of the second ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). Edited by: ACM. 2011, USA: ACM, 490-492. Crochemore M, Hancart C, Lecroq T: Algorithms on Strings. 2007, USA: Cambridge University Press Na JC, Roh K, Apostolico A, Park K: Alignment of biological sequences with quality scores. International Journal of Bioinformatics Research and Applications. 2009, 5: 97-113. 10.1504/IJBRA.2009.022466. National Center for Biotechnology Information (NCBI): 2013, [ftp://ftp.ncbi.nih.gov/blast/matrices/NUC.4.4] National Center for Biotechnology Information (NCBI): 2013, [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62] Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2. Alachiotis N, Berger S, Stamatakis A: Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics. 2012, 13: 196-10.1186/1471-2105-13-196. Rognes T: Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics. 2011, 12: 221-10.1186/1471-2105-12-221. National Center for Biotechnology Information (NCBI): [http://www.ncbi.nlm.nih.gov/]