OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

Oxford University Press (OUP) - Tập 5 - Trang 1-16 - 2016
Davide Verzotto1, Audrey S. M. Teo2, Axel M. Hillmer2, Niranjan Nagarajan1
1Computational and Systems Biology, Genome Institute of Singapore, Singapore, Singapore
2Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, Singapore, Singapore

Tóm tắt

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6−2 times more sensitive) and are more efficient (170−200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.

Tài liệu tham khảo

OpGen Inc. 2015. Available from: http://www.opgen.com. Accessed 9 Dec 2015. BioNano Genomics Inc. 2015. Available from: http://www.bionanogenomics.com. Accessed 9 Dec 2015. Nabsys Inc. 2015. Available from: http://www.nabsys.com. Accessed 9 Dec 2015. Dong Y, Xie M, Jiang Y, Xiao N, Du X, Wang W, et al.Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013; 31:135–41. Ganapathy G, Howard J, Ward J, Zhang G, Phillippy A, Jarvis E, et al.High-coverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014; 3:11. Lam ET, Hastie A, Lin C, Ehrlich D, Nagarajan N, Kwok PY, et al.Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012; 30:771–6. Ray M, Goldstein S, Zhou S, Potamousis K, Sarkar D, Schwartz D, et al.Discovery of structural alterations in solid tumor oligodendroglioma by single molecule analysis. BMC Genomics. 2013; 14:505. Waterman MS, Smith TF, Katcher H. Algorithms for restriction map comparisons. Nucleic Acids Res. 1984; 12:237–42. Mendelowitz L, Pop M. Computational methods for optical mapping. GigaScience. 2014; 3:33. Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008; 24:1229–35. Lin H, Goldstein S, Mendelowitz L, Zhou S, Schwartz D, Pop M, et al.AGORA: assembly guided by optical restriction alignment. BMC Bioinforma. 2012; 13:189. Antoniotti M, Anantharaman T, Paxia S, Mishra B. Genomics via optical mapping IV: sequence validation via optical map matching. New York, NY, USA: New York University; 2001. Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Waterman MS, et al.Alignment of optical maps. J Comput Biol. 2006; 13:442–62. Anantharaman TS, Mishra B. A probabilistic analysis of false positives in optical map alignment and validation. In: First international workshop on algorithms in bioinformatics (WABI). Aarhus, Denmark: Springer: 2001. p. 27–40. Sarkar D, Goldstein S, Schwartz DC, Newton MA. Statistical significance of optical map alignments. J Comput Biol. 2012; 19:478–92. Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping II: ordered restriction maps. J Comput Biol. 1997; 4:91–118. Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping III: contiging genomic DNA. In: Proceedings of the 7th international conference on intelligent systems for molecular biology (ISMB 1999). Heidelberg, Germany: AAAI: 1999. p. 18–27. Karp RM, Rabin MO. Efficient randomized pattern-matching algorithms. IBM J Res Dev. 1987; 31:249–60. Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps. In: Algorithms in Bioinformatics (WABI 2014). vol. 8701 of LNCS. Wrocław, Poland: Springer: 2014. p. 68–81. Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Index-based map-to-sequence alignment in large eukaryotic genomes. In: Proceedings of the fifth RECOMB satellite workshop on massively parallel sequencing (RECOMB-Seq 2015). Warsaw (Poland): Cold Spring Harbor Labs Journals: 2015. p. 1–11. doi:10.1101/017194. Califano A, Rigoutsos I. FLASH: a fast look-up algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB 1993). Bethesda, MD, USA: AAAI: 1993. p. 56–64. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003; 100:9440–5. Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie R, Matsumoto T, et al.Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013; 6:4. Anantharaman TS, Mysore V, Mishra B. Fast and cheap genome wide haplotype construction via optical mapping In: Altman RB, Jung TA, Klein TE, Dunker AK, Hunter L, editors. Proceedings of the 10th Pacific Symposium on Biocomputing (PSB 2005). Hawaii, USA: World Scientific: 2005. p. 1–16. Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Single-molecule optical genome mapping of a human HapMap and a colorectal cancer cell line. GigaScience. 2015; 4:65. doi:10.1186/s13742-015-0106-1. Yao F, Ariyaratne PN, Hillmer AM, Liu ET, Ruan Y. Long span DNA Paired-End-Tag (DNA-PET) sequencing strategy for the interrogation of genomic structural mutations and fusion-point-guided reconstruction of amplicons. PLOS ONE. 2012; e46152:7. Li R, Zhu H, Ruan J, Wang J, Fang X, Wang J, et al.De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010; 20:265–72. Gao S, Sung WK, Nagarajan N. Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. J Comput Biol. 2011; 18:1681–91. Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Supporting software for OPTIMA, a tool for sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis; 2015. GigaScience Database. doi:10.5524/100165. Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Supporting single-molecule optical genome mapping data from human HapMap and colorectal cancer cell lines; 2015. GigaScience Database. doi:10.5524/100182.