Prefix-free parsing for building big BWTs

Springer Science and Business Media LLC - Volume 14 - Pages 1-15 - 2019

Christina Boucher¹, Travis Gagie²,³, Alan Kuhnle¹,⁴, Ben Langmead⁵, Giovanni Manzini⁶,⁷, Taher Mun⁵

¹CISE, University of Florida, Gainesville, USA

²EIT, Diego Portales University, Santiago, Chile

³CeBiB, Santiago, Chile

⁴Informatics Institute, Gainesville, USA

⁵Johns Hopkins University, Baltimore, USA

⁶University of Eastern Piedmont, Alessandria, Italy

⁷IIT CNR, Pisa, Italy

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and, in one pass, generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P in O(|T|) time using workspace proportional to their total size. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
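The abstract does not spell out how D and P are produced, so the following Python sketch is only an illustration under stated assumptions. It assumes that phrase boundaries are placed at length-w windows whose hash is 0 modulo p, and that consecutive phrases overlap by w characters. The names `window_hash`, `prefix_free_parse`, and `reconstruct`, the sentinel characters, and the default values of `w` and `p` are all hypothetical. The hash is recomputed per window rather than rolled, for brevity, and the sketch stops at D and P without constructing the BWT.

```python
# Illustrative sketch only: sentinels, hash, and parameters are assumptions.
START = "\x01"  # assumed start sentinel, smaller than any text character
END = "\x00"    # assumed end sentinel, repeated w times at the end


def window_hash(window):
    """Simple polynomial hash of a window (a stand-in for a rolling hash)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text, w=2, p=3):
    """Scan text once; return (sorted dictionary D, parse P as ranks into D)."""
    padded = START + text + END * w
    n = len(text)
    phrases = []
    begin = 0
    # Only windows lying entirely inside the text can trigger a boundary.
    for i in range(1, n - w + 2):
        if window_hash(padded[i:i + w]) % p == 0:
            # The phrase ends with the trigger window; the next one starts with it.
            phrases.append(padded[begin:i + w])
            begin = i
    phrases.append(padded[begin:])  # final phrase ends with the END sentinels
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Rebuild the text from D and P by undoing the w-character overlaps."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:len(out) - w]  # strip the start sentinel and w end sentinels


text = "GATTACAT" * 50  # a highly repetitive toy input
dictionary, parse = prefix_free_parse(text, w=2, p=3)
assert reconstruct(dictionary, parse, w=2) == text
print(len(text), len(dictionary), len(parse))
```

On repetitive input the same phrases recur, so the dictionary stays small while the parse is a short sequence of integers. This is the intuition behind the claim that D and P are significantly smaller than T in practice.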

