Prefix-free parsing for building big BWTs
Tóm tắt
High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
Tài liệu tham khảo
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
Turnbull C, et al. The 100,000 genomes project: bringing whole genome sequencing to the nhs. Br Med J. 2018;361:1687.
Carleton HA, Gerner-Smidt P. Whole-genome sequencing is taking over foodborne disease surveillance. Microbe. 2016;11:311–7.
Stevens EL, Timme R, Brown EW, Allard MW, Strain E, Bunning K, Musser S. The public health impact of a publically available, environmental database of microbial genomes. Front Microbiol. 2017;8:808.
Burrows M, Wheeler DJ. A block-sorting lossless compression algorithm, Technical report. : Digital Equipment Corporation; 1994.
Sirén J. Burrows-Wheeler transform for terabases. In: Proccedings of the 2016 data compression conference (DCC), 2016; p. 211–220.
Ferragina P, Gagie T, Manzini G. Lightweight data indexing and compression in external memory. Algorithmica. 2012;63(3):707–30.
Policriti A, Prezza N. From LZ77 to the run-length encoded burrows-wheeler transform, and back. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), 2017; p. 17–11710.
https://rsync.samba.org. Accessed 10 Apr 2019.
Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans Inf Syst. 2013;31(3):15.
Ferragina P, Manzini G. Indexing compressed text. J ACM (JACM). 2005;52(4):552–81.
Louza FA, Gog S, Telles GP. Inducing enhanced suffix arrays for string collections. Theor Comput Sci. 2017;678:22–39.
http://pizzachili.dcc.uchile.cl/repcorpus.html. Accessed 10 Apr 2019.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):25.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–60. https://doi.org/10.1038/nmeth.1923.
Li H, Durbin R. Fast and accurate long-read alignment with burrows-wheeler transform. Bioinformatics. 2010;26(5):589–95.
Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J. Soap2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25(15):1966–7.
Gagie T, Navarro G, Prezza N. Optimal-time text indexing in bwt-runs bounded space. In: Proceedings of the 29th symposium on discrete algorithms (SODA), 2018. p. 1459–77.
Gog S, Beller T, Moffat A, Petri M. From theory to practice: plug and play with succinct data structures. In: 13th international symposium on experimental algorithms, (SEA 2014), 2014. p. 326–37.
Consortium TGP. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. https://doi.org/10.1038/nature15393 Accessed 2018-09-28.
Narasimhan V, Danecek P, Scally A, Xue Y, Tyler-Smith C, Durbin R. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics. 2016;32(11):1749–51.
MetaSUB International Consortium A. The metagenomics and metadesign of the subways and urban biomes (MetaSUB) international consortium inaugural meeting report. Microbiome. 2016;4(1):24.