SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array

Computational and Structural Biotechnology Journal - Tập 20 - Trang 1487-1493 - 2022
Ziyuan Wang1, Junjie Tan2, Yanling Long3, Yijia Liu1, Wenyan Lei1, Jing Cai4, Yi Yang1, Zhibin Liu1
1Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610064, Sichuan, PR China
2Center for Clinical Molecular Medicine, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Pediatrics, Children’s Hospital of Chongqing Medical University, Chongqing 400014, PR China
3College of Computer Science, Sichuan University, Chengdu 610064, Sichuan, PR China
4West China School of Pharmacy, Sichuan University, Chengdu 610041, Sichuan, PR China

Tài liệu tham khảo

Hong, 2021, ENJ algorithm can construct triple phylogenetic trees, Mol Ther Nucleic Acids, 23, 286, 10.1016/j.omtn.2020.11.004 Kolomvatsos, 2019, A distributed, proactive intelligent scheme for securing quality in large scale data processing, Computing, 101, 1687, 10.1007/s00607-018-0683-9 Wooley, 2010, A primer on metagenomics, PLoS Comput Biol, 6, 10.1371/journal.pcbi.1000667 Wooley, 2010, Metagenomics: facts and artifacts, and computational challenges, J Comp Sci Technol, 25, 71, 10.1007/s11390-010-9306-4 Godini, 2019, A brief overview of the concepts, methods and computational tools used in phylogenetic tree construction and gene prediction, Meta Gene, 21, 10.1016/j.mgene.2019.100586 Smith, 2014, Buying in to bioinformatics: An introduction to commercial sequence analysis software, Briefings Bioinf, 16, 700, 10.1093/bib/bbu030 Nakamura, 2018, Parallelization of MAFFT for large-scale multiple sequence alignments, Bioinformatics, 34, 2490, 10.1093/bioinformatics/bty121 Katoh, 2002, MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res, 30, 3059, 10.1093/nar/gkf436 Pal, 2016, Use of FFT in protein sequence comparison under their binary representations, Comput Mol Biosci, 06, 33, 10.4236/cmb.2016.62003 Mirarab, 2014, PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences, J Comput Biol, 22, 377, 10.1089/cmb.2014.0156 Zhan, 2019, ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function, BMC Bioinf, 20, 1, 10.1186/s12859-019-3132-7 Li, 2018, Minimap2: Pairwise alignment for nucleotide sequences, Bioinformatics, 34, 3094, 10.1093/bioinformatics/bty191 Jiang X, Fu X, Dong G, Li H. Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm 2017;141:1041–6. 10.2991/icmmcce-17.2017.187. Lu, 2020, Parallel and distributed architecture of genetic algorithm on Apache Hadoop and Spark, Appl Soft Comp J, 95 Abuín, 2016, SparkBWA: speeding up the alignment of high-throughput DNA sequencing data, PLoS ONE, 11, 10.1371/journal.pone.0155461 Zou, 2015, HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy, Bioinformatics, 31, 2475, 10.1093/bioinformatics/btv177 Wan, 2017, HAlign-II: Efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing, Algorithms Mol Biol, 12, 1, 10.1186/s13015-017-0116-x Song, 2020, Complete mitochondrial genome of Aspergillus japonicus from the built environment and its phylogenetic analysis, Mitochondrial DNA Part B, 5, 1445, 10.1080/23802359.2020.1735972 Merheb, 2019, Mitochondrial DNA, a powerful tool to decipher ancient human civilization from domestication to music, and to uncover historical murder cases, Cells, 8, 10.3390/cells8050433 Abuín, 2020, Big Data in metagenomics: Apache Spark vs MPI, PLoS ONE, 15, 10.1371/journal.pone.0239741 Junier, 2010, The Newick utilities: high-throughput phylogenetic tree processing in the Unix shell, Bioinformatics, 26, 1669, 10.1093/bioinformatics/btq243 Thompson, 1999, BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics, 15, 87, 10.1093/bioinformatics/15.1.87 Carrillo, 1988, The multiple sequence alignment problem in biology, SIAM J Appl Math, 48, 1073, 10.1137/0148063 Darling, 2010, progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement, PLOS ONE, 5, 10.1371/journal.pone.0011147 Efron, 1996, Bootstrap confidence levels for phylogenetic trees, PNAS, 93, 13429, 10.1073/pnas.93.23.13429 Soltis, 2003, Applying the bootstrap in phylogeny reconstruction, Statistical Sci, 18, 256, 10.1214/ss/1063994980 Hill, 2008, Amdahl’s law in the multicore era, Computer, 41, 33, 10.1109/MC.2008.209 Li, 2003, ClustalW-MPI: ClustalW analysis using distributed and parallel computing, Bioinformatics, 19, 1585, 10.1093/bioinformatics/btg192 Niemenmaa, 2012, Hadoop-BAM: directly manipulating next generation sequencing data in the cloud, Bioinformatics, 28, 876, 10.1093/bioinformatics/bts054 Zou Q, Wan S, Zeng X. HPTree: Reconstructing phylogenetic trees for ultra-large unaligned DNA sequences via NJ model and Hadoop. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2016, p. 53–8. 10.1109/BIBM.2016.7822492. Pearson, 1991, Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms, Genomics, 11, 635, 10.1016/0888-7543(91)90071-L Su, 2017, Multiple sequence alignment based on a suffix tree and center-star strategy: a linear method for multiple nucleotide sequence alignment on spark parallel framework, J Comput Biol, 24, 1230, 10.1089/cmb.2017.0040 Saitou, 1987, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol Biol Evol, 4, 406 Edgar, 2004, MUSCLE: A multiple sequence alignment method with reduced time and space complexity, BMC Bioinf, 5, 1, 10.1186/1471-2105-5-113 Zaharia, 2012, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, 15 Xin RS, Gonzalez JE, Franklin MJ, Stoica I. GraphX: A resilient distributed graph system on spark. 1st International Workshop on Graph Data Management Experiences and Systems, GRADES 2013 – Co-Located with SIGMOD/PODS 2013 2013. 10.1145/2484425.2484427. Sun, 2013, A novel algorithm for DNA multiple sequence alignment based on the sliding window and the keyword tree, Int J Biosci, Biochem Bioinform, 3, 271 Na JC, Park H, Lee S, Hong M, Lecroq T, Mouchard L, et al. Suffix array of alignment: A practical index for similar data. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2013;8214 LNCS:243–54. 10.1007/978-3-319-02432-5_27. Bingmann, 2018, Scalable string and suffix sorting, Algorith. Techn. Tools, 1