ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

PLoS Computational Biology - Volume 13, Issue 4 - Page e1005518
Yunpeng Cai1, Wei Xing Zheng2, Jin Yao3, Yujie Yang1, Volker Mai4, Qi Mao3, Yijun Sun5,2,3
1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
2Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America
3Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
4Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
5Department of Biostatistics, The State University of New York at Buffalo, Buffalo, New York, United States of America

Abstract

Keywords


References

A Sboner, 2011, The real cost of sequencing: higher than you think!, Genome Biology, 12, 125, 10.1186/gb-2011-12-8-125

N Beerenwinkel, 2011, Ultra-deep sequencing for the analysis of viral populations, Current Opinion in Virology, 1, 413, 10.1016/j.coviro.2011.07.008

ML Sogin, 2006, Microbial diversity in the deep sea and the underexplored “rare biosphere”, Proceedings of the National Academy of Sciences, 103, 12115, 10.1073/pnas.0605127103

HE O’Brien, 2005, Fungal community analysis by large-scale sequencing of environmental samples, Applied and Environmental Microbiology, 71, 5544, 10.1128/AEM.71.9.5544-5550.2005

P López-García, 2001, Unexpected diversity of small eukaryotes in deep-sea Antarctic plankton, Nature, 409, 603, 10.1038/35054537

Z Kan, 2010, Diverse somatic mutation patterns and pathway alterations in human cancers, Nature, 466, 869, 10.1038/nature09208

SD Boyd, 2009, Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing, Science Translational Medicine, 1, 12ra23

2013, Your Microbes, Your Health, Science, 342, 1440, 10.1126/science.342.6165.1440-b

JM Di Bella, 2013, High throughput sequencing methods and analysis for microbiome research, Journal of Microbiological Methods, 95, 401, 10.1016/j.mimet.2013.08.011

SS Mande, 2012, Classification of metagenomic sequences: methods and challenges, Briefings in Bioinformatics, 13, 669, 10.1093/bib/bbs054

J Dröge, 2012, Taxonomic binning of metagenome samples generated by next-generation sequencing technologies, Briefings in Bioinformatics, 13, 646, 10.1093/bib/bbs031

W Li, 2006, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22, 1658, 10.1093/bioinformatics/btl158

RC Edgar, 2010, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 26, 2460, 10.1093/bioinformatics/btq461

Y Sun, 2011, A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis, Briefings in Bioinformatics, 13, 107, 10.1093/bib/bbr009

W Chen, 2013, MSClust: A multi-seeds based clustering algorithm for microbiome profiling using 16S rRNA sequences, Journal of Microbiological Methods, 94, 347, 10.1016/j.mimet.2013.07.004

MJ Bonder, 2012, Comparing clustering and pre-processing in taxonomy analysis, Bioinformatics, 28, 2891, 10.1093/bioinformatics/bts552

J Peterson, 2009, The NIH Human Microbiome Project, Genome Research, 19, 2317, 10.1101/gr.096651.109

Y Cai, 2011, ESPRIT-Tree: Hierarchical clustering analysis of millions of 16S rRNA Pyrosequences in quasilinear computational time, Nucleic Acids Research, 39, e95, 10.1093/nar/gkr349

X Wang, 2012, Secondary structure information does not improve OTU assignment for partial 16S rRNA sequences, The ISME Journal, 6, 1277, 10.1038/ismej.2011.187

J Barriuso, 2011, Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows, BMC Bioinformatics, 12, 473, 10.1186/1471-2105-12-473

CF Olson, 1995, Parallel algorithms for hierarchical clustering, Parallel Computing, 21, 1313, 10.1016/0167-8191(95)00017-I

M Dash, 2004, Euro-Par 2004 Parallel Processing, 363

Z Feng, 2007, A parallel hierarchical clustering algorithm for PCs cluster system, Neurocomputing, 70, 809, 10.1016/j.neucom.2006.10.034

JFM Rodrigues, 2014, HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences, Bioinformatics, 30, 287, 10.1093/bioinformatics/btt657

Y Sun, 2009, ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences, Nucleic Acids Research, 37, e76, 10.1093/nar/gkp285

TD Nguyen, 2015, Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12, 1060, 10.1109/TCBB.2015.2407574

Q Mao, W Zheng, L Wang, Y Cai, V Mai, Y Sun, 2015, Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis, 2015 IEEE International Conference on Data Mining, 310–319

RC Edgar, 2004, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, 32, 1792, 10.1093/nar/gkh340

K Katoh, 2002, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research, 30, 3059, 10.1093/nar/gkf436

MN Price, 2010, FastTree 2–approximately maximum-likelihood trees for large alignments, PLoS ONE, 5, e9490, 10.1371/journal.pone.0009490

K Howe, 2002, QuickTree: building huge Neighbour-Joining trees of protein sequences, Bioinformatics, 18, 1546, 10.1093/bioinformatics/18.11.1546

MJ Quinn, 2004, Parallel Programming in C with MPI and OpenMP

S Skiena, 2008, The Algorithm Design Manual, 10.1007/978-1-84800-070-4

RC Edgar, 2011, UCHIME improves sensitivity and speed of chimera detection, Bioinformatics, 27, 2194, 10.1093/bioinformatics/btr381

RC Edgar, 2013, UPARSE: Highly accurate OTU sequences from microbial amplicon reads, Nature Methods, 10, 996, 10.1038/nmeth.2604

PJ Turnbaugh, 2008, A core gut microbiome in obese and lean twins, Nature, 457, 480, 10.1038/nature07540

J Ye, 2006, BLAST: improvements for better sequence analysis, Nucleic Acids Research, 34, W6, 10.1093/nar/gkl164

JR Cole, 2005, The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis, Nucleic Acids Research, 33, D294

A Giongo, 2010, TaxCollector: modifying current 16S rRNA databases for the rapid classification at six taxonomic levels, Diversity, 2, 1015, 10.3390/d2071015

MJ Claesson, 2011, Composition, variability, and temporal stability of the intestinal microbiota of the elderly, Proceedings of the National Academy of Sciences, 108, 4586, 10.1073/pnas.1000097107

2012, Structure, function and diversity of the healthy human microbiome, Nature, 486, 207, 10.1038/nature11234

T Ding, 2014, Dynamics and associations of microbial community types across the human body, Nature, 509, 357, 10.1038/nature13178

AF Koeppel, 2013, Surprisingly extensive mixed phylogenetic and ecological signals among bacterial Operational Taxonomic Units, Nucleic Acids Research, gkt241

SL Westcott, 2015, De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units, PeerJ, 3, e1487, 10.7717/peerj.1487

A May, 2014, Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations, Bioinformatics, 30, 1530, 10.1093/bioinformatics/btu085

JM Flynn, 2015, Toward accurate molecular identification of species in complex environmental samples: testing the performance of sequence filtering and clustering methods, Ecology and Evolution, 5, 2252, 10.1002/ece3.1497

JR White, 2010, Alignment and clustering of phylogenetic markers-implications for microbial diversity studies, BMC Bioinformatics, 11, 1, 10.1186/1471-2105-11-152

X Wang, 2013, M-pick, a modularity-based method for OTU picking of 16S rRNA sequences, BMC Bioinformatics, 14, 1, 10.1186/1471-2105-14-43

C Lozupone, 2005, UniFrac: a new phylogenetic method for comparing microbial communities, Applied and Environmental Microbiology, 71, 8228, 10.1128/AEM.71.12.8228-8235.2005

F Corpet, 1988, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Research, 16, 10881, 10.1093/nar/16.22.10881

A Krause, 2005, Large scale hierarchical clustering of protein sequences, BMC Bioinformatics, 6, 1, 10.1186/1471-2105-6-15