A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

BMC Bioinformatics - Tập 11 - Trang 1-14 - 2010
David J Russell1, Samuel F Way1, Andrew K Benson2,3, Khalid Sayood1
1Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, USA
2Department of Food Science and Technology, University of Nebraska-Lincoln, Lincoln, USA
3Core for Applied Genomics and Ecology, University of Nebraska-Lincoln, Lincoln, USA

Tóm tắt

We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

Tài liệu tham khảo

Holm L, Sander C: Removing Near-Neighbour Redundancy from Large Protein Sequence Collections. Bioinformatics 1998, 14(5):423–429. 10.1093/bioinformatics/14.5.423 Li W, Jaroszewski L, Godzik A: Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases. Bioinformatics 2001, 17(3):282–283. 10.1093/bioinformatics/17.3.282 Li W, Jaroszewski L, Godzik A: Tolerating some Redundancy Significantly Speeds up Clustering of Large Protein Databases. Bioinformatics 2002, 18: 77–82. 10.1093/bioinformatics/18.1.77 Parsons JD: Improved Tools for DNA Comparison and Clustering. Computer Applications in the Biosciences 1995, 11(6):603–613. Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG: Sequence Embedding for Fast Construction of Guide Trees for Multiple Sequence Alignment. Algorithms for Molecular Biology 2010., 5(21): Li W, Godzik A: Cd-hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. Bioinformatics 2006, 22(13):1658–1659. 10.1093/bioinformatics/btl158 Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R: Bacterial Community Variation in Human Body Habitats Across Space and Time. Science 2009, 326: 1694–1697. 10.1126/science.1177486 Edgar RC: Search and Clustering Orders of Magnitude Faster than BLAST. Bioinformatics 2010, 26(19):2460–2461. 10.1093/bioinformatics/btq461 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389 Lempel A, Ziv J: On the Complexity of Finite Sequences. IEEE Transactions on Information Theory 1976, 22: 75–81. 10.1109/TIT.1976.1055501 Nevill-Manning CG, Witten IH: Compression and Explanation using Hierarchical Grammars. The Computer Journal 1997, 40(2/3):103–116. 10.1093/comjnl/40.2_and_3.103 Ziv J, Lempel A: A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 1977, 23(3):337–343. 10.1109/TIT.1977.1055714 Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Rasala A, Sahai A, Shelat A: Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models. In STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. New York, NY, USA: ACM; 2002:792–801. Ziv J, Lempel A: Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory 1978, 24(5):530–536. 10.1109/TIT.1978.1055934 Benedetto D, Caglioti E, Loreto V: Language Trees and Zipping. Physical Review Letters 2002., 88(4): 10.1103/PhysRevLett.88.048702 Otu HH, Sayood K: A New Sequence Distance Measure for Phylogenetic Tree Construction. Bioinformatics 2003, 19(16):2122–2130. 10.1093/bioinformatics/btg295 Russell DJ, Otu HH, Sayood K: Grammar-Based Distance in Progressive Multiple Sequence Alignment. BMC Bioinformatics 2008., 9(306): Puglisi A, Benedetto D, Caglioti E, Loreto V, Vulpiani A: Data Compression and Learning in Time Sequences Analysis. Physica D: Nonlinear Phenomena 2003, 180: 92–107. 10.1016/S0167-2789(03)00047-2 Bastola DR, Otu HH, Doukas SE, Sayood K, Hinrichs SH, Iwen PC: Utilization of the Relative Complexity Measure to Construct a Phylogenetic Tree for Fungi. Mycological Research 2004, 108(2):117–125. 10.1017/S0953756203009079 Weiner P: Linear Pattern Matching Algorithms. 14th Annual Symposium on Switching and Automata Theory 1973, 1–11. full_text McCreight EM: A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM 1976, 23(2):262–272. 10.1145/321941.321946 Ukkonen E: On-Line Construction of Suffix Trees. Algorithmica 1995, 14(3):249–260. 10.1007/BF01206331 Wilbur WJ, Lipman DJ: Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences of the United States of America 1983, 80: 726–730. 10.1073/pnas.80.3.726 Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Research 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673 Halkidi M, Batistakis Y, Vazirgiannis M: On Clustering Validation Techniques. Journal of Intelligent Information Systems 2001, 17(2–3):107–145. 10.1023/A:1012801612483 Li W: Analysis and Comparison of Very Large Metagenomes with Fast Clustering and Functional Annotation. BMC Bioinformatics 2009., 10(359):