Search and clustering orders of magnitude faster than BLAST

Bioinformatics - Volume 26, Issue 19 - Pages 2460-2461 - 2010
Robert C. Edgar
Tiburon, CA 94920, USA

Abstract

Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification.

Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

References

Altschul, 1990, Basic local alignment search tool, J. Mol. Biol., 215, 403, 10.1016/S0022-2836(05)80360-2

Butte, 2001, Challenges in bioinformatics: infrastructure, models and analytics, Trends Biotechnol., 19, 159, 10.1016/S0167-7799(01)01603-1

Costello, 2009, Bacterial community variation in human body habitats across space and time, Science, 326, 1694, 10.1126/science.1177486

Edgar, 2004, Local homology recognition and distance measures in linear time using compressed amino acid alphabets, Nucleic Acids Res., 32, 380, 10.1093/nar/gkh180

Edgar, 2004, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res., 32, 1792, 10.1093/nar/gkh340

Finn, 2008, The Pfam protein families database, Nucleic Acids Res., 36, D281, 10.1093/nar/gkm960

Gardner, 2009, Rfam: updates to the RNA families database, Nucleic Acids Res., 37, D136, 10.1093/nar/gkn766

Li, 2006, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22, 1658, 10.1093/bioinformatics/btl158