Limits to robustness and reproducibility in the demarcation of operational taxonomic units

Wiley - Volume 17, Issue 5 - Pages 1689-1706 - 2015
Thomas Schmidt (1), João F. Matias Rodrigues (1), Christian von Mering (1)
(1) Institute for Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland

Abstract

The demarcation of operational taxonomic units (OTUs) from complex sequence data sets is a key step in contemporary studies of microbial ecology. However, as biologically motivated ‘optimal’ OTU‐binning algorithms remain elusive, many conceptually distinct approaches continue to be used. Using a global data set of 887,870 bacterial 16S rRNA gene sequences, we objectively quantified biases introduced by several widely employed sequence clustering algorithms. We found that OTU‐binning methods often provided surprisingly non‐equivalent partitions of identical data sets, notably when clustering to the same nominal similarity thresholds; and we quantified the resulting impact on ecological data description for a well‐defined human skin microbiome data set. We observed that some methods were very robust to varying clustering thresholds, while others were found to be highly susceptible even to slight threshold variations. Moreover, we comprehensively quantified the impact of the choice of 16S rRNA gene subregion, as well as of data set scope and context on algorithm performance. Our findings may contribute to an enhanced comparability of results across sequence‐processing pipelines, and we arrive at recommendations towards higher levels of standardization in established workflows.

