Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests

Systematic Biology - Volume 53, Issue 5 - Pages 793-808 - 2004
David Posada1, Thomas R. Buckley2,3
1 Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Universidad de Vigo, Vigo 36200, Spain
3 Landcare Research, Private Bag 92170, Auckland, New Zealand
