Automatic selection of partitioning schemes for phylogenetic analyses using iterative k-means clustering of site rates

Springer Science and Business Media LLC - Volume 15 - Pages 1-17 - 2015
Paul B Frandsen1,2, Brett Calcott3, Christoph Mayer4, Robert Lanfear5,6,7
1Office of Research Information Services, Office of the CIO, Smithsonian Institution, Washington, USA
2Department of Entomology, Rutgers University, New Brunswick, USA
3School of Life Sciences, Arizona State University, Tempe, USA
4Zoologisches Forschungsmuseum Alexander Koenig (ZFMK), Zentrum für Molekulare Biodiversitätsforschung (zmb), Bonn, Germany
5Ecology, Evolution and Genetics Research School of Biology, Australian National University, Canberra, Australia
6National Evolutionary Synthesis Center, Durham, USA
7Department of Biological Sciences, Macquarie University, Sydney, Australia

Abstract

Model selection is a vital part of most phylogenetic analyses, and accounting for heterogeneity in evolutionary patterns across sites is particularly important. Mixture models and partitioning are commonly used to account for this variation, and partitioning is the most popular approach. Most current partitioning methods require an a priori partitioning scheme to be defined, typically guided by known structural features of the sequences, such as gene boundaries or codon positions. Recent evidence suggests that these a priori boundaries often fail to adequately account for variation in rates and patterns of evolution among sites. Furthermore, new phylogenomic datasets, such as those assembled from ultra-conserved elements, lack obvious structural features on which to define a priori partitioning schemes. As a result, for many phylogenetic datasets, partitioned models of molecular evolution may be inadequate, which limits the accuracy of downstream phylogenetic analyses. We present a new algorithm that automatically selects a partitioning scheme by iteratively dividing the alignment into subsets of similar sites based on their rates of evolution. We compare this method to existing approaches using a wide range of empirical datasets, and show that it leads to large increases in the fit of partitioned models of molecular evolution when measured using AICc and BIC scores. In doing so, we demonstrate that some related approaches to solving this problem may have been associated with a small but important bias. Our method provides an alternative to traditional approaches to partitioning, such as dividing alignments by gene and codon position. Because our method is data-driven, it can be used to estimate partitioned models for all types of alignments, including those that are not amenable to traditional approaches to partitioning.
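The abstract describes the algorithm only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that per-site evolutionary rates have already been estimated, that each step splits one subset into two clusters with 1-D k-means (k = 2), and that a split is kept only if it lowers an information-criterion score for the whole scheme. The function names (`two_means_1d`, `iterative_kmeans_partition`, `toy_bic`), the `min_size` parameter, and the Gaussian toy likelihood are all hypothetical and stand in for a real partitioned phylogenetic likelihood. The `aicc` and `bic` helpers use the standard definitions, with log-likelihood `log_lik`, `k` free parameters, and `n` sites.

```python
import math
from typing import Callable, List, Sequence, Tuple

Scheme = List[List[int]]  # a partitioning scheme: lists of site indices


def aicc(log_lik: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion; lower is better."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_lik + k * math.log(n)


def two_means_1d(values: Sequence[float], max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Lloyd-style 2-means on 1-D data; returns (low, high) position lists."""
    c0, c1 = min(values), max(values)
    if c0 == c1:  # all values identical: nothing to split
        return list(range(len(values))), []
    low: List[int] = []
    high: List[int] = []
    for _ in range(max_iter):
        new_low = [i for i, v in enumerate(values) if abs(v - c0) <= abs(v - c1)]
        new_high = [i for i, v in enumerate(values) if abs(v - c0) > abs(v - c1)]
        if new_low == low and new_high == high:
            break  # assignments stopped changing
        low, high = new_low, new_high
        c0 = sum(values[i] for i in low) / len(low)
        c1 = sum(values[i] for i in high) / len(high)
    return low, high


def iterative_kmeans_partition(
    rates: Sequence[float],
    score_scheme: Callable[[Scheme], float],
    min_size: int = 2,  # assumption: do not create subsets smaller than this
) -> Scheme:
    """Repeatedly split subsets of sites by rate while the scheme score improves."""
    scheme: Scheme = [list(range(len(rates)))]
    best = score_scheme(scheme)
    idx = 0
    while idx < len(scheme):
        subset = scheme[idx]
        a, b = two_means_1d([rates[i] for i in subset])
        left = [subset[i] for i in a]
        right = [subset[i] for i in b]
        if min(len(left), len(right)) >= min_size:
            candidate = scheme[:idx] + [left, right] + scheme[idx + 1:]
            cand_score = score_scheme(candidate)
            if cand_score < best:
                scheme, best = candidate, cand_score
                continue  # try to split the new subset at this position again
        idx += 1
    return scheme


def toy_bic(scheme: Scheme, rates: Sequence[float], sigma: float = 0.1) -> float:
    """Toy stand-in scorer: Gaussian likelihood of rates, one mean per subset."""
    log_lik = 0.0
    for subset in scheme:
        vals = [rates[i] for i in subset]
        mu = sum(vals) / len(vals)
        for v in vals:
            log_lik += -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
    return bic(log_lik, k=len(scheme), n=len(rates))


# Made-up rates for nine sites: three slow, three intermediate, three fast.
rates = [0.10, 0.12, 0.11, 1.00, 1.05, 0.95, 3.00, 3.10, 2.90]
scheme = iterative_kmeans_partition(rates, lambda s: toy_bic(s, rates))
print(scheme)
```

On this made-up input the sketch separates the slow, intermediate, and fast sites into three subsets and then stops, because any further split would fall below `min_size`. In a real analysis the scoring callback would fit substitution models to each subset with phylogenetic software and return the AICc or BIC of the whole partitioned model; scoring the full scheme rather than summing per-subset scores matters because the AICc correction term and the BIC penalty both depend on the total number of sites.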
