EM for phylogenetic topology reconstruction on nonhomogeneous data

Esther Ibáñez-Marcelo (1), Marta Casanellas (2)
(1) Centre de Recerca Matemàtica, Bellaterra, Barcelona, Spain
(2) Departament Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain

Abstract

Reconstructing the phylogenetic tree topology of four taxa remains one of the main challenges in phylogenetics. The difficulties lie in using evolutionary models that are not too restrictive and in dealing correctly with the long-branch attraction problem. Correct reconstruction of 4-taxon trees is crucial for quartet-based methods to work and to recover large phylogenies. We adapt the well-known expectation-maximization algorithm to (time-nonhomogeneous) evolutionary Markov models on phylogenetic 4-taxon trees, and use it to estimate the substitution parameters, compute the corresponding likelihood, and infer the most likely quartet. We study its success in reconstructing 4-taxon topologies and its performance as an input method for quartet-based phylogenetic reconstruction methods such as QFIT and QuartetSuite. Our results show that the proposed method outperforms neighbor-joining and the usual (time-homogeneous, continuous-time) maximum-likelihood methods on 4-leaved trees with among-lineage instantaneous rate heterogeneity, and performs similarly to the usual continuous-time maximum likelihood when the data satisfy the assumptions of both methods. The method is well suited for reconstructing the topology of any number of taxa via quartet-based methods and is highly accurate, especially on highly divergent trees and time-nonhomogeneous data.
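The pipeline summarized above (estimate substitution parameters by EM, compute the likelihood, pick the most likely quartet) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a general Markov model on nucleotides with one unrestricted 4x4 transition matrix per edge, a root placed at one of the two internal nodes, a single diagonally dominant random start, and a fixed number of iterations. The names `joint`, `em_quartet` and `best_quartet` are hypothetical, and `counts[i, j, k, l]` is assumed to hold the number of alignment columns showing states i, j, k, l at leaves 1 to 4.

```python
import numpy as np

def joint(pi, M1, M2, Mv, M3, M4):
    """Joint probability p[a,b,i,j,k,l] on the tree ((1,2),(3,4)).
    Hidden node a (root) is the parent of leaves 1 and 2; hidden node b
    is the parent of leaves 3 and 4; Mv is the matrix on the inner edge."""
    return np.einsum("a,ai,aj,ab,bk,bl->abijkl", pi, M1, M2, Mv, M3, M4)

def row_normalize(M):
    return M / M.sum(axis=1, keepdims=True)

def em_quartet(counts, n_iter=200, seed=0):
    """EM for the topology whose cherries are the first two and the last two
    axes of `counts`. Returns (log-likelihood, fitted parameters)."""
    rng = np.random.default_rng(seed)
    pi = np.full(4, 0.25)
    # Assumption: diagonally dominant random starting matrices.
    M1, M2, Mv, M3, M4 = (0.7 * np.eye(4) + 0.3 * rng.dirichlet(np.ones(4), size=4)
                          for _ in range(5))
    for _ in range(n_iter):
        # E-step: expected counts of (hidden states, leaf pattern).
        p = joint(pi, M1, M2, Mv, M3, M4)
        marg = p.sum(axis=(0, 1))                      # pattern probabilities
        w = p * (counts / np.where(marg > 0, marg, 1.0))
        # M-step: renormalize the expected counts (axes: a,b,i,j,k,l).
        pi = w.sum(axis=(1, 2, 3, 4, 5))
        pi = pi / pi.sum()
        M1 = row_normalize(w.sum(axis=(1, 3, 4, 5)))   # keeps a, i
        M2 = row_normalize(w.sum(axis=(1, 2, 4, 5)))   # keeps a, j
        Mv = row_normalize(w.sum(axis=(2, 3, 4, 5)))   # keeps a, b
        M3 = row_normalize(w.sum(axis=(0, 2, 3, 5)))   # keeps b, k
        M4 = row_normalize(w.sum(axis=(0, 2, 3, 4)))   # keeps b, l
    marg = joint(pi, M1, M2, Mv, M3, M4).sum(axis=(0, 1))
    seen = counts > 0
    loglik = float(np.sum(counts[seen] * np.log(marg[seen])))
    return loglik, (pi, M1, M2, Mv, M3, M4)

def best_quartet(counts):
    """Fit each of the three quartet topologies and return the one with the
    highest log-likelihood, together with all three scores."""
    # Each permutation reorders the leaf axes so that the cherries come first.
    perms = {"12|34": (0, 1, 2, 3), "13|24": (0, 2, 1, 3), "14|23": (0, 3, 1, 2)}
    scores = {name: em_quartet(counts.transpose(p))[0] for name, p in perms.items()}
    return max(scores, key=scores.get), scores
```

Because the three topologies have the same number of free parameters under this model, comparing raw log-likelihoods is a reasonable selection rule in the sketch. EM only guarantees convergence to a local maximum, so a practical version would likely use several random restarts and a convergence criterion rather than a fixed iteration count.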
