Tools for simulating evolution of aligned genomic regions with integrated parameter estimation

Genome Biology - Tập 9 - Trang 1-12 - 2008
Avinash Varadarajan1, Robert K Bradley2, Ian H Holmes2,3
1Computer Science Division, University of California, Berkeley, USA
2Biophysics Graduate Group, University of California, Berkeley, USA
3Department of Bioengineering, University of California, Berkeley, USA

Tóm tắt

Controlled simulations of genome evolution are useful for benchmarking tools. However, many simulators lack extensibility and cannot measure parameters directly from data. These issues are addressed by three new open-source programs: GSIMULATOR (for neutrally evolving DNA), SIMGRAM (for generic structured features) and SIMGENOME (for syntenic genome blocks). Each offers algorithms for parameter measurement and reconstruction of ancestral sequence. All three tools out-perform the leading neutral DNA simulator (DAWG) in benchmarks. The programs are available at http://biowiki.org/SimulationTools .

Tài liệu tham khảo

Pedersen JS, Hein J: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics. 2003, 19: 219-227. Bais AS, Grossmann S, Vingron M: Incorporating evolution of transcription factor binding sites into annotated alignments. J Biosci. 2007, 32: 841-850. Pollard DA, Bergman CM, Stoye J, Celniker SE, Eisen MB: Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics. 2004, 5: 6- Evans J, Sheneman L, Foster J: Relaxed neighbor joining: a fast distance-based phylogenetic tree construction method. J Mol Evol. 2006, 62: 785-792. Rasmussen MD, Kellis M: Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res. 2007, 17: 1932-1942. Simmons MP, Müller K, Norton AP: The relative performance of indel-coding methods in simulations. Mol Phylogenet Evol. 2007, 44: 724-740. Albà MM, Castresana J: On homology searches by protein Blast and the characterization of the age of genes. BMC Evol Biol. 2007, 7: 53- Nuin PAS, Wang Z, Tillier ERM: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics. 2006, 7: 471- Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics. 1998, 14: 157-163. Cartwright RA: DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics. 2005, 21 (Suppl 3): iii31-iii38. Miklós I, Lunter G, Holmes I: A "long indel" model for evolutionary sequence alignment. Mol Biol Evol. 2004, 21: 529-540. Gesell T, Washietl S: Dinucleotide controlled null models for comparative RNA gene prediction. BMC Bioinformatics. 2008, 9: 248- Pang A, Smith AD, Nuin PAS, Tillier ERM: SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics. 2005, 6: 236- Strope CL, Scott SD, Moriyama EN: indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol Biol Evol. 2007, 24: 640-649. Hall BG: Simulating DNA coding sequence evolution with EvolveAGene 3. Mol Biol Evol. 2008, 25: 688-695. Beiko RG, Charlebois RL: A simulation test bed for hypotheses of genome evolution. Bioinformatics. 2007, 23: 825-831. Huang W, Nevins JR, Ohler U: Phylogenetic simulation of promoter evolution: estimation and modeling of binding site turnover events and assessment of their impact on alignment tools. Genome Biol. 2007, 8: R225- Siepel A, Haussler D: Computational identification of evolutionarily conserved exons. Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology; San Diego: 27-31 March 2004. Edited by: Bourne P, Gusfield D. 2004, ACM, 177-186. Whelan S, de Bakker PI, Goldman N: Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics. 2003, 19: 1556-1563. Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, Pollard DA, Sackton TB, Larracuente AM, Singh ND, Abad JP, Abt DN, Adryan B, Aguade M, Akashi H, Anderson WW, Aquadro CF, Ardell DH, Arguello R, Artieri CG, Barbash DA, Barker D, Barsanti P, Batterham P, Batzoglou S, Begun D, et al: Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007, 450: 203-218. Bradley RK, Holmes I: Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics. 2007, 23: 3258-3262. Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics. 2006, 7: 428- Sankoff D, Blanchette M: Multiple genome rearrangement and breakpoint phylogeny. J Comput Biol. 1998, 5: 555-570. Korkin D, Goldfarb L: Multiple genome rearrangement: a general approach via the evolutionary genome graph. Bioinformatics. 2002, 18 (Suppl 1): S303-S311. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics. 2003, 19 (Suppl 1): i54-i62. Darling ACE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14: 1394-1403. Sankoff D, Trinh P: Chromosomal breakpoint reuse in genome sequence rearrangement. J Comput Biol. 2005, 12: 812-821. Ma J, Zhang L, Suh BB, Raney BJ, Brian J, Burhans RC, Kent WJ, Blanchette M, Haussler D, Miller W: Reconstructing contiguous regions of an ancestral genome. Genome Res. 2006, 16: 1557-1565. Vinh le S, Varón A, Wheeler WC: Pairwise alignment with rearrangements. Genome Inform. 2006, 17: 141-151. Bhutkar A, Russo S, Smith TF, Gelbart WM: Techniques for multi-genome synteny analysis to overcome assembly limitations. Genome Inform. 2006, 17: 152-161. Holmes I: Phylocomposer and phylodirector: analysis and visualization of transducer indel models. Bioinformatics. 2007, 23: 3263-3264. Hein J, Schierup M, Wiuf C: Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. 2005, Oxford, UK: Oxford University Press Etheridge A: An Introduction to Superprocesses. 2000, Providence, RI: American Mathematical Society Arenas M, Posada D: Recodon: coalescent simulation of coding DNA sequences with recombination, migration and demography. BMC Bioinformatics. 2007, 8: 458- Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, De Iorio M, Balding DJ: Sequence-level population simulations over large genomic regions. Genetics. 2007, 177: 1725-1731. Antao T, Beja-Pereira A, Luikart G: MODELER4SIMCOAL2: a user-friendly, extensible modeler of demography and linked loci for coalescent simulations. Bioinformatics. 2007, 23: 1848-1850. BioWiki: Simulation tools. [http://biowiki.org/SimulationTools] Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999, 15: 446-454. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006, 2: e33- Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB: MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004, 5: R98- Bruno WJ: Modelling residue usage in aligned protein sequences via maximum likelihood. Mol Biol Evol. 1996, 13: 1368-1374. Coin L, Durbin R: Improved techniques for the identification of pseudogenes. Bioinformatics. 2004, 20 (Suppl 1): i94-i100. Thorne JL, Goldman N, Jones DT: Combining protein evolution and secondary structure. Mol Biol Evol. 1996, 13: 666-673. Siepel A, Haussler D: Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004, 21: 468-488. BioWiki: XrateFormat. [http://biowiki.org/XrateFormat] Kosiol C, Holmes I, Goldman N: An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007, 24: 1464-1479. Dowell RD, Eddy SR: Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics. 2006, 7: 400- De Rijk P, Caers A, Van de Peer Y, De Wachter R: Database on the structure of large ribosomal subunit RNA. Nucleic Acids Res. 1998, 26: 183-6. Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras AN, Ruby JG, Brennecke J, Curators HF, Project BDG, Hodges E, Hinrichs AS, Caspi A, Paten B, Park S, Han MV, Maeder ML, Polansky BJ, Robson BE, Aerts S, van Helden J, Hassan B, Gilbert DG, Eastman DA, Rice M, Weir M, et al: Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 2007, 450: 219-232. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E: Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res. 2008, doi:10.1101/gr.076554.108. Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, Ashburner M, Celniker SE: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 2002, 3: research0084.1-0084.20. Wilson RJ, Goodman JL, Strelets VB, FlyBase Consortium: FlyBase: integration and improvements to query tools. Nucleic Acids Res. 2008, 36 (Database issue): D588-D593. Franz G, Savakis C: Minos, a new transposable element from Drosophila hydei, is a member of the Tc1-like family of transposons. Nucleic Acids Res. 1991, 19: 6646- Metaxakis A, Oehler S, Klinakis A, Savakis C: Minos as a genetic and genomic tool in Drosophila melanogaster. Genetics. 2005, 171: 571-581. Pollard KS, Salama SR, Lambert N, Lambot M, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M, Vanderhaeghen P, Haussler D: An RNA gene expressed during cortical development evolved rapidly in humans. Nature. 2006, 443: 167-172. Rose D, Hackermüller J, Washietl S, Reiche K, Hertel J, Findeiss S, Stadler PF, Prohaska SJ: Computational RNomics of drosophilids. BMC Genomics. 2007, 8: 406- Pheasant M, Mattick JS: Raising the estimate of functional human sequences. Genome Res. 2007, 17: 1245-1253. Babak T, Blencowe BJ, Hughes TR: Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics. 2007, 8: 33- Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 2003, 31: 3423-3428. Waddell PJ, Steel MA: General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol. 1997, 8: 398-414. Holmes I: Studies in probabilistic sequence alignment and evolution. PhD thesis. 1998, Department of Genetics, University of Cambridge; The Wellcome Trust Sanger Institute, [http://biowiki.org/PaperArchive] Arvestad L, Berglund AC, Lagergren J, Sennblad B: Bayesian Gene/Species Tree Reconciliation and Orthology Analysis Using MCMC. Bioinformatics. 2003, 19 (Suppl 1): i7-i15. Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N: Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 2005, 15: 1153-1160. Gillespie DT: Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977, 81: 2340-2361. Lunter G, Hein J: A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics. 2004, 20 (Suppl 1): i216-i223. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge, UK: Cambridge University Press Hein J: An algorithm for statistical alignment of sequences related by a binary tree. Pacific Symposium on Biocomputing. Edited by: Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE. 2001, Singapore: World Scientific, 179-190. Holmes I, Bruno WJ: Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics. 2001, 17: 803-820. Holmes I: Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics. 2003, 19 (Suppl 1): i147-i157. Jukes TH, Cantor C: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, New York: Academic Press, 21-132. Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991, 33: 114-124. BioWiki: Stockholm Format. [http://biowiki.org/StockholmFormat] Gilks W, Richardson S, Spiegelhalter D: Markov Chain Monte Carlo in Practice. 1996, London, UK: Chapman & Hall