Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data

Springer Science and Business Media LLC - Volume 7 - Pages 1-18 - 2007
Frédéric Lemoine (1,2), Olivier Lespinet (1), Bernard Labedan (1)
(1) Institut de Génétique et Microbiologie, CNRS UMR 8621, Orsay Cedex, France
(2) Laboratoire de Recherche en Informatique, CNRS UMR 8623, Orsay Cedex, France

Abstract

Comparison of completely sequenced microbial genomes has revealed how fluid these genomes are. Detecting synteny blocks requires reliable methods for determining the orthologs among the whole set of homologs detected by exhaustive comparisons between each pair of completely sequenced genomes. This is a complex and difficult problem in comparative genomics, but solving it will help to better understand how prokaryotic genomes evolve. We have developed a suite of programs that automate three essential steps in studying the conservation of gene order, and validated them with a set of 107 bacteria and archaea that cover the majority of the prokaryotic taxonomic space. We identified the whole set of homologs shared between two or more species and computed the evolutionary distance separating each pair of homologs. We applied two strategies to extract from this set a collection of valid orthologs shared by at least two genomes. The first computes the Reciprocal Smallest Distance (RSD) using the PAM distances separating pairs of homologs. The second groups homologs into families and reconstructs each family's evolutionary tree, distinguishing bona fide orthologs as well as paralogs created after the last speciation event. Although the phylogenetic tree method often succeeds where RSD fails, the reverse is occasionally true. Accordingly, we used the data obtained with either method, or with their intersection, to count the orthologs that are adjacent in each pair of genomes, the Positional Orthologous Genes (POGs), and to further study their properties. Once all these synteny blocks had been detected, we showed that POGs are subject to more evolutionary constraints than orthologs outside synteny groups, whatever the taxonomic distance separating the compared organisms.
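
The reciprocal-smallest-distance criterion mentioned above can be illustrated with a minimal Python sketch. It is not the authors' program: the function name, the dictionary layout, the gene labels and the PAM values are all hypothetical, and ties are resolved arbitrarily (the first pair seen wins). The only idea taken from the abstract is that two homologs are retained as orthologs when each is the other's nearest homolog, by PAM distance, in the opposite genome.

```python
def reciprocal_smallest_distance(distances):
    """Return the (gene_a, gene_b) pairs that are each other's nearest homolog.

    `distances` maps (gene_a, gene_b) -> PAM distance, with gene_a taken
    from genome A and gene_b from genome B (hypothetical input layout).
    """
    best_for_a = {}  # gene in A -> (closest gene in B, distance)
    best_for_b = {}  # gene in B -> (closest gene in A, distance)
    for (a, b), d in distances.items():
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep a pair only when the "smallest distance" relation holds both ways.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Invented PAM distances between homologs of two toy genomes.
pam = {
    ("a1", "b1"): 20.0,
    ("a1", "b2"): 55.0,
    ("a2", "b1"): 35.0,
    ("a2", "b2"): 80.0,
    ("a3", "b3"): 40.0,
}
rsd_pairs = reciprocal_smallest_distance(pam)
print(rsd_pairs)
```

In this toy input, a2 is closest to b1, but b1 is closer to a1, so a2 is left without an ortholog. Cases of this kind are where a tree-based method may reach a different conclusion.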
The suite of programs described in this paper allows reliable detection of orthologs and is useful for evaluating gene order conservation in prokaryotes whatever their taxonomic distance. Our approach should therefore facilitate the rapid identification of POGs in the next few years, as we expect to be inundated with thousands of completely sequenced microbial genomes.
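
As a rough illustration of what identifying POGs involves, the Python sketch below flags ortholog pairs whose genes are immediate neighbours in both genomes. It rests on several assumptions that the abstract does not state: orthologs are one-to-one, each genome is given as a linear list of gene names (the wrap-around of circular chromosomes is ignored), and no intervening genes are tolerated. The function name and the toy data are hypothetical.

```python
def positional_orthologs(order_a, order_b, orthologs):
    """Return the ortholog pairs that are adjacent in both genomes.

    order_a, order_b: gene names in chromosomal order (treated as linear).
    orthologs: iterable of one-to-one (gene_a, gene_b) pairs.
    """
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Walk the orthologs along genome A, skipping genes with unknown position.
    pairs = sorted(
        (p for p in orthologs if p[0] in pos_a and p[1] in pos_b),
        key=lambda p: pos_a[p[0]],
    )
    pogs = set()
    for (a1, b1), (a2, b2) in zip(pairs, pairs[1:]):
        adjacent_in_a = pos_a[a2] - pos_a[a1] == 1
        # abs() accepts either relative orientation in genome B.
        adjacent_in_b = abs(pos_b[b2] - pos_b[b1]) == 1
        if adjacent_in_a and adjacent_in_b:
            pogs.update([(a1, b1), (a2, b2)])
    return sorted(pogs)


# Toy genomes: a1-a2 and a4-a5 keep their neighbourhood in B, a3 does not.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b2", "b1", "b7", "b9", "b8", "b5", "b4"]
orthologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b4"), ("a5", "b5")]
pogs = positional_orthologs(order_a, order_b, orthologs)
print(pogs)
```

Here the pair (a3, b9) is an ortholog outside any synteny block. Comparing the evolutionary distances of such pairs with those of the POGs is the kind of analysis the abstract summarises.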
