Quantitative synteny scoring improves homology inference and partitioning of gene families

BMC Bioinformatics - Volume 14 - Pages 1-9 - 2013
Raja Hashim Ali1,2, Sayyed Auwn Muhammad1,2,3, Mehmood Alam Khan1,2, Lars Arvestad2,3,4
1KTH Royal Institute of Technology, School of Computer Science and Communication, Department of Computational Biology, Stockholm, Sweden
2Science for Life Laboratory, Karolinska Institutet Science Park, Solna, Sweden
3Swedish e-Science Research Center, Sweden
4Department of Numerical Analysis and Computer Science, Stockholm University, Stockholm, Sweden

Abstract

Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based either on direct similarity between gene pairs or on some form of network structure in which the edge weights of a constructed graph are based on similarity. However, conserved synteny is an important signal that can help distinguish homology, and it has not been utilized to its fullest potential. Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets: fully sequenced eukaryotic genomes and synthetic data. The results from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy and the definition of synteny scores are the most valuable contributions of GenFamClust.
