OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes

Genome Research - Volume 13, Issue 9 - Pages 2178-2189 - 2003
Li Li (1), Christian J. Stoeckert (2), David S. Roos (2)
(1) Department of Biology and Genetics, Center for Bioinformatics, and Genomics Institute, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
(2) Departments of Biology and Genetics, Center for Bioinformatics, and Genomics Institute, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

Abstract

The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.
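The Markov Cluster (MCL) step mentioned in the abstract can be illustrated with a toy example. The sketch below is not the OrthoMCL implementation: the protein identifiers, the similarity weights, the self-loop rule, and the inflation value of 2.0 are all assumptions chosen for illustration. It only shows the core MCL loop, which alternates expansion (squaring a column-stochastic matrix) with inflation (raising entries to a power and renormalizing) until the flow settles into separate clusters.

```python
# Toy Markov Cluster (MCL) run on a small protein-similarity graph.
# All names and weights below are hypothetical, not OrthoMCL data.

def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m

def expand(m):
    """Expansion step: matrix squaring, i.e. a two-step random walk."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Return clusters as sorted lists of node indices."""
    n = len(weights)
    m = [[float(weights[i][j]) for j in range(n)] for i in range(n)]
    # Assumed self-loop rule: each node gets its strongest edge weight.
    for i in range(n):
        m[i][i] = max(max(m[i]), 1.0)
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that still carry flow are attractors; the columns they
    # attract form one cluster. Duplicate member sets are merged.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical proteins from five species, in two gene families.
PROTEINS = ["hsa_A", "dme_A", "cel_A", "sce_B", "ath_B"]

# Hypothetical symmetric similarity weights (larger = more similar).
WEIGHTS = [
    [0, 10, 10, 0, 0],
    [10, 0, 10, 0, 0],
    [10, 10, 0, 1, 0],
    [0, 0, 1, 0, 10],
    [0, 0, 0, 10, 0],
]

GROUPS = [[PROTEINS[i] for i in cluster] for cluster in mcl(WEIGHTS)]
for group in GROUPS:
    print(group)
```

In this toy graph the weak edge between cel_A and sce_B is the only link between the two families. Inflation penalizes weak flow more than strong flow, so such a link tends to vanish over iterations. A higher inflation value generally yields finer-grained clusters.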

