A shortest-path graph kernel for estimating gene product semantic similarity

Journal of Biomedical Semantics - Volume 2, Issue 1 - 2011
Marco Álvarez1, Xiaojun Qi1, Cong Yan2
1Department of Computer Science, Utah State University, Logan, 84322, USA
2Department of Computer Science, North Dakota State University, Fargo, 58108, USA

Abstract

Background: Existing methods for calculating semantic similarity between gene products using the Gene Ontology (GO) often rely on external resources that are not part of the ontology. Consequently, changes in these external resources, such as a biased term distribution caused by shifts in hot research topics, affect the calculated semantic similarity. One way to avoid this problem is to use semantic methods that are "intrinsic" to the ontology, i.e., independent of external knowledge.

Results: We present a shortest-path graph kernel (spgk) method that relies exclusively on the GO and its structure. In spgk, a gene product is represented by an induced subgraph of the GO, which consists of all the GO terms annotating it. A shortest-path graph kernel is then used to compute the similarity between two such graphs. In a comprehensive evaluation using a benchmark dataset, spgk compares favorably with other methods that depend on external resources. Compared with simUI, a method that is also intrinsic to the GO, spgk achieves slightly better results on the benchmark dataset. Statistical tests show that the improvement is significant when performance is measured by the resolution and by the EC similarity correlation coefficient, but not significant when it is measured by the Pfam similarity correlation coefficient.

Conclusions: Spgk uses a graph kernel method that runs in polynomial time to exploit the structure of the GO when calculating semantic similarity between gene products. It provides an alternative, with comparable performance, to both methods that use external resources and other "intrinsic" methods.
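The idea of representing each gene product as a GO subgraph and comparing two subgraphs with a shortest-path kernel can be illustrated with a minimal sketch. This is not the authors' implementation. The toy ontology `TOY_GO`, the function names, the choice to add the ancestors of the annotating terms to the subgraph, the treatment of edges as undirected, the exact-match (delta) comparison of term pairs and path lengths, and the cosine-style normalization are all assumptions made for illustration.

```python
from collections import Counter, deque
from math import sqrt

# Hypothetical toy ontology: child term -> list of parent terms.
TOY_GO = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["B"],
}


def induced_terms(annotations, parents):
    """Collect the annotating terms plus all of their ancestors (assumption)."""
    seen = set()
    stack = list(annotations)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents[term])
    return seen


def shortest_path_features(terms, parents):
    """Count (term, term, distance) triples in the undirected induced subgraph."""
    adjacency = {t: set() for t in terms}
    for child in terms:
        for parent in parents[child]:
            if parent in terms:
                adjacency[child].add(parent)
                adjacency[parent].add(child)
    features = Counter()
    for source in terms:
        # Breadth-first search gives shortest path lengths on unit-weight edges.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for target, d in dist.items():
            if source < target:  # count each unordered pair once
                features[(source, target, d)] += 1
    return features


def sp_kernel(f1, f2):
    """Delta kernel: count shortest paths with equal endpoints and length."""
    return sum(count * f2[key] for key, count in f1.items())


def spgk_similarity(ann1, ann2, parents=TOY_GO):
    """Normalized kernel value in [0, 1] for two annotation sets."""
    f1 = shortest_path_features(induced_terms(ann1, parents), parents)
    f2 = shortest_path_features(induced_terms(ann2, parents), parents)
    denom = sqrt(sp_kernel(f1, f1) * sp_kernel(f2, f2))
    return sp_kernel(f1, f2) / denom if denom else 0.0
```

Calling `spgk_similarity({"C"}, {"D"})` compares a gene product annotated with the toy term `C` against one annotated with `D`. Running breadth-first search from every term keeps the cost polynomial in the subgraph size, consistent with the polynomial-time claim in the abstract.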

