A Resource of Quantitative Functional Annotation for <i>Homo sapiens</i> Genes

G3: Genes, Genomes, Genetics - Volume 2, Issue 2 - Pages 223-233 - 2012
Murat Taşan [1,2], Harold J. Drabkin [3], John Beaver [1], Hon Nian Chua [1,2], Julie Dunham [2], Weidong Tian [4], Judith A. Blake [3], Frederick P. Roth [5,1,2,6]
[1] Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115
[2] Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, Toronto, Ontario M5S-3E1, Canada
[3] Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine 04609
[4] Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai 200433, P. R. China
[5] Center for Cancer Systems Biology, Dana Farber Cancer Institute, Boston, Massachusetts 02115
[6] Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto, Ontario M5G-1X5, Canada

Abstract

The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and made quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and manual evaluation against the biological literature. Functional-linkage networks (FLNs) were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented, alongside existing validated annotations, in a publicly accessible and searchable web interface.
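To make the two prediction styles named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy functional-linkage network with made-up gene names and edge weights, scores a gene by guilt-by-association as the weighted fraction of its linked neighbors that already carry a term, and blends that with a placeholder guilt-by-profiling score (which in practice would come from a classifier trained on the gene's own features). The function names, the linear blend, and the `weight` parameter are all illustrative assumptions.

```python
from typing import Dict, Set

# Hypothetical toy network: gene -> {neighbor: linkage weight}.
# Gene names and weights are illustrative only, not data from the paper.
Network = Dict[str, Dict[str, float]]


def guilt_by_association(network: Network, annotated: Set[str], gene: str) -> float:
    """Weighted fraction of a gene's linked neighbors that carry the term."""
    neighbors = network.get(gene, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0  # no linkage evidence for this gene
    hits = sum(w for n, w in neighbors.items() if n in annotated)
    return hits / total


def combine(gba_score: float, gbp_score: float, weight: float = 0.5) -> float:
    """Linear blend of the two scores (an assumed, simplified combination rule)."""
    return weight * gba_score + (1 - weight) * gbp_score


toy_network: Network = {
    "GENE_A": {"GENE_B": 0.75, "GENE_C": 0.25},
    "GENE_B": {"GENE_A": 0.75},
    "GENE_C": {"GENE_A": 0.25},
}
annotated_with_term = {"GENE_B"}  # genes already annotated with the term

gba = guilt_by_association(toy_network, annotated_with_term, "GENE_A")
# gbp_score is a stand-in for a profile-based classifier output.
combined = combine(gba, gbp_score=0.25)
print(gba, combined)
```

In this toy case GENE_A inherits most of its score from its strongly linked, annotated neighbor GENE_B, while the profile-based score pulls the combined value down; a gene absent from the network simply receives a guilt-by-association score of zero.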
