Domain fusion analysis by applying relational algebra to protein sequence and domain databases

BMC Bioinformatics - Volume 4 - Pages 1-10 - 2003
Kevin Truong1, Mitsuhiko Ikura1
1Department of Medical Biophysics, University of Toronto, Toronto, Canada

Abstract

Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases such as BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, and TIGRFAMs, and amalgamated domain databases such as InterPro, continue to grow in size and quality, a computational method for domain fusion analysis that leverages these efforts will become increasingly powerful. This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From the scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at http://calcium.uhnres.utoronto.ca/pi. As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
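To make the relational idea concrete, here is a minimal sketch of how a domain fusion search might be phrased as a self-join in standard SQL, run through Python's built-in `sqlite3` module. The table `hit`, its columns, the function name `find_fusion_links`, and the example identifiers are all hypothetical; the abstract does not give the authors' schema or queries. The criterion used is one common formulation of fusion analysis: two distinct proteins in the organism of interest are linked if each carries one of two domains that co-occur in a third, composite protein, and neither carries the other's domain.

```python
import sqlite3


def find_fusion_links(rows, organism):
    """Return (protein_a, protein_b, composite) triples for one organism.

    rows: iterable of (protein, organism, domain) domain assignments.
    Assumes protein identifiers are unique across organisms (hypothetical schema).
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE hit (protein TEXT, organism TEXT, domain TEXT)")
    con.executemany("INSERT INTO hit VALUES (?, ?, ?)", rows)

    query = """
        SELECT DISTINCT a.protein, b.protein, ca.protein
        FROM hit AS ca
        -- composite protein: two different domains on the same protein
        JOIN hit AS cb ON cb.protein = ca.protein AND ca.domain < cb.domain
        -- candidate partners in the organism of interest, one per domain
        JOIN hit AS a ON a.domain = ca.domain AND a.organism = ?
        JOIN hit AS b ON b.domain = cb.domain AND b.organism = ?
        WHERE a.protein <> b.protein
          AND a.protein <> ca.protein
          AND b.protein <> ca.protein
          -- each partner must lack the other partner's domain
          AND NOT EXISTS (SELECT 1 FROM hit AS x
                          WHERE x.protein = a.protein AND x.domain = cb.domain)
          AND NOT EXISTS (SELECT 1 FROM hit AS y
                          WHERE y.protein = b.protein AND y.domain = ca.domain)
        ORDER BY a.protein, b.protein, ca.protein
    """
    result = con.execute(query, (organism, organism)).fetchall()
    con.close()
    return result


# Toy data: one composite protein in C. elegans, two single-domain yeast proteins.
example_rows = [
    ("FUSED_CE", "C. elegans", "DomA"),
    ("FUSED_CE", "C. elegans", "DomB"),
    ("P1_SC", "S. cerevisiae", "DomA"),
    ("P2_SC", "S. cerevisiae", "DomB"),
    ("P3_SC", "S. cerevisiae", "DomC"),
]

if __name__ == "__main__":
    print(find_fusion_links(example_rows, "S. cerevisiae"))
```

Because the whole search is a single declarative query, it could in principle be re-run whenever the underlying sequence or domain tables are refreshed, which is the property the abstract highlights. A production schema would likely also need indexes on the `protein` and `domain` columns, and possibly score thresholds on the domain predictions; neither is shown here.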
