Application of text-mining for updating protein post-translational modification annotation in UniProtKB
Abstract
The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles is being published on this topic. To help curators cope with this growing body of information, we have developed a system that extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large-scale proteomics experiments. The information retrieval and extraction method developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. The system can be accessed at http://eagl.unige.ch/PTM/.
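To make the pattern-matching idea concrete, the following is a minimal Python sketch of how sentences mentioning both a modification type and a modified site might be pulled from an abstract. It is only an illustration under stated assumptions: the trigger-word patterns, the residue/position pattern, and the function name `extract_ptm_sentences` are hypothetical and are not the rules or lexicon used by the system described here, which are not given in this abstract.

```python
import re

# Hypothetical trigger patterns (assumption): one regex per PTM type.
PTM_TRIGGERS = {
    "phosphorylation": r"phosphorylat(?:ion|ed|es?)",
    "acetylation": r"acetylat(?:ion|ed|es?)",
    "methylation": r"methylat(?:ion|ed|es?)",
}

# Hypothetical site pattern (assumption): a residue name followed by a
# position, e.g. "Ser-473", "Tyr15" or "lysine 9".
SITE = re.compile(
    r"\b(Ser|Thr|Tyr|Lys|Arg|serine|threonine|tyrosine|lysine|arginine)"
    r"[- ]?(\d+)\b",
    re.IGNORECASE,
)


def extract_ptm_sentences(text):
    """Return (ptm_type, sites, sentence) for each sentence that
    mentions both a PTM trigger word and at least one residue site."""
    hits = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        sites = [f"{m.group(1)}-{m.group(2)}" for m in SITE.finditer(sentence)]
        if not sites:
            continue  # no site mentioned, so nothing to report
        for ptm_type, pattern in PTM_TRIGGERS.items():
            if re.search(pattern, sentence, re.IGNORECASE):
                hits.append((ptm_type, sites, sentence))
    return hits


if __name__ == "__main__":
    example = (
        "Akt is phosphorylated at Ser-473 by mTORC2. "
        "The protein is abundant in liver. "
        "Histone H3 acetylation at lysine 9 marks active genes."
    )
    for ptm_type, sites, sentence in extract_ptm_sentences(example):
        print(ptm_type, sites, sentence)
```

A real curation tool would need much more than this sketch, for example protein name recognition to produce the ranked list of candidate proteins, and handling of negation and prefixed forms such as "dephosphorylation", which the simple substring patterns above would also match.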