Linking genes to literature: text mining, information extraction, and retrieval applications for biology

Genome Biology - Volume 9 - Pages 1-14 - 2008
Martin Krallinger1, Alfonso Valencia1, Lynette Hirschman2
1Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
2The MITRE Corporation, Bedford, USA

Abstract

Efficient access to the information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of results. The biological literature is also the main information source for the manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to automatically extract biologically relevant information from electronic texts. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for the life sciences in terms of the following: the type of biological information they address; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification of application types and techniques, together with the integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet at http://zope.bioinfo.cnio.es/bionlp_tools/.
