Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids

ClinTransMed, AB - Volume 4 - Pages 1-9 - 2014
Rick Jordan1, Shyam Visweswaran1,2,3, Vanathi Gopalakrishnan1,2,3
1Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, USA
2Intelligent Systems Program, University of Pittsburgh, Pittsburgh, USA
3Department of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, USA

Abstract

Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.
A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.
Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.
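The abstract does not state how the Z-scores were computed. As an illustration only, the sketch below assumes a pooled two-proportion z-statistic that compares how often a tagged entity appears in the positive versus the negative abstract set. The function names and the per-entity counts are hypothetical; only the two set sizes (34,296 and 2,653,396 for breast cancer) come from the text above.

```python
import math


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """Pooled two-proportion z-statistic (an assumed formula, not the paper's)."""
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        return 0.0  # entity absent from (or present in) every abstract
    return (p_pos - p_neg) / se


def rank_entities(pos_counts, neg_counts, pos_total, neg_total):
    """Return (entity, z) pairs sorted from most to least enriched."""
    scored = [
        (name, two_proportion_z(hits, pos_total, neg_counts.get(name, 0), neg_total))
        for name, hits in pos_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Set sizes reported for breast cancer in the abstract.
POS_TOTAL = 34_296
NEG_TOTAL = 2_653_396

# Hypothetical abstract counts per tagged entity, invented for illustration.
pos_counts = {"HER2": 1200, "albumin": 300}
neg_counts = {"HER2": 900, "albumin": 40_000}

ranked = rank_entities(pos_counts, neg_counts, POS_TOTAL, NEG_TOTAL)
for name, z in ranked:
    print(f"{name}: z = {z:.1f}")
```

With these made-up counts, an entity mentioned far more often in the positive set receives a large positive score, while one that is common across the biofluid literature in general scores below zero and falls to the bottom of the ranking.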
We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.
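The abstract does not define how biomarker specificity across biofluids was calculated. One simple reading, shown here purely as an assumption, is the fraction of a marker's positive-set abstracts that fall under each biofluid. The marker names and counts below are hypothetical placeholders, not results from the study.

```python
def biofluid_specificity(counts):
    """Map {marker: {biofluid: n_abstracts}} to per-biofluid fractions.

    Assumed definition: a marker's share of mentions in each biofluid.
    Markers with no mentions at all are skipped.
    """
    result = {}
    for marker, per_fluid in counts.items():
        total = sum(per_fluid.values())
        if total == 0:
            continue
        result[marker] = {fluid: n / total for fluid, n in per_fluid.items()}
    return result


# Hypothetical counts, invented for illustration.
example_counts = {
    "MARKER_A": {"serum": 8, "saliva": 2},
    "MARKER_B": {"urine": 5},
    "MARKER_C": {"serum": 0},
}

spec = biofluid_specificity(example_counts)
for marker, fractions in spec.items():
    print(marker, fractions)
```

Under this definition a marker reported in only one biofluid has a specificity of 1.0 for that fluid, while a marker spread across several fluids has its weight divided among them.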
