A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

Springer Science and Business Media LLC - 2015

Buzhou Tang¹, Ying Feng², Xiaolong Wang¹, Yonghui Wu³, Yaoyun Zhang³, Min Jiang³, Jingqi Wang³, Hua Xu³

¹Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China

²Department of Pharmacy, the First Affiliated Hospital, Harbin Medical University, Harbin, Heilongjiang, China

³School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA

Abstract

Background

Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity recognition systems, the Spanish National Cancer Research Center (CNIO) and the University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge contains two individual subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task.

Methods

The 2013 CHEMDNER challenge organizers provided 10,000 UTF-8 encoded PubMed abstracts, manually annotated according to a predefined annotation guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts and a test set of 3,000 abstracts. We developed two machine learning-based systems for the CEM task on this data set, one based on conditional random fields (CRF) and the other on structured support vector machines (SSVM). We also investigated the effects of three types of word representation (WR) features, generated by Brown clustering, random indexing and skip-gram, on both systems. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. The primary evaluation measures were micro-averaged precision, recall, and F-measure.

Results

Our best system was among the top-ranked systems, with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the performance of the system (micro F-measure of 85.20%).

Conclusions

The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and SSVM-based systems using all three types of WR features performed better than the systems using only one type of WR feature.
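The micro-averaged measures named in the abstract pool true positives, false positives and false negatives over all abstracts before computing precision, recall and F-measure. The sketch below illustrates that arithmetic only; it is not the official CHEMDNER evaluation script. The function name `micro_prf`, the span representation as `(start, end)` character offsets, and the exact-match criterion are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: a mention is a (start_offset, end_offset) pair.
Span = Tuple[int, int]


def micro_prf(
    gold: Dict[str, Set[Span]],
    pred: Dict[str, Set[Span]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over all documents.

    Counts are pooled across documents first; a predicted span counts as
    correct only if it matches a gold span exactly (an assumption here).
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)  # spans found in both gold and prediction
        fp += len(p - g)  # predicted spans with no gold match
        fn += len(g - p)  # gold spans that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy data, invented for this example only.
    gold = {"doc1": {(0, 7), (20, 29)}, "doc2": {(5, 12)}}
    pred = {"doc1": {(0, 7), (20, 25)}, "doc2": {(5, 12), (30, 34)}}
    p, r, f = micro_prf(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

Because counts are pooled before dividing, abstracts with many mentions weigh more heavily than abstracts with few, which is the difference between micro- and macro-averaging.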


