An adaptive annotation approach for biomedical entity and relation recognition

Brain Informatics - Volume 3 - Pages 157-168 - 2016
Seid Muhie Yimam1, Chris Biemann1, Ljiljana Majnaric2, Šefket Šabanović2, Andreas Holzinger3
1TU Darmstadt CS Department, FG Language Technology, Darmstadt, Germany
2Josip Juraj Strossmayer University of Osijek Faculty of Medicine Osijek, Osijek, Croatia
3Research Unit HCI-KDD, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Graz, Austria

Abstract

In this article, we demonstrate the impact of interactive machine learning: we develop a biomedical entity recognition dataset using a human-in-the-loop approach. In contrast to classical machine learning, human-in-the-loop approaches do not operate on predefined training or test sets, but assume that human input for improving the system is supplied iteratively. Here, during annotation, a machine learning model is built on previous annotations and used to propose labels for subsequent annotation. To demonstrate that such interactive and iterative annotation speeds up the development of high-quality annotated datasets, we conduct three experiments. In the first experiment, we carry out a simulation of iterative annotation and show that only a handful of medical abstracts need to be annotated to produce suggestions that increase annotation speed. In the second experiment, clinical doctors conduct a case study in which they annotate medical terms in documents relevant to their research. The third experiment explores the annotation of semantic relations with relation instance learning across documents. The experiments validate our method qualitatively and quantitatively, and give rise to a more personalized, responsive information extraction technology.
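The iterative loop described in the abstract (train on what has been annotated so far, propose labels for the next document, fold the annotator's corrections back in) can be illustrated with a minimal sketch. It is not the authors' system: a toy most-frequent-label lookup stands in for the real learner, and the function names, the label set (`DISEASE`, `PROTEIN`, `O`), and the example sentences are all hypothetical.

```python
from collections import Counter, defaultdict


def train(annotated):
    """Build a token -> most-frequent-label lookup from annotated sentences.

    Toy stand-in for a real sequence tagger (assumption, not the paper's model).
    """
    counts = defaultdict(Counter)
    for sentence in annotated:
        for token, label in sentence:
            counts[token.lower()][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def suggest(model, tokens):
    """Propose a label for each token; unseen tokens default to 'O' (outside)."""
    return [model.get(tok.lower(), "O") for tok in tokens]


def simulate(gold_sentences):
    """Simulate iterative annotation, using gold labels as the annotator's input.

    Returns the fraction of correct suggestions for each sentence in order.
    """
    annotated = []
    accuracy_per_round = []
    for sentence in gold_sentences:
        model = train(annotated)          # rebuild the model on all prior annotations
        tokens = [tok for tok, _ in sentence]
        gold = [label for _, label in sentence]
        proposed = suggest(model, tokens)
        correct = sum(p == g for p, g in zip(proposed, gold))
        accuracy_per_round.append(correct / len(gold))
        annotated.append(sentence)        # the corrected sentence joins the training data
    return accuracy_per_round


# Hypothetical hand-made examples, purely for illustration.
GOLD = [
    [("Obesity", "DISEASE"), ("raises", "O"), ("risk", "O")],
    [("Insulin", "PROTEIN"), ("and", "O"), ("obesity", "DISEASE")],
    [("Insulin", "PROTEIN"), ("raises", "O"), ("obesity", "DISEASE")],
]

if __name__ == "__main__":
    print(simulate(GOLD))
```

In this toy run, the suggestions for the third sentence are all correct because every one of its tokens was already labeled in an earlier round, which mirrors in miniature the claim that a small number of annotated documents can already yield useful suggestions.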
