Detection of sentence boundaries and abbreviations in clinical narratives
Abstract
In Western languages the period character is highly ambiguous because it serves both as a sentence delimiter and as an abbreviation marker. This matters particularly in clinical free text, which is characterized by numerous anomalies in spelling, punctuation and vocabulary and by a high frequency of short forms.

The problem is addressed with two binary classifiers, one for abbreviation detection and one for sentence boundary detection. For each classification task, a support vector machine with a linear kernel is trained on different combinations of feature sets. Feature relevance ranking is applied to investigate which features matter for each task. The methods are applied to German-language texts from a medical record system, authored by specialized physicians. Two collections of 3,024 text snippets were annotated regarding the role of period characters for training and testing.

Cohen's kappa for the annotation was 0.98. For abbreviation and sentence boundary detection, 10-fold cross-validation on the training set yielded an unweighted micro-averaged F-measure of 0.97. On the test set, the unweighted micro-averaged F-measure was 0.95 for abbreviation detection and 0.94 for sentence delineation. Language-dependent resources and rules were found to have less impact on abbreviation detection than on sentence delineation.

Sentence detection is an important task that should be performed at the beginning of a text processing pipeline. For the text genre under scrutiny, we showed that support vector machines with a linear kernel produce state-of-the-art results for sentence boundary detection. The results are comparable with those of other sentence boundary detection methods applied to English clinical texts. We identified abbreviation detection as a supportive task for sentence delineation.
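The two evaluation figures reported above, Cohen's kappa and the micro-averaged F-measure, can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions of both measures; the abstract does not spell out the exact computation used in the study. The label names `ABBR` and `EOS` and the toy annotations are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def micro_f1(gold, predicted, classes):
    """Micro-averaged F-measure: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in classes:
        for g, p in zip(gold, predicted):
            if p == c and g == c:
                tp += 1
            elif p == c:
                fp += 1
            elif g == c:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy annotations of ten period characters:
# ABBR = period marks an abbreviation, EOS = period ends a sentence.
annotator_a = ["ABBR", "ABBR", "EOS", "EOS", "ABBR", "EOS", "ABBR", "EOS", "EOS", "ABBR"]
annotator_b = ["ABBR", "ABBR", "EOS", "EOS", "ABBR", "EOS", "ABBR", "EOS", "EOS", "EOS"]

kappa = cohens_kappa(annotator_a, annotator_b)
f1 = micro_f1(annotator_a, annotator_b, classes=["ABBR", "EOS"])
print(f"kappa = {kappa:.2f}, micro-F1 = {f1:.2f}")
```

In this toy case the annotators agree on nine of ten items, giving an observed agreement of 0.9 against a chance agreement of 0.5, so kappa is 0.8. When every item carries exactly one label and all classes are pooled, the micro-averaged F-measure equals plain accuracy, here 0.9.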