NeighBERT: Medical Entity Linking Using Relation-Induced Dense Retrieval
Journal of Healthcare Informatics Research - pp. 1–17 - 2024
Abstract
One of the common tasks in clinical natural language processing is medical entity linking (MEL): detecting a mention in text and linking it to an entity in a knowledge base. MEL remains unsolved largely because of ambiguity in language, where the same span of text can resolve to several named entities; this problem is exacerbated in the text found in electronic health records. Recent work has shown that transformer-based deep learning models outperform previous linking methods. We introduce NeighBERT, a custom pre-training technique that extends BERT (Devlin et al. [1]) by encoding how entities are related within a knowledge graph. This relational context, traditionally missing from the original BERT, helps resolve the ambiguity found in clinical text. In our experiments, NeighBERT improves the precision, recall, and F1-score of the state of the art by 1–3 points for named entity recognition and 10–15 points for MEL on two widely known clinical datasets.
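To make the dense-retrieval framing of MEL concrete, the sketch below links a mention to the best-matching knowledge-base entity by embedding both and ranking candidates by cosine similarity. This is a minimal illustration of the general approach, not the paper's method: the `embed` function is a toy bag-of-words stand-in for a BERT-style encoder, and the UMLS CUIs and descriptions in `kb` are illustrative examples.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; in NeighBERT-style linking this would be
    # a dense transformer embedding of the mention or entity description.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(mention, kb):
    # Rank every candidate entity by similarity to the mention
    # embedding and return the best-scoring concept identifier.
    return max(kb, key=lambda cui: cosine(embed(mention), embed(kb[cui])))

# Illustrative knowledge base: concept ID -> textual description.
kb = {
    "C0020538": "hypertension high blood pressure",
    "C0011849": "diabetes mellitus high blood sugar",
}
print(link("patient has high blood pressure", kb))  # → C0020538
```

The ambiguity problem the abstract describes shows up here directly: both entities share the tokens "high" and "blood", and only the surrounding context of the mention tips the ranking, which is why richer (e.g. relation-aware) embeddings help.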
References
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423. Accessed 10 Jan 2024
Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, Mascio A, Zhu L, Folarin AA, Roberts A, Bendayan R, Richardson MP, Stewart R, Shah AD, Wong WK, Ibrahim Z, Teo JT, Dobson RJB (2021) Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit. Artif Intell Med 117:102083. https://doi.org/10.1016/j.artmed.2021.102083
Soldaini L, Goharian N (2016) QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, Sigir, pp. 1–4
Mohan S, Angell R, Monath N, McCallum A (2021) Low resource recognition and linking of biomedical concepts from a large ontology. In: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 1–10. https://doi.org/10.1145/3459930.3469524
Newman-Griffis D, Divita G, Desmet B, Zirikly A, Rose CP, Fosler-Lussier E (2021) Ambiguity in medical concept normalization: an analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc 28(3):516–532
Sang ETK, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147
Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(suppl 1):267–270
Wang Q, Mao Z, Wang B, Guo L (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Trans Knowl Data Eng 29(12):2724–2743
Uzuner O, South BR, Shen S, DuVall SL (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 18(5):552–556. https://doi.org/10.1136/amiajnl-2011-000203
Mohan S, Li D (2018) MedMentions: a large biomedical corpus annotated with UMLS concepts. In: Automated Knowledge Base Construction (AKBC)
Si Y, Wang J, Xu H, Roberts K (2019) Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 26(11):1297–1304
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguis 5:135–146
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-1202. https://aclanthology.org/N18-1202. Accessed 10 Jan 2024
Yang X, Bian J, Hogan WR, Wu Y (2020) Clinical concept extraction using transformers. J Am Med Inform Assoc 27(12):1935–1942
Fu S, Chen D, He H, Liu S, Moon S, Peterson KJ, Shen F, Wang L, Wang Y, Wen A et al (2020) Clinical concept extraction: a methodology review. J Biomed Inform 109:103526
Michalopoulos G, Wang Y, Kaka H, Chen H, Wong A (2021) UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1744–1753. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2021.naacl-main.139. https://aclanthology.org/2021.naacl-main.139. Accessed 10 Jan 2024
Ji Z, Wei Q, Xu H (2020) BERT-based ranking for biomedical entity normalization. AMIA Summits Transl Sci Proc 2020:269
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: Improving pre-training by representing and predicting spans. Trans Assoc Comput Linguis 8:64–77
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240
Nejadgholi I, Fraser KC, De Bruijn B, Li M, LaPlante A, El Abidine KZ (2019) Recognizing UMLS semantic types with deep learning. In: Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pp. 157–167
Peng Y, Yan S, Lu Z (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. BioNLP 2019:58
Harnoune A, Rhanoui M, Mikram M, Yousfi S, Elkaimbillah Z, El Asri B (2021) BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Comput Methods Programs Biomed Update 1:100042
Zhai Z, Nguyen DQ, Akhondi SA, Thorne C, Druckenbrodt C, Cohn T, Gregory M, Verspoor K (2019) Improving chemical named entity recognition in patents with contextualized word embeddings. arXiv preprint arXiv:1907.02679. Accessed 10 Jan 2024
Zhang T, Cai Z, Wang C, Qiu M, Yang B, He X (2021) SMedBERT: a knowledge-enhanced pre-trained language model with structured semantics for medical text mining. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5882–5893. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2021.acl-long.457. https://aclanthology.org/2021.acl-long.457. Accessed 10 Jan 2024
Liu F, Shareghi E, Meng Z, Basaldella M, Collier N (2021) Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4228–4238
Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26
Doddington GR, Mitchell A, Przybocki MA, Ramshaw LA, Strassel SM, Weischedel RM (2004) The automatic content extraction (ACE) program - tasks, data, and evaluation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, May 26–28, 2004, Lisbon, Portugal. European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2004/summaries/5.htm. Accessed 10 Jan 2024
Cohen J (2007) The GALE project: a description and an update. In: 2007 IEEE Workshop on automatic speech recognition & understanding (ASRU), pp. 237. IEEE. https://doi.org/10.1109/ASRU.2007.4430115
Todorovic BT, Rancic SR, Markovic IM, Mulalic EH, Ilic VM (2008) Named entity recognition and classification using context hidden Markov model. In: 2008 9th Symposium on Neural Network Applications in Electrical Engineering, pp. 43–46. IEEE
Cucchiarelli A, Velardi P (2001) Unsupervised named entity recognition using syntactic and semantic contextual evidence. Comput Linguist 27(1):123–131
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Li J, Sun A, Han J, Li C (2020) A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng 34(1):50–70
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Straková J, Straka M, Hajič J (2019) Neural architectures for nested NER through linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326–5331
Fei H, Ren Y, Zhang Y, Ji D, Liang X (2021) Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief Bioinform 22(3):110
Kotitsas S, Pappas D, Androutsopoulos I, McDonald R, Apidianaki M (2019) Embedding biomedical ontologies by jointly encoding network structure and textual node descriptors. In: Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 298–308
Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864
Park J, Kim K, Hwang W, Lee D (2019) Concept embedding to measure semantic relatedness for biomedical information ontologies. J Biomed Inform 94:103182
Lamurias A, Sousa D, Clarke LA, Couto FM (2019) BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies. BMC Bioinformatics 20(1):1–12
Beam AL, Kompa B, Schmaltz A, Fried I, Weber G, Palmer N, Shi X, Cai T, Kohane IS (2019) Clinical concept embeddings learned from massive sources of multimodal medical data. In: Pacific Symposium on Biocomputing 2020, pp. 295–306. World Scientific
Mao Y, Fung KW (2020) Use of word and graph embedding to measure semantic relatedness between unified medical language system concepts. J Am Med Inform Assoc 27(10):1538–1546
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Accessed 10 Jan 2024
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net. https://openreview.net/forum?id=H1eA7AEtvS. Accessed 10 Jan 2024
Fiorini N, Leaman R, Lipman DJ, Lu Z (2018) How user intelligence is improving PubMed. Nat Biotechnol 36(10):937–945
Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Sci data 3(1):1–9
Ramshaw L, Marcus M (1995) Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora. https://aclanthology.org/W95-0107. Accessed 10 Jan 2024
Dogan RI, Leaman R, Lu Z (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 47:1–10. https://doi.org/10.1016/j.jbi.2013.12.006
Li J, Sun Y, Johnson RJ, Sciaky D, Wei C, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database J Biol Databases Curation 2016. https://doi.org/10.1093/database/baw068
Vashishth S, Newman-Griffis D, Joshi R, Dutt R, Rosé CP (2021) Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. J Biomed Inform 121:103880. https://doi.org/10.1016/j.jbi.2021.103880
Fei H, Ji D, Li B, Liu Y, Ren Y, Li F (2021) Rethinking boundaries: End-to-end recognition of discontinuous mentions with pointer networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12785–12793
Muis AO, Lu W (2017) Labeling gaps between words: recognizing overlapping mentions with mention separators. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2608–2618
Li F, Lin Z, Zhang M, Ji D (2021) A span-based model for joint overlapped and discontinuous named entity recognition. In: Proceedings of the ACL