NeighBERT: Medical Entity Linking Using Relation-Induced Dense Retrieval
Journal of Healthcare Informatics Research - pp. 1–17 - 2024
Abstract
One of the common tasks in clinical natural language processing is medical entity linking (MEL): detecting a mention in text and linking it to an entity in a knowledge base. MEL remains unsolved largely because of ambiguity in language, where the same span of text can resolve to several named entities; this problem is exacerbated in the text found in electronic health records. Recent work has shown that transformer-based deep learning models outperform previous linking methods. We introduce NeighBERT, a custom pre-training technique that extends BERT (Devlin et al. [1]) by encoding how entities are related within a knowledge graph. This relational context, traditionally missing from the original BERT, helps resolve the ambiguity found in clinical text. In our experiments, NeighBERT improves the precision, recall, and F1-score of the state of the art by 1–3 points for named entity recognition and 10–15 points for MEL on two widely known clinical datasets.
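To make the dense-retrieval framing of MEL concrete, the sketch below links a mention to the best-matching knowledge-base entity by embedding both and ranking candidates by cosine similarity. This is a minimal illustration of the general approach, not the paper's method: the `embed` function is a toy bag-of-words stand-in for a BERT-style encoder, and the UMLS CUIs and descriptions in `kb` are illustrative examples.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; in NeighBERT-style linking this would be
    # a dense transformer embedding of the mention or entity description.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(mention, kb):
    # Rank every candidate entity by similarity to the mention
    # embedding and return the best-scoring concept identifier.
    return max(kb, key=lambda cui: cosine(embed(mention), embed(kb[cui])))

# Illustrative knowledge base: concept ID -> textual description.
kb = {
    "C0020538": "hypertension high blood pressure",
    "C0011849": "diabetes mellitus high blood sugar",
}
print(link("patient has high blood pressure", kb))  # → C0020538
```

The ambiguity problem the abstract describes shows up here directly: both entities share the tokens "high" and "blood", and only the surrounding context of the mention tips the ranking, which is why richer (e.g. relation-aware) embeddings help.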
References
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423. Accessed 10 Jan 2024
Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, Mascio A, Zhu L, Folarin AA, Roberts A, Bendayan R, Richardson MP, Stewart R, Shah AD, Wong WK, Ibrahim Z, Teo JT, Dobson RJB (2021) Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit. Artif Intell Med 117:102083. https://doi.org/10.1016/j.artmed.2021.102083
Soldaini L, Goharian N (2016) QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, Sigir, pp. 1–4
Mohan S, Angell R, Monath N, McCallum A (2021) Low resource recognition and linking of biomedical concepts from a large ontology. In: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 1–10. https://doi.org/10.1145/3459930.3469524
Newman-Griffis D, Divita G, Desmet B, Zirikly A, Rose CP, Fosler-Lussier E (2021) Ambiguity in medical concept normalization: an analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc 28(3):516–532
Sang ETK, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147
Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(suppl 1):267–270
Wang Q, Mao Z, Wang B, Guo L (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Trans Knowl Data Eng 29(12):2724–2743
Uzuner O, South BR, Shen S, DuVall SL (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 18(5):552–556. https://doi.org/10.1136/amiajnl-2011-000203
Mohan S, Li D (2018) MedMentions: a large biomedical corpus annotated with UMLS concepts. In: Automated Knowledge Base Construction (AKBC)
Si Y, Wang J, Xu H, Roberts K (2019) Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 26(11):1297–1304
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguis 5:135–146
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-1202. https://aclanthology.org/N18-1202. Accessed 10 Jan 2024
Yang X, Bian J, Hogan WR, Wu Y (2020) Clinical concept extraction using transformers. J Am Med Inform Assoc 27(12):1935–1942
Fu S, Chen D, He H, Liu S, Moon S, Peterson KJ, Shen F, Wang L, Wang Y, Wen A et al (2020) Clinical concept extraction: a methodology review. J Biomed Inform 109:103526
Michalopoulos G, Wang Y, Kaka H, Chen H, Wong A (2021) UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1744–1753. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2021.naacl-main.139. https://aclanthology.org/2021.naacl-main.139. Accessed 10 Jan 2024
Ji Z, Wei Q, Xu H (2020) BERT-based ranking for biomedical entity normalization. AMIA Summits Transl Sci Proc 2020:269
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: Improving pre-training by representing and predicting spans. Trans Assoc Comput Linguis 8:64–77
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240
Nejadgholi I, Fraser KC, De Bruijn B, Li M, LaPlante A, El Abidine KZ (2019) Recognizing UMLS semantic types with deep learning. In: Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pp. 157–167
Peng Y, Yan S, Lu Z (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. BioNLP 2019:58
Harnoune A, Rhanoui M, Mikram M, Yousfi S, Elkaimbillah Z, El Asri B (2021) BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Comput Methods Programs Biomed Update 1:100042
Zhai Z, Nguyen DQ, Akhondi SA, Thorne C, Druckenbrodt C, Cohn T, Gregory M, Verspoor K (2019) Improving chemical named entity recognition in patents with contextualized word embeddings. arXiv preprint arXiv:1907.02679. Accessed 10 Jan 2024
Zhang T, Cai Z, Wang C, Qiu M, Yang B, He X (2021) SMedBERT: a knowledge-enhanced pre-trained language model with structured semantics for medical text mining. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5882–5893. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2021.acl-long.457. https://aclanthology.org/2021.acl-long.457. Accessed 10 Jan 2024
Liu F, Shareghi E, Meng Z, Basaldella M, Collier N (2021) Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4228–4238
Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26
Doddington GR, Mitchell A, Przybocki MA, Ramshaw LA, Strassel SM, Weischedel RM (2004) The automatic content extraction (ACE) program - tasks, data, and evaluation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, May 26–28, 2004, Lisbon, Portugal. European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2004/summaries/5.htm. Accessed 10 Jan 2024
Cohen J (2007) The GALE project: a description and an update. In: 2007 IEEE Workshop on automatic speech recognition & understanding (ASRU), pp. 237. IEEE. https://doi.org/10.1109/ASRU.2007.4430115
Todorovic BT, Rancic SR, Markovic IM, Mulalic EH, Ilic VM (2008) Named entity recognition and classification using context hidden Markov model. In: 2008 9th Symposium on Neural Network Applications in Electrical Engineering, pp. 43–46. IEEE
Cucchiarelli A, Velardi P (2001) Unsupervised named entity recognition using syntactic and semantic contextual evidence. Comput Linguist 27(1):123–131
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Li J, Sun A, Han J, Li C (2020) A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng 34(1):50–70
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Straková J, Straka M, Hajič J (2019) Neural architectures for nested NER through linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326–5331
Fei H, Ren Y, Zhang Y, Ji D, Liang X (2021) Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief Bioinform 22(3):110
Kotitsas S, Pappas D, Androutsopoulos I, McDonald R, Apidianaki M (2019) Embedding biomedical ontologies by jointly encoding network structure and textual node descriptors. In: Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 298–308
Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864
Park J, Kim K, Hwang W, Lee D (2019) Concept embedding to measure semantic relatedness for biomedical information ontologies. J Biomed Inform 94:103182
Lamurias A, Sousa D, Clarke LA, Couto FM (2019) BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies. BMC Bioinformatics 20(1):1–12
Beam AL, Kompa B, Schmaltz A, Fried I, Weber G, Palmer N, Shi X, Cai T, Kohane IS (2019) Clinical concept embeddings learned from massive sources of multimodal medical data. In: Pacific Symposium on Biocomputing 2020, pp. 295–306. World Scientific
Mao Y, Fung KW (2020) Use of word and graph embedding to measure semantic relatedness between unified medical language system concepts. J Am Med Inform Assoc 27(10):1538–1546
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Accessed 10 Jan 2024
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net. https://openreview.net/forum?id=H1eA7AEtvS. Accessed 10 Jan 2024
Fiorini N, Leaman R, Lipman DJ, Lu Z (2018) How user intelligence is improving PubMed. Nat Biotechnol 36(10):937–945
Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Sci data 3(1):1–9
Ramshaw L, Marcus M (1995) Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora. https://aclanthology.org/W95-0107. Accessed 10 Jan 2024
Dogan RI, Leaman R, Lu Z (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 47:1–10. https://doi.org/10.1016/j.jbi.2013.12.006
Li J, Sun Y, Johnson RJ, Sciaky D, Wei C, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database J Biol Databases Curation 2016. https://doi.org/10.1093/database/baw068
Vashishth S, Newman-Griffis D, Joshi R, Dutt R, Rosé CP (2021) Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. J Biomed Inform 121:103880. https://doi.org/10.1016/j.jbi.2021.103880
Fei H, Ji D, Li B, Liu Y, Ren Y, Li F (2021) Rethinking boundaries: End-to-end recognition of discontinuous mentions with pointer networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12785–12793
Muis AO, Lu W (2017) Labeling gaps between words: recognizing overlapping mentions with mention separators. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2608–2618
Li F, Lin Z, Zhang M, Ji D (2021) A span-based model for joint overlapped and discontinuous named entity recognition. In: Proceedings of the ACL