Compressed large language model embedded datasets of ICD-10-CM descriptions

BMC Bioinformatics - Volume 24 - Pages 1-13 - 2023

Michael J. Kane¹, Casey King²,³, Denise Esserman¹, Nancy K. Latham⁴, Erich J. Greene¹, David A. Ganz⁵

¹Department of Biostatistics, School of Public Health, Yale University, New Haven, USA

²The Jackson School of Global Affairs, Yale University, New Haven, USA

³US Healthcare and Life Sciences Microsoft, Redmond, USA

⁴Research Program in Men’s Health: Aging and Metabolism, Boston Claude D. Pepper Older Americans Independence Center for Function Promoting Therapies, Brigham and Women’s Hospital, Boston, USA

⁵Department of Medicine, VA Greater Los Angeles/UCLA, Los Angeles, USA

Abstract

This paper presents novel datasets that provide numerical representations of ICD-10-CM codes. The representations are generated by embedding each code description with a large language model and then reducing the dimension of the embeddings with an autoencoder. The embeddings serve as informative input features for machine learning models because they capture relationships among categories and preserve inherent contextual information. The model that generates the data was validated in two ways. First, the dimension reduction was validated using the autoencoder. Second, a supervised model was built to estimate the ICD-10-CM hierarchical categories. The results show that the data can be reduced to as few as 10 dimensions while retaining the ability to reproduce the original embeddings, with fidelity decreasing as the size of the reduced representation decreases. Multiple compression levels are provided, so users can choose the level that fits their requirements, then download and use the data without any additional setup. The readily available datasets of ICD-10-CM codes are expected to be highly valuable to researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
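As a rough illustration of how compressed embeddings might be used as features, the sketch below reads code vectors from a CSV and finds the code most similar to a query by cosine similarity. The file layout (`code` followed by `dim_1 … dim_k` columns), the three-dimensional vectors, and the helper names `load_embeddings`, `cosine_similarity`, and `nearest_code` are all assumptions made for this example. They are not the published dataset format or the authors' code, and the numbers are invented.

```python
import csv
import io
import math

# Hypothetical file layout (an assumption, not the published format):
# one row per ICD-10-CM code, followed by the compressed embedding values.
# The numbers below are made up for illustration only.
TOY_CSV = """code,dim_1,dim_2,dim_3
E11.9,0.9,0.1,0.2
E11.65,0.8,0.2,0.1
S72.001A,-0.1,0.9,-0.4
"""


def load_embeddings(handle):
    """Read rows of `code, dim_1, ..., dim_k` into a dict of float lists."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: [float(v) for v in row[1:]] for row in reader if row}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_code(query, embeddings):
    """Return the code (other than `query`) most similar to `query`."""
    others = (c for c in embeddings if c != query)
    return max(
        others,
        key=lambda c: cosine_similarity(embeddings[query], embeddings[c]),
    )


embeddings = load_embeddings(io.StringIO(TOY_CSV))
print(nearest_code("E11.9", embeddings))
```

With these toy vectors, the two diabetes codes (E11.9 and E11.65) point in nearly the same direction, while the fracture code (S72.001A) does not. That is the kind of category relationship the abstract says the embeddings capture. With real data, the same functions could take an open file handle in place of the in-memory string.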

Keywords

#ICD-10-CM #numerical representation #large language model embeddings #dimension reduction #machine learning #biomedical informatics

