Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies

Journal of the American Medical Informatics Association - Volume 27, Issue 10 - Pages 1593-1599 - 2020
Laila Rasmy, Firat Tiryaki, Yujia Zhou, Yang Xiang, Cui Tao, Hua Xu, Degui Zhi
School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas, USA

Abstract

Objective: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations affect the performance of predictive models, especially in the context of machine learning and deep learning.

Materials and Methods: We projected the input diagnosis data in the Cerner HealthFacts database to the Unified Medical Language System (UMLS) and 5 other terminologies (CCS, CCSR, ICD-9, ICD-10, and PheWAS) and evaluated the prediction performance of these terminologies on 2 tasks: risk prediction of heart failure in diabetes patients and risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network.

Results: For logistic regression, UMLS delivered the best area under the receiver operating characteristic curve (AUROC) in both the heart failure in diabetes patients (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for prediction of heart failure in diabetes patients.

Discussion/Conclusion: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performance. In particular, UMLS was consistently one of the best performers. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.
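To make the idea of projecting diagnoses onto a terminology concrete, the following is a minimal sketch in plain Python and not the authors' pipeline. It maps a few ICD-10-CM codes onto a coarser grouping, builds multi-hot feature vectors under each vocabulary, and computes AUROC from pairwise comparisons. The mapping `TOY_GROUPER`, its category labels, and the function names `project`, `multi_hot`, and `auroc` are all hypothetical and chosen only for illustration; they do not reproduce real CCS, CCSR, PheWAS, or UMLS content.

```python
# Toy sketch: project diagnosis codes onto a coarser terminology and score with AUROC.
# TOY_GROUPER and its category labels are hypothetical, not real CCS/CCSR/PheWAS content.

TOY_GROUPER = {
    "E11.9": "diabetes",
    "E11.65": "diabetes",
    "I10": "hypertension",
    "I50.9": "heart_failure",
}


def project(codes, mapping, unmapped="UNMAPPED"):
    """Map each source code to its target term; unknown codes get a placeholder."""
    return [mapping.get(code, unmapped) for code in codes]


def multi_hot(codes, vocab):
    """Return a 0/1 vector marking which vocabulary terms occur in `codes`."""
    present = set(codes)
    return [1 if term in present else 0 for term in vocab]


def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    visit = ["E11.9", "E11.65", "I10"]
    fine_vocab = sorted(TOY_GROUPER)                  # 4 distinct source codes
    coarse_vocab = sorted(set(TOY_GROUPER.values()))  # 3 grouped terms
    print(multi_hot(visit, fine_vocab))
    print(multi_hot(project(visit, TOY_GROUPER), coarse_vocab))
    print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

In this toy case the two diabetes codes collapse into a single feature under the coarser grouping, which is the kind of information loss the comparison between larger and smaller vocabularies is probing.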
