Imputation of Missing Data in Electronic Health Records Based on Patients’ Similarities

Journal of Healthcare Informatics Research, Volume 4, Pages 295-307, 2020
Ali Jazayeri, Ou Stella Liang, Christopher C. Yang
College of Computing & Informatics, Drexel University, Philadelphia, USA

Abstract

Electronic health records (EHR) are an increasingly common source of data for mining and analyzing health conditions. However, because of irregular observation times and other uncertainties inherent in medical settings, EHR data sets contain a large number of missing values, while most traditional data mining and machine learning approaches are designed to operate on complete data. In this paper, we propose a novel imputation method for missing data that makes these approaches applicable to EHR data. The imputation is based on a set of multivariate similarities among patients. For a data point missing from a patient's lab results during an intensive care unit stay, the method ranks the other patients by their similarity to that patient (the ego patient) in terms of lab values. The missing value is then estimated as a weighted average of the known values of the same laboratory test from the other patients, with the similarities serving as weights. A comparison with values estimated by several common and state-of-the-art methods, such as MICE and 3D-MICE, shows that the proposed method outperforms them and produces promising results.
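The weighted-average step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the similarity measure, so the `similarity` function here (inverse of the root-mean-square difference over lab tests that both patients have) is an assumption, as are the names `impute_value`, `labs`, and the neighbor count `k`.

```python
import numpy as np


def similarity(a, b):
    """Assumed similarity: 1 / (1 + RMS difference) over labs both patients have."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return 0.0  # no lab test in common, so no evidence of similarity
    rms = np.sqrt(np.mean((a[shared] - b[shared]) ** 2))
    return 1.0 / (1.0 + rms)


def impute_value(labs, patient, test, k=2):
    """Estimate labs[patient, test] from the k most similar patients.

    labs is a (patients x lab tests) array with NaN marking missing values.
    Only patients with a known value for `test` are candidates.
    """
    candidates = [
        (similarity(labs[patient], labs[other]), labs[other, test])
        for other in range(labs.shape[0])
        if other != patient and not np.isnan(labs[other, test])
    ]
    # Rank candidates by similarity to the ego patient and keep the top k.
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    weights = np.array([w for w, _ in top])
    values = np.array([v for _, v in top])
    if weights.sum() == 0:
        return float("nan")  # nothing usable to average
    # Weighted average of known values, with similarities as weights.
    return float(np.dot(weights, values) / weights.sum())


# Hypothetical toy data: patient 0 is missing lab test 2.
labs = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 10.0],   # identical on shared labs, similarity 1.0
    [2.0, 3.0, 20.0],   # RMS difference 1, similarity 0.5
    [9.0, 9.0, 99.0],   # far away, excluded when k=2
])

estimate = impute_value(labs, patient=0, test=2, k=2)
print(round(estimate, 2))  # 13.33
```

In this toy case the two nearest patients contribute 10 and 20 with weights 1.0 and 0.5, giving (1.0 * 10 + 0.5 * 20) / 1.5 ≈ 13.33. Real lab tests are measured on different scales, so a practical version would likely standardize each test before computing distances; that step is omitted here for brevity.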

References

Ajami S, Bagheri-Tadi T (2013) Barriers for adopting electronic health records (EHRs) by physicians. Acta Informatica Medica 21(2):129. https://doi.org/10.5455/aim.2013.21.129-134
Azur MJ, Stuart EA, Frangakis C, Leaf PJ (2011) Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 20(1):40–49. https://doi.org/10.1002/mpr.329
van Buuren S, Groothuis-Oudshoorn K (2011) MICE: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67. https://doi.org/10.18637/jss.v045.i03
Che Z, Purushotham S, Cho K, Sontag D, Liu Y (2018) Recurrent neural networks for multivariate time series with missing values. Sci Rep 8(1):6085–12. https://doi.org/10.1038/s41598-018-24271-9
Dhevi AS (2014) Imputing missing values using inverse distance weighted interpolation for time series data. In: 2014 Sixth international conference on advanced computing (ICoAC), pp 255–259. https://doi.org/10.1109/ICoAC.2014.7229721
Gheyas IA, Smith LS (2010) A neural network-based framework for the reconstruction of incomplete data sets. Neurocomputing 73(16):3039–3065. https://doi.org/10.1016/j.neucom.2010.06.021
Hripcsak G, Albers DJ (2012) Next-generation phenotyping of electronic health records. J Am Med Inform Assoc 20(1):117–121. https://doi.org/10.1136/amiajnl-2012-001145
Jerez JM, Molina I, García-Laencina PJ, Alba E, Ribelles N, Martín M, Franco L (2010) Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med 50(2):105–115. https://doi.org/10.1016/j.artmed.2010.05.002
Johnson AEW, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3(1):160035. https://doi.org/10.1038/sdata.2016.35
Lee J, Maslove DM, Dubin JA (2015) Personalized mortality prediction driven by electronic medical data and a patient similarity metric. PLoS One 10(5):1–13. https://doi.org/10.1371/journal.pone.0127428
Lipton ZC, Kale DC, Wetzel R (2016) Modeling missing data in clinical time series with RNNs. arXiv:1606.04130. https://arxiv.org/abs/1606.04130
Luo Y, Szolovits P, Dighe AS, Baron JM (2017) 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc 25(6):645–653. https://doi.org/10.1093/jamia/ocx133
Menachemi N, Collum TH (2011) Benefits and drawbacks of electronic health record systems. Risk Manag Healthcare Policy 4:47. https://doi.org/10.2147/RMHP.S12985
Moritz S, Bartz-Beielstein T (2017) ImputeTS: time series missing value imputation in R. R J 9(1):207–218
Peissig PL, Rasmussen LV, Berg RL, Linneman JG, McCarty CA, Waudby C, Chen L, Denny JC, Wilke RA, Pathak J, Carrell D, Kho AN, Starren JB (2012) Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc 19(2):225–234. https://doi.org/10.1136/amiajnl-2011-000456
Rahman R, Reddy CK (2015) Electronic health records: a survey. Healthcare Data Analytics 36:21
Rasmussen CE (2003) Gaussian processes in machine learning. In: Summer school on machine learning. Springer, pp 63–71
Strike K, El Emam K, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908. https://doi.org/10.1109/32.962560
Wells BJ, Kattan MW, Nowacki AS, Chagin K (2013) Strategies for handling missing data in electronic health record derived data. eGEMs (Generating Evidence & Methods to improve patient outcomes) 1(3):1035. https://doi.org/10.13063/2327-9214.1035
Zeileis A, Grothendieck G (2005) zoo: S3 infrastructure for regular and irregular time series. J Stat Softw 14(6):1–27. https://doi.org/10.18637/jss.v014.i06