Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients

Journal of Big Data - Volume 4 - Pages 1-18 - 2017
Christopher Baechle1, Ankur Agarwal1, Xingquan Zhu1
1Department of Computer & Electrical Engineering and Computer Science, College of Engineering, Florida Atlantic University, Boca Raton, USA

Abstract

Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences the results. A method that better separates occurrence within COPD patients from population prevalence is therefore desirable. Large hospital systems may hold tens of millions of patient records collected over decades, so the approach must also scale to big data. Co-Occurring Evidence Discovery (COED) is a methodology and framework that addresses these issues.

Natural Language Processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure the clinical notes, and several extensions to cTAKES have been written to parallelize the annotation of large sets of notes. A co-occurrence score that penalizes terms according to their prevalence is presented alongside a baseline method traditionally used for finding co-occurrence. Both scoring systems are implemented using Apache Spark. Dictionaries of ground truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and the baseline method are compared using precision, recall, and F1 score.

The highest scoring diseases using COED are lung and respiratory diseases. In contrast, the baseline method ranks diseases with high population prevalence highest. Medications and symptoms evaluated with COED show a similar pattern. When evaluated against the ground truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174, respectively. The maximum improvements in precision for symptoms, diseases, and medications were 0.303, 0.333, and 0.180, respectively. The median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%, respectively. A paired t-test found the F1 score increases to be statistically significant (p < 0.01).

Penalizing terms that occur frequently in the corpus improves precision and recall and gives a better picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven one, large dictionaries of COPD-related terms can be assembled in a short amount of time.
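
The contrast between a raw co-occurrence count and a prevalence-penalized score can be illustrated with a small sketch. The abstract does not state the COED formula, so the IDF-style penalty used here (co-occurrence count multiplied by log(N / document frequency)) is an assumption chosen only to show the idea. The function names, the toy notes, and the ground truth set are hypothetical, and plain Python stands in for the Apache Spark implementation described above.

```python
from collections import Counter
from math import log

def baseline_scores(notes, target):
    """Raw co-occurrence: number of notes mentioning `target` that also mention each term."""
    counts = Counter()
    for terms in notes:
        if target in terms:
            counts.update(t for t in set(terms) if t != target)
    return dict(counts)

def penalized_scores(notes, target):
    """Assumed penalty (not the paper's formula): scale each count by log(N / df)."""
    n = len(notes)
    # Document frequency: number of notes in the whole corpus that contain each term.
    df = Counter(t for terms in notes for t in set(terms))
    return {t: c * log(n / df[t]) for t, c in baseline_scores(notes, target).items()}

def rank(scores):
    """Terms sorted by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))

def precision_recall_f1(ranked, truth, k):
    """Precision, recall, and F1 of the top-k ranked terms against a ground truth set."""
    hits = sum(1 for t in ranked[:k] if t in truth)
    precision = hits / k
    recall = hits / len(truth)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy corpus (hypothetical): each note is the set of terms annotated in it.
notes = [
    {"copd", "emphysema", "hypertension"},
    {"copd", "emphysema", "hypertension"},
    {"copd", "bronchitis", "hypertension"},
    {"hypertension", "diabetes"},
    {"hypertension"},
    {"diabetes", "hypertension"},
]
truth = {"emphysema", "bronchitis"}  # hypothetical ground truth dictionary

baseline_rank = rank(baseline_scores(notes, "copd"))
penalized_rank = rank(penalized_scores(notes, "copd"))

print("baseline :", baseline_rank, precision_recall_f1(baseline_rank, truth, k=2))
print("penalized:", penalized_rank, precision_recall_f1(penalized_rank, truth, k=2))
```

In this toy corpus "hypertension" appears in every note, so the raw count ranks it first even though it says little about COPD specifically, while the penalized score drops it to the bottom and the respiratory terms rise to the top.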
