Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records

Andrea Fernandes1, Danielle Cloete2, Matthew Broadbent1, Richard D. Hayes1, Chin‐Kuo Chang1, Roy Jackson2, Angus Roberts3, Jason Tsang1, Murat Soncul2, Jennifer Liebscher1, Robert Stewart1, Felicity Callard
1King's College London Institute of Psychiatry, London, UK
2South London and Maudsley NHS Foundation Trust, London, UK
3University of Sheffield Department of Computer Science, Sheffield, UK

Abstract

Background: Electronic health records (EHRs) offer enormous potential for health research but also present data governance challenges. Ensuring de-identification is a prerequisite for using EHR data without prior consent. The South London and Maudsley NHS Trust (SLaM), one of the largest secondary mental healthcare providers in Europe, has developed from its EHRs a de-identified psychiatric case register, the Clinical Record Interactive Search (CRIS), for secondary research.

Methods: We describe the development, implementation and evaluation of a bespoke de-identification algorithm used to create the register. It is designed to create dictionaries from patient identifiers (PIs) entered into dedicated source fields and then to identify, match and mask them (with ZZZZZ) when they appear in medical texts. We deemed that this approach would be effective, given the high coverage of PIs in the dedicated fields and the effectiveness of the masking combined with elements of a security model. We conducted two separate performance tests: (i) to test the performance of the algorithm in masking individual true PIs entered in dedicated fields and then found in text (using 500 patient notes), and (ii) to compare the performance of the CRIS pattern-matching algorithm with that of a machine learning algorithm, the MITRE Identification Scrubber Toolkit (MIST), using 70 patient notes (50 to train on, 20 to test on). We also report any instances of potential breaches, defined as occurrences of 3 or more true or apparent PIs in the same patient's notes (and in an additional set of longitudinal notes for 50 patients), and we consider the possibility of inferring information despite de-identification.

Results: True PIs were masked with 98.8% precision and 97.6% recall. As anticipated, potential PIs did appear, owing to misspellings entered within the EHRs. We found one potential breach. In a separate performance test, with a different set of notes, CRIS yielded 100% precision and 88.5% recall, while MIST yielded 95.1% and 78.1%, respectively. We discuss how we overcome the realistic, albeit low-probability, possibility of potential breaches through implementation of the security model.

Conclusion: CRIS is a de-identified psychiatric database sourced from EHRs, which protects patient anonymity and maximises the data available for research. CRIS demonstrates the advantage of combining an effective de-identification algorithm with a carefully designed security model. The paper advances much-needed discussion of EHR de-identification, particularly in relation to the criteria used to assess de-identification and to the contexts of de-identified research databases when assessing the risk of breaches of confidential patient information.
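The dictionary-and-mask approach described in the Methods can be illustrated with a minimal Python sketch. This is not the CRIS implementation: the function names (`build_dictionary`, `mask_text`, `precision_recall`), the whole-word, case-insensitive matching rule and the sample identifiers are all assumptions made for illustration. Only the ZZZZZ mask and the general idea of matching identifiers taken from dedicated fields come from the abstract.

```python
import re

MASK = "ZZZZZ"  # mask string named in the abstract


def build_dictionary(identifiers):
    """Collect non-empty identifier strings (hypothetical helper).

    Longer terms are placed first so that a full address is tried
    before any shorter term that could match inside it.
    """
    terms = {t.strip() for t in identifiers if t and t.strip()}
    return sorted(terms, key=len, reverse=True)


def mask_text(text, dictionary):
    """Replace every whole-word, case-insensitive dictionary hit with MASK.

    The matching rule is an assumption; the abstract does not specify it.
    """
    if not dictionary:
        return text
    alternatives = "|".join(re.escape(t) for t in dictionary)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(MASK, text)


def precision_recall(tp, fp, fn):
    """Standard precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented identifiers standing in for values from dedicated source fields.
dictionary = build_dictionary(["Jane", "Doe", "12 Example Road", ""])
note = "Jane Doe lives at 12 example road. Janet visited. Jnae was seen."
masked = mask_text(note, dictionary)
```

In this sketch "Janet" is left alone because matching is whole-word, and the misspelt "Jnae" is also left unmasked. The latter mirrors the limitation reported in the Results, where potential PIs appeared because of misspellings in the EHRs, and is one reason exact dictionary matching is paired with a security model.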
