Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

BMC Bioinformatics, Volume 24, Issue 1
Haitham Elmarakeby1, Pavel Trukhanov2, Vidal M. Arroyo3, Irbaz Bin Riaz4, Deborah Schrag5, Eliezer M. Van Allen6, Kenneth L. Kehl6
1Dana-Farber Cancer Institute, Boston, MA, USA
2Dana-Farber Cancer Institute, Boston, MA, USA
3Stanford University, Stanford, CA, USA
4Dana-Farber Cancer Institute, Boston, MA, USA
5Memorial Sloan Kettering Cancer Center, New York, NY, USA
6Harvard Medical School, Boston, MA, USA

Abstract

Background: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has yielded substantial improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies for extracting cancer outcomes from unstructured text.

Results: We evaluated the performance of nine NLP models on two tasks: identifying cancer response and identifying cancer progression within imaging reports from a single academic center, among patients with non-small cell lung cancer. We trained the classification models under varying conditions, including training sample size, classification architecture, and language model pre-training, using a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model on an unlabeled dataset of 662,579 reports from 27,483 patients with cancer at our center. A classifier based on DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments, but these results were only marginally better than those of simpler "bag of words" or convolutional neural network models.

Conclusion: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance on such tasks.
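To make the training strategy concrete, the sketch below outlines the two stages the abstract describes: domain-adaptive masked-language-model (MLM) pre-training of a BERT encoder on unlabeled imaging reports (the step that produced DFCI-ImagingBERT), followed by fine-tuning the adapted encoder as a report-level classifier. This is a minimal illustration using the HuggingFace `transformers` and `datasets` libraries; the base checkpoint (`bert-base-cased`), the file name `reports.txt`, and all hyperparameters are placeholders, not the authors' exact configuration.

```python
# Stage 1 (sketch): domain-adaptive MLM pre-training on unlabeled imaging
# reports, in the spirit of DFCI-ImagingBERT. Paths and settings are
# illustrative placeholders.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# One imaging report per line in a plain-text file (hypothetical).
reports = load_dataset("text", data_files={"train": "reports.txt"})["train"]
tokenized = reports.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dfci-imaging-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens per batch for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
trainer.save_model("dfci-imaging-bert")
tokenizer.save_pretrained("dfci-imaging-bert")

# Stage 2 (sketch): fine-tune the adapted encoder as a binary classifier
# (e.g., progression vs. no progression) on the labeled reports, using a
# second Trainer over the labeled dataset.
from transformers import AutoModelForSequenceClassification
classifier = AutoModelForSequenceClassification.from_pretrained(
    "dfci-imaging-bert", num_labels=2)
```

For comparison, a "bag of words" baseline of the kind the study benchmarks against can be a few lines of scikit-learn. Here `train_texts`, `train_labels`, `test_texts`, and `test_labels` are assumed to be in-memory lists of report strings and binary labels; the names and feature settings are hypothetical.

```python
# Hypothetical bag-of-words baseline: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
probs = baseline.predict_proba(test_texts)[:, 1]
print("AUROC:", roc_auc_score(test_labels, probs))
```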
