Integration of IDPC Clustering Analysis and Interpretable Machine Learning for Survival Risk Prediction of Patients with ESCC

Dan Ling1, Anhao Liu1, Junwei Sun1, Yanfeng Wang1, Lidong Wang2, Xin Song2, Xueke Zhao2
1Henan Key Lab of Information-based Electrical Appliances, Zhengzhou University of Light Industry, Zhengzhou, China
2State Key Laboratory of Esophageal Cancer Prevention and Treatment and Henan Key Laboratory for Esophageal Cancer Research of The First Affiliated Hospital, Zhengzhou University, Zhengzhou, China

Abstract

Accurate survival risk prediction is pivotal for understanding and forecasting the prognosis of patients with esophageal squamous cell carcinoma (ESCC). Existing methods suffer from insufficient fitting ability and poor interpretability. To address these issues, this work proposes a novel interpretable survival risk prediction method for ESCC patients based on extreme gradient boosting improved by the whale optimization algorithm (WOA-XGBoost) and Shapley additive explanations (SHAP). Because the data set is imbalanced, adaptive synthetic sampling (ADASYN) is first used to generate additional samples with high survival risk. Then, an improved clustering by fast search and find of density peaks (IDPC) algorithm based on cosine distance and K-nearest neighbors is used to cluster the patients. Next, a prediction model is built for each cluster with WOA-XGBoost, and the constructed model is visualized with SHAP to uncover the factors underlying its predictions and improve the interpretability of the black-box model. Finally, the effectiveness of the proposed scheme is demonstrated on data collected from the First Affiliated Hospital of Zhengzhou University. The results show that the proposed method performs well, achieving an area under the receiver operating characteristic curve (AUROC) of 0.918 and an accuracy of 0.881.
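The clustering step can be illustrated with a minimal density-peaks sketch that uses cosine distance and a K-nearest-neighbor local density. The abstract does not give the exact IDPC formulation, so the density definition below (the exponential of the negative mean distance to the K nearest neighbors), the center-selection rule, the function names, and the toy data are all assumptions made for illustration rather than the authors' implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b), clamped at 0 against rounding error."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return max(0.0, 1.0 - dot / (na * nb))


def density_peaks_knn(points, k, n_clusters):
    """Hypothetical density-peaks clustering with a KNN-based density."""
    n = len(points)
    dist = [[cosine_distance(points[i], points[j]) for j in range(n)]
            for i in range(n)]

    # Local density rho_i (assumed form): exp(-mean distance to k nearest neighbors).
    rho = []
    for i in range(n):
        knn = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(knn) / len(knn)))

    # delta_i: distance to the nearest point of higher density.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: use its largest distance
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i] = dist[i][j]
        parent[i] = j

    # Cluster centers: the n_clusters points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Remaining points inherit the label of their nearest denser neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Toy feature vectors (made up): two groups pointing in different directions.
sample = [
    (1.0, 0.10), (2.0, 0.25), (3.0, 0.20), (1.5, 0.20),
    (0.10, 1.0), (0.25, 2.0), (0.30, 2.5), (0.10, 1.5),
]
labels, centers = density_peaks_knn(sample, k=2, n_clusters=2)
print(labels, centers)
```

Cosine distance compares the direction of feature vectors rather than their magnitude, and a KNN-based density avoids the cutoff-distance parameter of the original density-peaks algorithm. In the pipeline described above, a separate predictor would then be trained on each resulting cluster.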
