Integration of IDPC Clustering Analysis and Interpretable Machine Learning for Survival Risk Prediction of Patients with ESCC
Tóm tắt
Precise forecasting of survival risk plays a pivotal role in comprehending and predicting the prognosis of patients afflicted with esophageal squamous cell carcinoma (ESCC). The existing methods have the problems of insufficient fitting ability and poor interpretability. To address this issue, this work proposes a novel interpretable survival risk prediction method for ESCC patients based on extreme gradient boosting improved by whale optimization algorithm (WOA-XGBoost) and shapley additive explanations (SHAP). Given the imbalanced nature of the data set, the adaptive synthetic sampling (ADASYN) is first used to generate the samples with high survival risk. Then, an improved clustering by fast search and find of density peaks (IDPC) algorithm based on cosine distance and K nearest neighbors is used to cluster the patients. Next, the prediction model for each cluster is obtained by WOA-XGBoost and the constructed model is visualized with SHAP to uncover the factors hidden in the structured model and improve the interpretability of the black-box model. Finally, the effectiveness of the proposed scheme is demonstrated by analyzing the data collected from the First Affiliated Hospital of Zhengzhou University. The results of the analysis reveal that the proposed methodology exhibits superior performance, as indicated by the area under the receiver operating characteristic curve (AUROC) of 0.918 and accuracy of 0.881.