The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
Abstract
To evaluate binary classifications and their confusion matrices, researchers can use several statistical rates, depending on the goal of the experiment under investigation. Although this is a crucial issue in machine learning, no broad consensus has yet been reached on a single measure of choice. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these statistical measures can show dangerously overoptimistic, inflated results, especially on imbalanced datasets.
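As a brief worked illustration of how accuracy and F1 score can look inflated on an imbalanced dataset, consider the standard definitions applied to a hypothetical confusion matrix. The counts below (95 positives, 5 negatives) are assumed for illustration and are not taken from the article.

```latex
% Standard definitions from the confusion matrix entries
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
\]
% Hypothetical counts (assumed): TP = 90, FN = 5, TN = 1, FP = 4
\[
\text{accuracy} = \frac{90 + 1}{100} = 0.91,
\qquad
F_1 = \frac{180}{180 + 4 + 5} = \frac{180}{189} \approx 0.95
\]
```

Both values look strong even though, in this assumed example, only 1 of the 5 negative elements is classified correctly.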
The Matthews correlation coefficient (MCC), by contrast, is a more reliable statistical rate: it produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to both the size of the positive class and the size of the negative class in the dataset.
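The contrast between the three measures can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `binary_metrics` and the confusion matrix counts are hypothetical. Returning 0.0 when a denominator is zero is an assumed convention chosen here for simplicity.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, F1, MCC) for one binary confusion matrix.

    Hypothetical helper for illustration; a zero denominator yields 0.0,
    which is an assumed convention rather than a definition.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total

    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four categories in both numerator and denominator.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return accuracy, f1, mcc


# Assumed imbalanced example: 95 positives, 5 negatives.
accuracy, f1, mcc = binary_metrics(tp=90, fn=5, tn=1, fp=4)
print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
```

With these assumed counts, accuracy and F1 score are both above 0.9 while MCC stays below 0.14, because only one of the five negatives is recognized and MCC accounts for the negative class as well.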
In this article, we show how MCC produces a more informative and truthful score than accuracy and F1 score when evaluating binary classifications, first by explaining its mathematical properties, and then by demonstrating the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that all scientific communities should prefer the Matthews correlation coefficient to accuracy and F1 score when evaluating binary classification tasks.