The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

Davide Chicco¹, Giuseppe Jurman²
¹Krembil Research Institute, Toronto, Ontario, Canada
²Fondazione Bruno Kessler, Trento, Italy

Abstract

Background
To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has yet been reached on a single elective measure. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these statistical measures can dangerously show inflated, over-optimistic results, especially on imbalanced datasets.

Results
The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate, which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of the positive class and the size of the negative class in the dataset.

Conclusions
In this article, we show how MCC produces a more informative and truthful score than accuracy and F1 score in evaluating binary classifications: we first explain its mathematical properties, and then illustrate its advantages in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score for evaluating binary classification tasks by all scientific communities.

Keywords

