The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets

PLoS ONE - Volume 10, Issue 3 - Page e0118432
Takaya Saito¹, Marc Rehmsmeier¹
¹ Computational Biology Unit, Department of Informatics, University of Bergen, P. O. Box 7803, N-5020, Bergen, Norway.
