Constrained Naïve Bayes with application to unbalanced data classification
Tóm tắt
The Naïve Bayes is a tractable and efficient approach for statistical classification. In general classification problems, the consequences of misclassifications may be rather different in different classes, making it crucial to control misclassification rates in the most critical and, in many realworld problems, minority cases, possibly at the expense of higher misclassification rates in less problematic classes. One traditional approach to address this problem consists of assigning misclassification costs to the different classes and applying the Bayes rule, by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in realworld applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requesting precise values of misclassification costs, threshold values are used for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, under a reasonable computational cost, indeed, the performance measures under consideration achieve the desired levels yielding a user-friendly constrained classification procedure.
Tài liệu tham khảo
Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J Mult-Valued Logic Soft Comput 17:255–287
Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) KEEL: A Software Tool to Assess Evolutionary Algorithms for Data Mining Problems. Soft Computing 13(3):307–318
Benítez-Peña S, Blanquero R, Carrizosa E, Ramírez-Cobo P (2019) On support vector machines under a multiple-cost scenario. Advances in Data Analysis and Classification 13(3):663–682
Bermejo P, Gámez JA, Puerta JM (2011) Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets. Expert Systems with Applications 38(3):2072–2080
Birgin E, Martínez J (2008) Improving ultimate convergence of an augmented Llagrangian method. Optim Methods Softw 23(2):177–195
Blanquero R, Carrizosa E, Molero-Río C, Romero Morales D (2021) Optimal randomized classification trees. Computers & Operations Research 132:105281
Blanquero R, Carrizosa E, Ramírez-Cobo P, Sillero-Denamiel MR (2021) A cost-sensitive constrained lasso. Advances in Data Analysis and Classification 15:121–158
Boullé M (2007) Compression-based Averaging of Selective Naive Bayes Classifiers. Journal of Machine Learning Research 8:1659–1685
Bradford JP, Kunz C, Kohavi R, Brunk C, Brodley CE (1998) Pruning decision trees with misclassification costs. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin Heidelberg, Berlin, Heidelberg, pp 131–136
Cao P, Zhao D, Zaïane OR (2013) A PSO-based cost-sensitive neural network for imbalanced data classification. In: Li J, Cao L, Wang C, Tan KC, Liu B, Pei J, Tseng VS (eds) Trends and applications in knowledge discovery and data mining. Springer, Berlin Heidelberg, Berlin, Heidelberg, pp 452–463
Carrizosa E, Martín-Barragán B, Romero Morales D (2008) Multi-group support vector machines with measurement costs: A biobjective approach. Discrete Applied Mathematics 156:950–966
Carrizosa E, Romero Morales D (2013) Supervised classification and mathematical optimization. Computers and Operations Research 40(1):150–165
Chandra B, Gupta M (2011) Robust approach for estimating probabilities in Naïve-Bayes classifier for gene expression data. Expert Systems with Applications 38(3):1293–1298
Datta S, Das S (2015) Near–Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Netw 70:39–52
Demšar J (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7:1–30
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130
Freitas A, Costa-Pereira A, Brazdil P (2007) Cost-sensitive decision trees applied to medical data. In: Song IY, Eder J, Nguyen TM (eds) Data Warehousing and Knowledge Discovery. Springer, Berlin Heidelberg, pp 303–312
Guan G, Guo J, Wang H (2014) Varying Naïve Bayes Models With Applications to Classification of Chinese Text Documents. Journal of Business & Economic Statistics 32(3):445–456
Hand DJ, Yu K (2001) Idiot’s Bayes - Not So Stupid After All? International Statistical Review 69(3):385–398
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, NY
He H, Yunqian M (2013) Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, Hoboken
Hogg RV, McKean J, Craig AT (2005) Introduction to Mathematical Statistics. Pearson Education
Jiang L, Wang S, Li C, Zhang L (2016) Structure extended multinomial naive Bayes. Information Sciences 329(Supplement C):346–356
Lee W, Jun CH, Lee JS (2017) Instance categorization by support vector machines to adjust weights in adaboost for imbalanced data classification. Information Sciences 381(Supplement C):92–103
Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N (2018) A survey on addressing high-class imbalance in big data. J Big Data. https://doi.org/10.1186/s40537-018-0151-6
Lichman, M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
Ling CX, Yang Q, Wang J, Zhang S (2004) Decision trees with minimal costs. In: Proceedings of the twenty-first international conference on machine learning, ICML ’04, p. 69. New York, NY, USA
Mehra N, Gupta S (2013) Survey on multiclass classification methods. International Journal of Computer Science and Information Technologies 4(4):572–576
Menzies T, Greenwald J, Frank A (2007) Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Transactions on Software Engineering 33(1):2–13
Minnier J, Yuan M, Liu JS, Cai T (2015) Risk Classification With an Adaptive Naive Bayes Kernel Machine Model. Journal of the American Statistical Association 110(509):393–404
Parthiban G, Rajesh A, Srivatsa SK (2011) Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method. International Journal of Computer Applications 24(3):0975–8887
Peng L, Zhang H, Yang B, Chen Y (2014) A new approach for imbalanced data classification based on data gravitation. Inf Sci 288(Supplement C):347–373
Prati RC, Batista GE, Silva DF (2015) Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems 45:247–270
Romei A, Ruggieri S (2014) A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review 29(5):582–638
Rosen GL, Reichenberger ER, Rosenfeld AM (2010) NBC: the Naïve Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27(1):127–129
Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4):427–437
Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378
Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence 23:687–719
Turhan B, Bener A (2009) Analysis of Naive Bayes’ assumptions on software fault data: An empirical study. Data & Knowledge Engineering 68(2):278–290
Wei W, Visweswaran S, Cooper GF (2011) The application of naive Bayes model averaging to predict Alzheimer’s disease from genome-wide data. Journal of the American Medical Informatics Association 18(4):370–375
Witten DM, Shojaie A, Zhang F (2014) The Cluster Elastic Net for High-Dimensional Regression With Unknown Variable Grouping. Technometrics 56(1):112–122
Wolfson J, Bandyopadhyay S, Elidrisi M, Vazquez-Benitez G, Vock DM, Musgrove D, Adomavicius G, Johnson PE, O’Connor PJ (2015) A Naive Bayes machine learning approach to risk prediction using censored, time-to-event data. Statistics in Medicine 34(21):2941–2957
Wu J, Pan S, Zhu X, Cai Z, Zhang P, Zhang C (2015) Self-adaptive attribute weighting for Naive Bayes classification. Expert Systems with Applications 42(3):1487–1502
Xu QS, Liang YZ (2001) Monte Carlo cross validation. Chemom Intell Lab Syst 56(1):1–11
Yager RR (2006) An extension of the naive Bayesian classifier. Information Sciences 176(5):577–588
Yang Y, Liu X (1999). A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR), pp. 42–49. New York, NY, USA
Zhou Zhi-Hua, Liu Xu-Ying (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77