Constrained Naïve Bayes with application to unbalanced data classification

Central European Journal of Operations Research - Volume 30 - Pages 1403-1425 - 2021
Rafael Blanquero (1,2), Emilio Carrizosa (1,2), Pepa Ramírez-Cobo (3,2), M. Remedios Sillero-Denamiel (4,2)
(1) Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Seville, Spain
(2) IMUS, Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
(3) Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Cádiz, Spain
(4) School of Computer Science and Statistics, Trinity College Dublin (TCD), Dublin, Ireland

Abstract

Naïve Bayes is a tractable and efficient approach to statistical classification. In general classification problems, the consequences of misclassification may differ considerably across classes, making it crucial to control misclassification rates in the most critical classes (which, in many real-world problems, are the minority classes), possibly at the expense of higher misclassification rates in less problematic ones. One traditional approach to this problem assigns misclassification costs to the different classes and applies the Bayes rule by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in real-world applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requiring precise values of misclassification costs, threshold values are set for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, at a reasonable computational cost, the performance measures under consideration achieve the desired levels, yielding a user-friendly constrained classification procedure.
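As a rough illustration of what "adding constraints to the optimization problem underlying the estimation process" can look like, the sketch below writes parameter estimation as a constrained maximum-likelihood problem. The notation is an assumption made here for illustration and is not taken from the paper: $\theta$ denotes the Naïve Bayes parameters, $\ell(\theta)$ the log-likelihood of the training sample $(x_i, y_i)$, $i = 1, \dots, n$, $\hat{p}_k(\theta)$ an empirical estimate of the $k$-th performance measure (for example, the recall of a minority class), and $\lambda_k$ the threshold the user requests for that measure.

```latex
% Hypothetical notation, chosen for illustration only:
%   \theta        Naive Bayes parameters
%   \ell(\theta)  log-likelihood of the training sample
%   \hat{p}_k     empirical estimate of the k-th performance measure
%   \lambda_k     user-specified threshold for that measure
\begin{aligned}
\max_{\theta} \quad & \ell(\theta) = \sum_{i=1}^{n} \log f(x_i, y_i; \theta) \\
\text{s.t.} \quad   & \hat{p}_k(\theta) \ge \lambda_k, \qquad k = 1, \dots, K .
\end{aligned}
```

With no constraints ($K = 0$) this reduces to ordinary maximum-likelihood estimation. The thresholds $\lambda_k$ take the place of the misclassification costs that a cost-sensitive approach would need to be given explicitly.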
