Multiple Bayesian discriminant functions for high-dimensional massive data classification

Data Mining and Knowledge Discovery - Volume 31 - Pages 465-501 - 2016
Jianfei Zhang1, Shengrui Wang1, Lifei Chen2, Patrick Gallinari3
1ProspectUS Laboratoire, Département d’Informatique, Université de Sherbrooke, Sherbrooke, Canada
2School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, China
3Laboratoire d'Informatique de Paris 6 (LIP6), Université Pierre et Marie Curie, Paris, France

Abstract

Complex distributions of samples concealed in high-dimensional, massive sample-size data challenge current classification methods for data mining. Samples within a class usually do not uniformly fill a single (sub)space but are concentrated in distinct regions of diverse feature subspaces, revealing class dispersion. Applied to such complex data, current classifiers inherently suffer from either high complexity or weak classification ability, owing to the imbalance between the flexibility and the generalization ability of the discriminant functions they use. To address this concern, we propose a novel representation of discriminant functions in Bayesian inference that allows multiple Bayesian decision boundaries per class, each in its own subspace. To this end, we design a learning algorithm that incorporates naive Bayes and feature weighting into structural risk minimization to learn multiple Bayesian discriminant functions for each class, thus combining the simplicity and effectiveness of naive Bayes with the benefits of feature weighting in handling high-dimensional data. The proposed learning scheme affords a recursive algorithm for exploring the class density distribution for Bayesian estimation, and an automated approach for selecting powerful discriminant functions while keeping the complexity of the classifier low. Experimental results on real-world data with millions of samples and features demonstrate the promising performance of our approach.
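
To make the central idea concrete, below is a minimal sketch (not the authors' implementation) of classification with multiple weighted naive-Bayes discriminant functions per class: each class owns several discriminant functions, one per concentrated region, each carrying its own per-feature weights that define a soft subspace, and a sample is assigned to the class whose best-matching function scores highest. The diagonal-Gaussian class-conditional densities and all names here (`WeightedNBDiscriminant`, `classify`, the weight and prior fields) are illustrative assumptions.

```python
import numpy as np

class WeightedNBDiscriminant:
    """One discriminant function: a diagonal-Gaussian naive Bayes model
    with per-feature weights defining a soft feature subspace (assumed form)."""

    def __init__(self, mean, var, weights, log_prior):
        self.mean = np.asarray(mean)        # per-feature means of the region
        self.var = np.asarray(var)          # per-feature variances (> 0)
        self.weights = np.asarray(weights)  # per-feature weights, summing to 1
        self.log_prior = log_prior          # log prior of the region

    def score(self, x):
        # Weighted naive-Bayes log-density: each feature contributes in
        # proportion to its weight, damping irrelevant dimensions.
        log_like = -0.5 * (np.log(2.0 * np.pi * self.var)
                           + (x - self.mean) ** 2 / self.var)
        return self.log_prior + float(np.dot(self.weights, log_like))


def classify(x, functions_per_class):
    """Assign x to the class whose best discriminant function scores highest.

    functions_per_class maps a class label to a list of
    WeightedNBDiscriminant objects, i.e. multiple decision boundaries
    per class, each in its own weighted subspace.
    """
    best_label, best_score = None, -np.inf
    for label, funcs in functions_per_class.items():
        s = max(f.score(x) for f in funcs)
        if s > best_score:
            best_label, best_score = label, s
    return best_label


# Illustrative use with two made-up regions per class:
rng = np.random.default_rng(0)
d = 5
fpc = {
    label: [WeightedNBDiscriminant(rng.normal(size=d), np.ones(d),
                                   np.full(d, 1.0 / d), np.log(0.5))
            for _ in range(2)]
    for label in ("pos", "neg")
}
print(classify(rng.normal(size=d), fpc))
```

Taking the maximum over a class's discriminant functions is what yields multiple decision boundaries per class; how the regions, the feature weights, and the number of functions are actually learned, via structural risk minimization, is the subject of the paper itself.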
