A support vector machine (SVM) approach to imbalanced datasets of customer responses: comparison with other customer response models

Gitae Kim1, Bongsug Kevin Chae2, David L. Olson3
1Department of Industrial and Manufacturing Systems Engineering, Kansas State University, Manhattan, USA
2Department of Management, Kansas State University, Manhattan, USA
3Department of Management, University of Nebraska, Lincoln, USA

Abstract

Customer response is a crucial aspect of service business. The ability to accurately predict which customer profiles are productive has proven invaluable in customer relationship management. One area that has received little attention in the direct marketing literature is the class imbalance problem (the very low response rate). We propose a customer response prediction approach that combines recency, frequency, and monetary (RFM) variables with support vector machine (SVM) analysis. We identified three direct marketing datasets with different degrees of class imbalance (low, moderate, high) and used random undersampling to reduce the degree of imbalance. We report the empirical results in terms of gain values and prediction accuracy, along with the impact of random undersampling on customer response model performance. We also discuss these empirical results in light of the findings of previous studies, and consider the implications for industry practice and future research.
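The two evaluation ingredients named in the abstract, random undersampling and gain values, can be illustrated with a minimal sketch in standard-library Python. This is not the authors' implementation: the function names, the `ratio` parameter, and the toy RFM rows are all hypothetical, the SVM fit itself is omitted, and purchase frequency stands in for a model score purely for illustration.

```python
import random

def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Keep every responder (label 1) and a random subset of non-responders (label 0).

    `ratio` (hypothetical parameter) is the number of non-responders kept per responder.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]

def cumulative_gain(scores, labels, fraction):
    """Share of all responders found in the top `fraction` of customers ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:int(round(fraction * len(order)))]
    total = sum(labels)
    return sum(labels[i] for i in top) / total if total else 0.0

# Made-up toy data: (recency in days, frequency, monetary) with 2 responders out of 10.
rows = [(1, 9, 500), (2, 8, 400), (30, 1, 20), (40, 1, 15), (25, 2, 30),
        (60, 1, 10), (45, 1, 25), (50, 2, 35), (35, 1, 12), (55, 1, 18)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Balance the training set to one non-responder per responder.
bal_rows, bal_labels = random_undersample(rows, labels, ratio=1.0)

# Stand-in score (assumption): frequency alone, in place of a fitted model's output.
scores = [f for _, f, _ in rows]
gain_top_20 = cumulative_gain(scores, labels, 0.2)
```

In practice the balanced rows would be used to train the classifier, while gain would be computed on held-out data at its original class ratio, so that the reported gain reflects the response rate a campaign would actually face.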
