CatBoost for big data: an interdisciplinary review
Abstract
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current GBDT implementations in order to use them effectively and make successful contributions. CatBoost is a member of the GBDT family of machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices both from studies that cast CatBoost in a positive light and from studies where CatBoost does not outshine other techniques, since both kinds of results are instructive. Furthermore, as a Decision Tree based algorithm, CatBoost is well suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue that we highlight in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach and cover studies related to CatBoost in a single work. This gives researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.