A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 Big Data

Joffrey L. Leevy¹, Taghi M. Khoshgoftaar¹
¹Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Abstract

The exponential growth in computer networks and network applications worldwide has been matched by a surge in cyberattacks. For this reason, datasets such as CSE-CIC-IDS2018 were created to train predictive models for network-based intrusion detection. These datasets are not meant to serve as repositories for signature-based detection systems, but rather to promote research on anomaly-based detection through various machine learning approaches. CSE-CIC-IDS2018 contains about 16,000,000 instances collected over the course of ten days. It is the most recent publicly available intrusion detection dataset that qualifies as big data and covers a wide range of attack types. This multi-class dataset is class-imbalanced, with roughly 17% of the instances comprising attack (anomalous) traffic. Our survey contributes several key findings. First, the best performance scores reported by each study, where available, were unexpectedly high overall, which may be due to overfitting. Second, most of the works did not address class imbalance, the effects of which can bias results in a big data study. Lastly, information on the data cleaning of CSE-CIC-IDS2018 was inadequate across the board, which may indicate problems with the reproducibility of experiments. Our survey also identifies major research gaps.
