Performance analysis of machine learning models for intrusion detection system using Gini Impurity-based Weighted Random Forest (GIWRF) feature selection technique

Raisa Abedin Disha1, Sajjad Waheed2
1Department of Information and Communication Technology, Bangladesh University of Professionals, Mirpur Cantonment, Dhaka, 1216, Bangladesh
2Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail 1902, Bangladesh

Abstract

To protect networks, resources, and sensitive data, the intrusion detection system (IDS) has become a fundamental component of organizations' defenses against cybercriminal activity. Several approaches have been introduced and implemented so far to thwart malicious activities. Given the effectiveness of machine learning (ML) methods, the proposed approach applies several ML models to intrusion detection. To evaluate the models, the UNSW-NB15 and Network TON_IoT datasets were used for offline analysis. Both datasets are newer than the NSL-KDD dataset and represent modern-day attacks more closely. The performance analysis was carried out by training and testing Decision Tree (DT), Gradient Boosting Tree (GBT), Multilayer Perceptron (MLP), AdaBoost, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) models on the binary classification task. Because IDS performance deteriorates with a high-dimensional feature vector, an optimal subset of features was selected with a Gini Impurity-based Weighted Random Forest (GIWRF) model as the embedded feature selection technique. This technique uses Gini impurity as the splitting criterion of the trees and adjusts the weights of the two classes of the imbalanced data so that the learning algorithm accounts for the class distribution. Based on the importance scores, 20 features were selected from UNSW-NB15 and 10 features from the Network TON_IoT dataset. The experimental results show that DT performed better with the feature selection technique than the other models trained in this experiment. Moreover, the proposed GIWRF-DT outperformed other existing methods surveyed in the literature in terms of F1 score.
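The core idea behind the feature selection step, class-weighted Gini impurity used to score features, can be illustrated with a small standard-library Python sketch. This is not the authors' implementation. It scores each feature by its best single split rather than by averaging over a forest, it assumes "balanced" class weights (n_samples / (n_classes * class_count)), and the feature names `duration` and `noise` and the toy data are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_gini(labels, weights):
    """Gini impurity 1 - sum(p_c^2), with p_c taken from class-weighted counts."""
    totals = Counter()
    for y in labels:
        totals[y] += weights[y]
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def split_gain(values, labels, threshold, weights):
    """Weighted impurity decrease from splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]

    def mass(ys):
        return sum(weights[y] for y in ys)

    children = (mass(left) * weighted_gini(left, weights)
                + mass(right) * weighted_gini(right, weights)) / mass(labels)
    return weighted_gini(labels, weights) - children


def top_k_features(columns, labels, k):
    """Rank features by their best single-split gain and keep the top k names."""
    weights = balanced_class_weights(labels)
    scores = {}
    for name, values in columns.items():
        # Drop the largest value so both sides of every split are non-empty.
        thresholds = sorted(set(values))[:-1]
        scores[name] = max(
            (split_gain(values, labels, t, weights) for t in thresholds),
            default=0.0,
        )
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical imbalanced toy data: 8 normal records (0) and 2 attacks (1).
labels = [0] * 8 + [1] * 2
columns = {
    "duration": [1, 1, 1, 1, 1, 1, 1, 1, 9, 9],  # separates the classes
    "noise": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],     # unrelated to the label
}
weights = balanced_class_weights(labels)
selected = top_k_features(columns, labels, k=1)
```

With these weights the minority class carries the same total mass as the majority class, so the root impurity is 0.5 rather than the unweighted 0.32. That is the sense in which the weighting lets the learner account for the class distribution. A full random forest would accumulate such impurity decreases over many trees and many splits to obtain the importance scores.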
