An empirical study of text-based machine learning models for vulnerability detection

Empirical Software Engineering - Volume 28 - Pages 1-45 - 2023
Kollin Napier (1), Tanmay Bhowmik (1), Shaowei Wang (2)
(1) Department of Computer Science and Engineering, Mississippi State University, Mississippi State, USA
(2) Department of Computer Science, University of Manitoba, Winnipeg, Canada

Abstract

As software vulnerabilities grow in complexity and severity, they are becoming harder to identify and mitigate. Although traditional tools remain useful, machine learning models are increasingly being adopted to extend detection efforts. To explore methods of vulnerability detection, we present an empirical study of the effectiveness of text-based machine learning models, drawing on 344 open-source projects, 2,182 vulnerabilities, and 38 vulnerability types. Because vulnerabilities are often available in forms such as code snippets, we build our methodology on functions extracted from source code and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques, and three data processing methods. First, we present results based on full-context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine whether the models, techniques, or methods differ significantly. Based on these results, we answer research questions regarding model prediction when testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve better prediction results when testing within, rather than across, projects and vulnerability types. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.
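To make the idea of a text-based model over function pairings concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the tokenizer, the naive Bayes classifier, the `PAIRS` toy data, and the reading of "equal pairings" as one vulnerable function matched with one patched function are all assumptions made for illustration. The study itself evaluates seven models and five natural language processing techniques that the abstract does not name.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, numbers, and single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(code):
    """Split a source-code function into a flat list of text tokens."""
    return TOKEN_RE.findall(code)


class NaiveBayes:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]

        def score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            log_prior = math.log(self.priors[c] / sum(self.priors.values()))
            return log_prior + sum(
                math.log((self.counts[c][t] + 1) / total) for t in tokens
            )

        return max(self.classes, key=score)


# Toy, made-up pairs: (vulnerable function, patched function).
# Assumption: each pair contributes one sample per class, so classes stay equal.
PAIRS = [
    ("void f(char *s) { char b[8]; strcpy(b, s); }",
     "void f(char *s) { char b[8]; strncpy(b, s, sizeof(b) - 1); }"),
    ("void g(char *s) { char d[16]; strcpy(d, s); }",
     "void g(char *s) { char d[16]; strncpy(d, s, sizeof(d) - 1); }"),
]

docs = [fn for pair in PAIRS for fn in pair]
labels = [label for _ in PAIRS for label in (1, 0)]  # 1 = vulnerable, 0 = patched

model = NaiveBayes().fit(docs, labels)

if __name__ == "__main__":
    unseen = "void h(char *p) { char q[4]; strcpy(q, p); }"
    print(model.predict(unseen))
```

The toy data is separable on a single token (`strcpy` versus `strncpy`), so it says nothing about real-world accuracy. The study's overall conclusion is that text-based models of this kind are not effective at detecting vulnerabilities.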
