Nội dung được dịch bởi AI, chỉ mang tính chất tham khảo

Phát hiện mã độc chưa biết và vấn đề mất cân bằng

Springer Science and Business Media LLC - Tập 5 - Trang 295-308 - 2009

Robert Moskovitch¹, Dima Stopel¹, Clint Feher¹, Nir Nissim¹, Nathalie Japkowicz², Yuval Elovici¹

¹Deutsche Telekom Laboratories, Department of Information Systems Engineering, Ben Gurion University, Be’er Sheva, Israel

²School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada

Tóm tắt

Sự gia tăng gần đây về việc sử dụng mạng đã thúc đẩy việc tạo ra mã độc mới với nhiều mục đích khác nhau. Các phần mềm diệt virus dựa trên chữ ký hiện nay rất chính xác đối với mã độc đã biết, nhưng không thể phát hiện mã độc mới. Gần đây, các thuật toán phân loại đã được áp dụng thành công để phát hiện mã độc chưa biết. Tuy nhiên, các nghiên cứu này thường liên quan đến một bộ dữ liệu thử nghiệm có kích thước giới hạn và tỷ lệ tệp độc hại: tệp an toàn giống nhau trong cả tập huấn luyện và tập thử nghiệm, một tình huống không phản ánh đúng điều kiện thực tế. Chúng tôi trình bày một phương pháp phát hiện mã độc chưa biết, dựa trên các khái niệm phân loại văn bản, dựa trên việc trích xuất n-gram từ mã nhị phân và lựa chọn đặc trưng. Chúng tôi đã thực hiện một đánh giá rộng rãi, với một bộ dữ liệu thử nghiệm gồm hơn 30.000 tệp, trong đó chúng tôi đã điều tra vấn đề mất cân bằng lớp. Trong các kịch bản thực tế, nội dung tệp độc hại được mong đợi là thấp, khoảng 10% tổng số tệp. Đối với các mục đích thực tiễn, không rõ tỷ lệ tương ứng trong tập huấn luyện nên là bao nhiêu. Kết quả của chúng tôi cho thấy độ chính xác lớn hơn 95% có thể đạt được thông qua việc sử dụng một tập huấn luyện có nội dung tệp độc hại dưới 33,3%.

Từ khóa

#mã độc chưa biết #phát hiện mã độc #thuật toán phân loại #trích xuất n-gram #mất cân bằng lớp

Tài liệu tham khảo

Filiol E., Josse S.: A statistical model for undecidable viral detection. J. Comput. Virol. 3, 65–74 (2007) Filiol E.: Malware pattern scanning schemes secure against black-box analysis. J. Comput. Virol. 2, 35–50 (2006) Gryaznov, D.: Scanners of the year 2000: Heuritics. In: Proceedings of the 5th International Virus Bulletin (1999) Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy, 178–184 (2001) Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: N-gram based detection of new malicious code. In: Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC’04) (2004) Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470–478. ACM Press, New York (2004) Mitchell T.: Machine Learning. McGraw-Hill, New York (1997) Henchiri, O., Japkowicz, N.: A feature selection and evaluation scheme for computer virus detection. In: Proceedings of ICDM-2006, pp. 891–895. Hong Kong (2006) Reddy D., Pujari A.: N-gram analysis for computer virus detection. J. Comput. Virol. 2, 231–239 (2006) Kubat, M., Matwin, S.: Addressing the curse of imbalanced data sets: one-sided sampling. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186 (1997) Fawcett T.E., Provost F.: Adaptive fraud detection. Data Min. Knowl. Discov. 1(3), 291–316 (1997) Ling, C.X., Li, C.: Data mining for direct marketing: problems and solutions. In: Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 73–79 (1998) Chawla N.V., Japkowicz N., Kotcz A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004) Japkowicz N., Stephen S.: The class imbalance problem: a systematic study. Intel. Data Anal. J. 6, 5 (2002) Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intel. Res. (JAIR) 16, 321–357 (2002) Lawrence S., Burns I., Back A.D., Tsoi A.C., Giles C.L.: Neural network classification and unequal prior class probabilities. In: Orr, G., Muller, R.-R., Caruana, R.(eds) Tricks of the Trade. Lecture Notes in Computer Science State-of-the-Art Surveys, pp. 299–314. Springer, Heidelberg (1998) Chen, C., Liaw, A., Breiman, L.: Using random forest to learn unbalanced data. Technical Report 666, Statistics Department, University of California at Berkeley (2004) Morik, K., Brockhausen, P., Joachims, T.: Combining statistical learning with a knowledge-based approach—a case study in intensive care monitoring. In: Proceedings of the International Conference of Machine Learning, pp. 268–277 (1999) Weiss G., Provost F.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intel. Res. 19, 315–354 (2003) Salton G., Wong A., Yang C.S.: A vector space model for automatic indexing. Commun. ACM 18, 613–620 (1975) Golub T., Slonim D., Tamaya P., Huard C., Gaasenbeek M., Mesirov J., Coller H., Loh M., Downing J., Caligiuri M., Bloomfield C., Lander E.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999) Bishop C.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995) Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Francisco (1993) Witten I.H., Frank E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Publishers, Inc., San Francisco (2005) Domingos P., Pazzani M.: On the optimality of simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 103–130 (1997) Freund Y., Schapire R.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997) Burges C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 955–974 (1998) Joachims, T.: Making large-scale support vector machine learning practical. Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge (1998) Chang, C., Lin, C.: LIBSVM: a library for support vector machines (2001) Provost, F., Fawcett, T.: Robust classification systems for imprecise environments. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98) (1998) Kubat M., Holte R., Matwin S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30, 195–215 (1998) Karim Md., Walenstein A., Lakhotia A., Parida L.: Malware phylogeny generation using permutations of code. J. Comput. Virol. 1, 13–23 (2005)

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích ảnh hưởng của các bài báo, công bố khoa học Việt Nam và Quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ SciBase

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Hệ thống hội thảo khoa học Việt Nam

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA

Thông tin liên hệ & hỗ trợ

Đơn vị chủ quản, phát triển và vận hành: Công ty Cổ phần Metis

Địa chỉ liên hệ: 26A Lê Đức Thọ, Phường Từ Liêm, Thành phố Hà Nội

Số giấy chứng nhận ĐKKD: 0109293202 cấp ngày 03/08/2020 tại Sở Kế hoạch và Đầu tư thành phố Hà Nội

Người quản lý và chịu trách nhiệm nội dung: Nguyễn Ngọc Sơn

Hotline: 0566.685.688

Email: [email protected]