On the use of textual feature extraction techniques to support the automated detection of refactoring documentation

Innovations in Systems and Software Engineering - Tập 18 - Trang 233-249 - 2021
Licelot Marmolejos1, Eman Abdullah AlOmar1, Mohamed Wiem Mkaouer1, Christian Newman1, Ali Ouni2
1Rochester Institute of Technology, Rochester, USA
2ETS Montreal, University of Quebec, Quebec City, Canada

Tóm tắt

Refactoring is the art of improving the internal structure of a program without altering its external behavior, and it is an important task when it comes to software maintainability. While existing studies have focused on the detection of refactoring operations by mining software repositories, little was done to understand how developers document their refactoring activities. Therefore, there is recent trend trying to detect developers documentation of refactoring, by manually analyzing their internal and external software documentation. However, these techniques are limited by their manual process, which hinders their scalability. Hence, in this study, we tackle the detection of refactoring documentation as binary classification problem. We focus on the automatic detection of refactoring activities in commit messages by relying on text-mining, natural language preprocessing, and supervised machine learning techniques. We design our tool to overcome the limitation of the manual process, previously proposed by existing studies, through exploring the transformation of commit messages into features that are used to train various models. For our evaluation, we use and compare five different binary classification algorithms, and we test the effectiveness of these models using an existing dataset of manually curated messages that are known to be documenting refactoring activities in the source code. The experiments are carried out with different data sizes and number of bits. As per our results, the combination of Chi-Squared with Bayes point machine and Fisher score with Bayes point machine could be the most efficient when it comes to automatically identifying refactoring text patterns in commit messages, with an accuracy of 0.96, and an F-score of 0.96.

Tài liệu tham khảo

bekvon/Residence. https://github.com/bekvon/residence/commit/76c364ea47e5a28b2041a0bb3323cb48bab180c9. Accessed 3 Jan 2021 AlOmar EA, Mkaouer MW, Ouni A (2019) Can refactoring be self-affirmed? an exploratory study on how developers document their refactoring activities in commit messages. In: Proceedings of the 3nd international workshop on refactoring-accepted. IEEE AlOmar EA, Mkaouer MW, Ouni A, Kessentini M (2019) On the impact of refactoring on the relationship between quality attributes and design metrics. In: 2019 ACM/IEEE international symposium on empirical software engineering and measurement (ESEM), pp 1–11. IEEE AlOmar EA, Peruma A, Mkaouer MW, Newman C, Ouni A, Kessentini M (2020) How we refactor and how we document it? On the use of supervised machine learning algorithms to classify refactoring documentation. Expert Syst Appl 167:114176 AlOmar EA, Rodriguez PT, Bowman J, Wang T, Adepoju B, Lopez K, Newman CD, Ouni A, Mkaouer MW (2020) How do developers refactor code to improve code reusability? In: International conference on software and systems reuse. Springer Alrubaye H, Mkaouer MW, Ouni A (2019) Migrationminer: an automated detection tool of third-party java library migration at the method level. In: 2019 IEEE international conference on software maintenance and evolution (ICSME), pp 414–417. IEEE Alrubaye H, Mkaouer MW, Ouni A (2019) On the use of information retrieval to automate the detection of third-party java library migration at the method level. In: Proceedings of the 27th international conference on program comprehension, pp 347–357. IEEE Press Bavota G, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) An empirical study on the developers’ perception of software coupling. In: Proceedings of the 2013 international conference on software engineering, pp 692–701. IEEE Press Bavota G, Panichella S, Tsantalis N, Di Penta M, Oliveto R, Canfora G (2014)Recommending refactorings based on team co-maintenance patterns. In: Proceedings of the 29th ACM/IEEE international conference on automated software engineering, pp 337–342. ACM Chávez A, Ferreira I, Fernandes E, Cedrim D, Garcia A (2017) How does refactoring affect internal quality attributes?: a multi-project study. In: Proceedings of the 31st Brazilian symposium on software engineering, pp 74–83. ACM Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Soft Eng 20(6):476–493 Demeyer S, Ducasse S, Nierstrasz O (2000) Finding refactorings via change metrics. In: ACM SIGPLAN notices, vol 35, pp 166–177. ACM Di Z, Li B, Li Z, Liang P (2018) A preliminary investigation of self-admitted refactorings in open source software (S). In: The 30th international conference on software engineering and knowledge engineering, Hotel Pullman, Redwood City, California, USA, July 1–3, [13], pp 165–164. https://doi.org/10.18293/SEKE2018-081 Dig D, Comertoglu C, Marinov D, Johnson R (2006) Automated detection of refactorings in evolving components. In: European conference on object-oriented programming, pp 404–428. Springer Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296 Hattori LP, Lanza M (2008) On the nature of commits. In: Proceedings of the 23rd IEEE/ACM international conference on automated software engineering, pp III–63. IEEE Press Hayashi S, Tsuda Y, Saeki M (2010) Search-based refactoring detection from source code revisions. IEICE Trans Inf Syst 93(4):754–762 Herbrich R, Graepel T, Campbell C (2001) Bayes point machines. J Mach Learn Res 1:245–279 Hindle A, German DM, Holt R (2008) What do large commits tell us?: a taxonomical study of large commits. In: Proceedings of the 2008 international working conference on Mining software repositories, pp 99–108. ACM Howe NR, Rath TM, Manmatha R (2005) Boosted decision trees for word recognition in handwritten document retrieval. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp 377–383. ACM Kehrer T, Kelter U, Taentzer G (2011) A rule-based approach to the semantic lifting of model differences in the context of model versioning. In: Proceedings of the 2011 26th IEEE/ACM international conference on automated software engineering, pp 163–172. IEEE Computer Society Kim M, Gee M, Loh A, Rachatasumrit N (2010) Ref-finder: a refactoring reconstruction tool based on logic query templates. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering, pp 371–372. ACM Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10(4):150 Mahouachi R, Kessentini M, Cinnéide MÓ (2013) Search-based refactoring detection using software metrics variation. In: International symposium on search based software engineering, pp 126–140. Springer Mansouri MM (2018) Detection of rename local variable refactoring instances in commit history. PhD thesis, Concordia University Mkaouer MW, Kessentini M, Bechikh S, Deb K, Ó Cinnéide M (2014) Recommendation system for software refactoring using innovization and interactive dynamic optimization. In: Proceedings of the 29th ACM/IEEE international conference on automated software engineering, pp 331–336. ACM Mkaouer MW, Kessentini M, Cinnéide MÓ, Hayashi S, Deb K (2017) A robust multi-objective approach to balance severity and importance of refactoring opportunities. Emp Softw Eng 22(2):894–927 Mund S (2015) Microsoft azure machine learning. Packt Publishing Ltd Murphy-Hill E, Parnin C, Black AP (2011) How we refactor, and how we know it. IEEE Trans Softw Eng 38(1):5–18 Opdyke WF (1992) Refactoring object-oriented frameworks. University of Illinois at Urbana-Champaign, Champaign Pan B, Tian Y, Zhou TS, Wang F, Li JS (2015) Study on image encryption method in clinical data exchange. In: 2015 7th international conference on information technology in medicine and education (ITME), pp 252–255. IEEE Ratzinger J, Sigmund T, Gall HC (2008) On the relation of refactorings and software defect prediction. In: Proceedings of the 2008 international working conference on mining software repositories, pp 35–38. ACM Saif H, Fernandez M, He Y, Alani H (2014) On stopwords, filtering and data sparsity for sentiment analysis of Twitter. In: Proceedings of the 9th international conference on language resources and evaluation (LREC’14), pp 810–817. European language resources association (ELRA), Reykjavik, Iceland. http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf Shi Q, Petterson J, Dror G, Langford J, Smola A, Vishwanathan S (2009) Hash kernels for structured data. J Mach Learn Res 10:2615–2637 Silva D, Valente MT (2017) Refdiff: detecting refactorings in version histories. In: Proceedings of the 14th international conference on mining software repositories, pp 269–279. IEEE Press Soares G, Gheyi R, Serey D, Massoni T (2010) Making program refactoring safer. IEEE Softw 27(4):52–57 Soetens QD, Perez J, Demeyer S (2013) An initial investigation into change-based reconstruction of floss-refactorings. In: 2013 IEEE international conference on software maintenance, pp 384–387. IEEE Stroggylos K, Spinellis D (2007) Refactoring–does it improve software quality? In: 15th international workshop on software quality (WoSQ’07: ICSE workshops 2007), p 10. IEEE Taneja K, Dig D, Xie T (2007) Automated detection of api refactorings in libraries. In: Proceedings of the 22nd IEEE/ACM international conference on automated software engineering, pp 377–380. ACM Thangthumachit S, Hayashi S, Saeki M (2011) Understanding source code differences by separating refactoring effects. In: 2011 18th Asia-Pacific software engineering conference, pp 339–347. IEEE Tsantalis N, Chaikalis T, Chatzigeorgiou A (2008) Jdeodorant: identification and removal of type-checking bad smells. In: 2008 12th European conference on software maintenance and reengineering, pp 329–331. IEEE Tsantalis N, Mansouri M, Eshkevari L, Mazinanian D, Dig D (2018) Accurate and efficient refactoring detection in commit history. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp 483–494. IEEE Weinberger K, Dasgupta A, Attenberg J, Langford J, Smola A (2009) Feature hashing for large scale multitask learning. ArXiv preprint arXiv:0902.2206 Weissgerber P, Diehl S (2006) Identifying refactorings from source-code changes. In: 21st IEEE/ACM international conference on automated software engineering (ASE’06), pp 231–240. IEEE Xing Z, Stroulia E (2005) Umldiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering, pp 54–65. ACM Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: ICML, vol 97, p 35