“Bad smells” in software analytics papers

Information and Software Technology, Volume 112, Pages 35–47, 2019
Tim Menzies1, Martin Shepperd2
1Dept. of Computer Science, North Carolina State University, USA
2Brunel Software Engineering Lab (BSEL), Dept. of Computer Science, Brunel University London, UB8 3PH, UK

References

Agrawal, 2018, What is wrong with topic modeling? And how to fix it using search-based software engineering, Inf. Softw. Technol., 98, 74, 10.1016/j.infsof.2018.02.005
A. Agrawal, T. Menzies, Is “better data” better than “better data miners”? On the benefits of tuning SMOTE for defect prediction, 2018. Proceedings of the 40th International Conference on Software Engineering, ACM, 1050–1061.
Agrawal, 2018, We don't need another hero?: the impact of heroes on software development, 245
Arcuri, 2011, A practical guide for using statistical tests to assess randomized algorithms in software engineering, 1
Beck, 1999, Bad smells in code, 75
A. Begel, T. Zimmermann, Analyze this! 145 questions for data scientists in software engineering, 2014. Proceedings of the 36th ACM International Conference on Software Engineering, ACM, 12–23.
Bender, 2001, Adjusting for multiple testing—when and how?, J. Clin. Epidemiol., 54, 343, 10.1016/S0895-4356(00)00314-0
Benjamini, 2001, The control of the false discovery rate in multiple testing under dependency, Ann. Stat., 29, 1165, 10.1214/aos/1013699998
Bergstra, 2012, Random search for hyper-parameter optimization, J. Mach. Learn. Res., 13, 281
Blair, 1985, Comparison of the power of the paired samples t test to that of Wilcoxon’s signed-ranks test under various population shapes, Psychol. Bull., 97, 119, 10.1037/0033-2909.97.1.119
Booth, 1997, The value of structured abstracts in information retrieval from MEDLINE, Health Libr. Rev., 14, 157, 10.1046/j.1365-2532.1997.1430157.x
Borenstein, 2009
M. Bosu, S. MacDonell, Data quality in empirical software engineering: a targeted review, 2013. Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering, ACM, 171–176.
Budgen, 2011, Reporting computing projects through structured abstracts: a quasi-experiment, Empir. Softw. Eng., 16, 244, 10.1007/s10664-010-9139-3
Button, 2013, Power failure: why small sample size undermines the reliability of neuroscience, Nat. Rev. Neurosci., 14, 365, 10.1038/nrn3475
Carpenter, 2000, Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians, Stat. Med., 19, 1141, 10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F
Carver, 1993, The case against statistical significance testing, revisited, J. Exp. Educ., 61, 287, 10.1080/00220973.1993.10806591
Cawley, 2010, On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res., 11, 2079
Chen, 2018, “Sampling” as a baseline optimizer for search-based software engineering, IEEE Trans. Softw. Eng., 10.1109/TSE.2018.2790925
Chen, 2005, Finding the right data for software cost modeling, IEEE Softw., 22, 38, 10.1109/MS.2005.151
Cohen, 1988
Cohen, 1992, A power primer, Psychol. Bull., 112, 155, 10.1037/0033-2909.112.1.155
Colquhoun, 2014, An investigation of the false discovery rate and the misinterpretation of p-values, R. Soc. Open Sci., 1
Courtney, 1993, Shotgun correlations in software measures, Softw. Eng. J., 8, 5, 10.1049/sej.1993.0002
Cruzes, 2011, Research synthesis in software engineering: a tertiary study, Inf. Softw. Technol., 53, 440, 10.1016/j.infsof.2011.01.004
De Veaux, 2005, How to lie with bad data, Stat. Sci., 20, 231, 10.1214/088342305000000269
Deb, 2014, An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, Part I: solving problems with box constraints, IEEE Trans. Evol. Comput., 18, 577, 10.1109/TEVC.2013.2281535
P. Devanbu, T. Zimmermann, C. Bird, Belief & evidence in empirical software engineering, 2016. IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 108–119.
T. Dybå, T. Dingsøyr, Strength of evidence in systematic reviews in software engineering, 2008. Proceedings of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, 178–187.
Dybå, 2006, A systematic review of statistical power in software engineering experiments, Inf. Softw. Technol., 48, 745, 10.1016/j.infsof.2005.08.009
Earp, 2015, Replication, falsification, and the crisis of confidence in social psychology, Front. Psychol., 6, 621, 10.3389/fpsyg.2015.00621
Efron, 1993
Ellis, 2010
Erceg-Hurn, 2008, Modern robust statistical methods: an easy way to maximize the accuracy and power of your research, Am. Psychol., 63, 591, 10.1037/0003-066X.63.7.591
Fu, 2017, Easy over hard: a case study on deep learning, 49
W. Fu, T. Menzies, Revisiting unsupervised learning for defect prediction, 2017. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ACM, 72–83.
Fu, 2016, Tuning for software analytics, Inf. Softw. Technol., 76, 135, 10.1016/j.infsof.2016.04.017
Fu, 2017, Why is differential evolution better than grid search for tuning defect predictors?, CoRR
Gelman, 2013
B. Ghotra, S. McIntosh, A.E. Hassan, Revisiting the impact of classification techniques on the performance of defect prediction models, 2015. Proceedings of the 37th International Conference on Software Engineering, Volume 1, IEEE Press, 789–800.
Goodman, 2016, What does research reproducibility mean?, Sci. Transl. Med., 8, 10.1126/scitranslmed.aaf5027
Hall, 2003, Benchmarking attribute selection techniques for discrete class data mining, IEEE Trans. Knowl. Data Eng., 15, 1437, 10.1109/TKDE.2003.1245283
Hawkins, 2004, The problem of overfitting, J. Chem. Inf. Comput. Sci., 44, 1, 10.1021/ci0342472
Healy, 2018
Herodotou, 2011, Starfish: a self-tuning system for big data analytics, 11, 261
Hoaglin, 1983
Holte, 1993, Very simple classification rules perform well on most commonly used datasets, Mach. Learn., 11, 63, 10.1023/A:1022631118932
Huang, 2017, Power, false discovery rate and winner’s curse in eQTL studies, bioRxiv, 209171
Q. Huang, X. Xia, D. Lo, Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction, 2017. IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 159–170.
Ioannidis, 2005, Why most published research findings are false, PLoS Med., 2, e124, 10.1371/journal.pmed.0020124
Ivarsson, 2011, A method for evaluating rigor and industrial relevance of technology evaluations, Empir. Softw. Eng., 16, 365, 10.1007/s10664-010-9146-4
Johnson, 2012, Where’s the theory for software engineering?, IEEE Softw., 29, 10.1109/MS.2012.127
Jørgensen, 2016, Incorrect results in software engineering experiments: how to improve research practices, J. Syst. Softw., 116, 133, 10.1016/j.jss.2015.03.065
Kampenes, 2007, A systematic review of effect size in software engineering experiments, Inf. Softw. Technol., 49, 1073, 10.1016/j.infsof.2007.02.015
Kitchenham, 2015
Kitchenham, 2004, Evidence-based software engineering, 273
Kitchenham, 2017, Robust statistical methods for empirical software engineering, Empir. Softw. Eng., 22, 579, 10.1007/s10664-016-9437-5
Kitchenham, 2002, Preliminary guidelines for empirical research in software engineering, IEEE Trans. Softw. Eng., 28, 721, 10.1109/TSE.2002.1027796
Kitchenham, 2008, Length and readability of structured software engineering abstracts, IET Softw., 2, 37, 10.1049/iet-sen:20070044
Ko, 2015, A practical guide to controlled experiments of software engineering tools with human participants, Empir. Softw. Eng., 20, 110, 10.1007/s10664-013-9279-3
Krishna, 2017, What is the connection between issues, bugs, and enhancements?: lessons learned from 800+ software projects, 306
R. Krishna, T. Menzies, Simpler transfer learning (using “bellwethers”), http://arxiv.org/abs/1703.06218
Krishna, 2016, The ’bigSE’ project: lessons learned from validating industrial text mining, 65
Lakens, 2017, Equivalence tests: a practical primer for t tests, correlations, and meta-analyses, Soc. Psychol. Pers. Sci., 8, 355, 10.1177/1948550617697177
Lakens, 2014, Sailing from the seas of chaos into the corridor of stability: practical recommendations to increase the informational value of studies, Perspect. Psychol. Sci., 9, 278, 10.1177/1745691614528520
LeCun, 2015, Deep learning, Nature, 521, 436, 10.1038/nature14539
Lessmann, 2008, Benchmarking classification models for software defect prediction: a proposed framework and novel findings, IEEE Trans. Softw. Eng., 34, 485, 10.1109/TSE.2008.35
Liebchen, 2008, Data sets and data quality in software engineering
G. Liebchen, M. Shepperd, Data sets and data quality in software engineering: eight years on, 2016. Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, ACM.
D. Lo, N. Nagappan, T. Zimmermann, How practitioners perceive the relevance of software engineering research, 2015. Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ACM, 415–425.
Loken, 2017, Measurement error and the replication crisis, Science, 355, 584, 10.1126/science.aal3618
Madeyski, 2017, Would wider adoption of reproducible research be beneficial for empirical software engineering research?, J. Intell. Fuzzy Syst., 32, 1509, 10.3233/JIFS-169146
Manly, 1997
Maxwell, 2008, Sample size planning for statistical power and accuracy in parameter estimation, Annu. Rev. Psychol., 59, 537, 10.1146/annurev.psych.59.103006.093735
McClelland, 2000, Increasing statistical power without increasing sample size, Am. Psychol., 55, 963, 10.1037/0003-066X.55.8.963
Menzies, 2007, Data mining static code attributes to learn defect predictors, IEEE Trans. Softw. Eng., 33, 2, 10.1109/TSE.2007.256941
Menzies, 2012
Mittas, 2013, Ranking and clustering software cost estimation models through a multiple comparisons algorithm, IEEE Trans. Softw. Eng., 39, 537, 10.1109/TSE.2012.45
Munafò, 2017, A manifesto for reproducible science, Nat. Hum. Behav., 1, 0021, 10.1038/s41562-016-0021
Myatt, 2009
Nair, 2018, Data-driven search-based software engineering
Nickerson, 1998, Confirmation bias: a ubiquitous phenomenon in many guises, Rev. Gen. Psychol., 2, 175, 10.1037/1089-2680.2.2.175
Open Science Collaboration, 2015, Estimating the reproducibility of psychological science, Science, 349, aac4716, 10.1126/science.aac4716
Petersen, 2015, Guidelines for conducting systematic mapping studies in software engineering: an update, Inf. Softw. Technol., 64, 1, 10.1016/j.infsof.2015.03.007
Rosli, 2013, Can we trust our results? A mapping study on data quality, 116
Runeson, 2009, Guidelines for conducting and reporting case study research in software engineering, Empir. Softw. Eng., 14, 131, 10.1007/s10664-008-9102-8
Saltelli, 2000, Sensitivity analysis as an ingredient of modeling, Stat. Sci., 15, 377, 10.1214/ss/1009213004
A. Sarkar, J. Guo, N. Siegmund, S. Apel, K. Czarnecki, Cost-efficient sampling for performance prediction of configurable systems (T), 2015. 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 342–352.
A.S. Sayyad, H. Ammar, Pareto-optimal search-based software engineering (POSBSE): a literature survey, 2013. 2nd International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE), IEEE, 21–27.
Schmidt, 2014
Shadish, 2002
M. Shaw, Writing good software engineering research papers, 2003. 25th IEEE International Conference on Software Engineering, IEEE Computer Society, 726–736.
Shepperd, 2014, Researcher bias: the use of machine learning in software defect prediction, IEEE Trans. Softw. Eng., 40, 603, 10.1109/TSE.2014.2322358
Shepperd, 2013, Data quality: some comments on the NASA software defect datasets, IEEE Trans. Softw. Eng., 39, 1208, 10.1109/TSE.2013.11
R. Silberzahn, E. Uhlmann, D. Martin, P. Anselmi, F. Aust, E. Awtrey, Š. Bahník, F. Bai, C. Bannard, E. Bonnier, et al., Many analysts, one dataset: making transparent how variations in analytical choices affect results, https://osf.io/j5v8f
Simmons, 2011, False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant, Psychol. Sci., 22, 1359, 10.1177/0956797611417632
Sjøberg, 2008, Building theories in software engineering, 312
Smaldino, 2016, The natural selection of bad science, R. Soc. Open Sci., 3, 160384, 10.1098/rsos.160384
Snoek, 2012, Practical Bayesian optimization of machine learning algorithms, 2951
Spence, 2016, Prediction interval: what to expect when you’re expecting...a replication, PLoS ONE, 11, e0162874, 10.1371/journal.pone.0162874
K.J. Stol, B. Fitzgerald, Uncovering theories in software engineering, 2013. 2nd SEMAT Workshop on a General Theory of Software Engineering (GTSE), IEEE, 5–14.
K.J. Stol, P. Ralph, B. Fitzgerald, Grounded theory in software engineering research: a critical review and guidelines, 2016. IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 120–131.
Storn, 1997, Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces, J. Global Optim., 11, 341, 10.1023/A:1008202821328
Szucs, 2017, Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature, PLoS Biol., 15, e2000797, 10.1371/journal.pbio.2000797
C. Tantithamthavorn, S. McIntosh, A. Hassan, K. Matsumoto, Automated parameter optimization of classification techniques for defect prediction models, 2016. IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 321–332.
Tantithamthavorn, 2018, The impact of automated parameter optimization on defect prediction models, IEEE Trans. Softw. Eng., 10.1109/TSE.2018.2794977
C. Theisen, M. Dunaiski, L. Williams, W. Visser, Writing good software engineering research papers: revisited, 2017. IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), IEEE, 402–402.
D. Van Aken, A. Pavlo, G. Gordon, B. Zhang, Automatic database management system tuning through large-scale machine learning, 2017. Proceedings of the 2017 ACM International Conference on Management of Data, ACM, 1009–1024.
Whigham, 2015, A baseline model for software effort estimation, ACM Trans. Softw. Eng. Methodol. (TOSEM), 24, 20, 10.1145/2738037
Wilcox, 2012
Wohlin, 2012
Zhang, 2007, MOEA/D: a multiobjective evolutionary algorithm based on decomposition, IEEE Trans. Evol. Comput., 11, 712, 10.1109/TEVC.2007.892759
Zimmerman, 1998, Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions, J. Exp. Educ., 67, 55, 10.1080/00220979809598344
Zimmermann, 2004, Mining version histories to guide software changes, 563