Using black-box performance models to detect performance regressions under varying workloads: an empirical study

Lizhi Liao1, Jinfu Chen1, Heng Li2, Yi Zeng1, Weiyi Shang1, Jianmei Guo3, Catalin Sporea4, Andrei Toma4, Sarah Sajedi4
1Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
2Département de génie informatique et génie logiciel, Polytechnique Montréal, Montreal, Canada
3Alibaba Group, Hangzhou, China
4ERA Environmental Management Solutions, Montreal, Canada

Abstract

Performance regressions of large-scale software systems often lead to both financial and reputational losses. To detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Performance analysis is then performed to check whether a software version has a performance regression against an earlier version. However, the real workloads in the field are constantly changing, making it unrealistic to reproduce field workloads with predefined test suites. More importantly, performance testing is usually very expensive, as it requires extensive resources and lasts for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can leverage our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities that are recorded in readily available logs. Our approaches then compare the black-box model derived from the current software version with the one derived from an earlier version to detect performance regressions between the two versions. We performed empirical experiments on two open-source systems and applied our approaches to a large-scale industrial system. Our results show that such black-box models can effectively detect, in a timely manner, both real and injected performance regressions under varying workloads that are unseen when training these models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.
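To make the idea concrete, the following minimal sketch (in Python, using scikit-learn and SciPy) illustrates one way such a black-box comparison could work. It is a simplified illustration under stated assumptions rather than the authors' exact pipeline: the names detect_regression, X_old, y_old, X_new, and y_new are hypothetical. Each row of X is assumed to count the log events observed in a time interval, and y is the CPU usage measured in the same interval. A model trained on the earlier version is applied to data from the new version; if its prediction errors grow significantly, the relationship between workload and performance has changed, signaling a candidate regression.

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def detect_regression(X_old, y_old, X_new, y_new, alpha=0.05):
    # Train a black-box model on the earlier version's data only.
    X_train, X_test, y_train, y_test = train_test_split(
        X_old, y_old, test_size=0.5, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Prediction errors on held-out old-version data vs. new-version data.
    err_old = np.abs(model.predict(X_test) - y_test)
    err_new = np.abs(model.predict(X_new) - y_new)

    # If the old-version model fits the new version significantly worse,
    # flag a candidate performance regression.
    _, p_value = mannwhitneyu(err_old, err_new, alternative="less")
    return p_value < alpha

In practice, one would also check the magnitude of the difference (e.g., with an effect-size measure such as Cliff's delta) so that statistically significant but negligible error increases are not flagged.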

Keywords

