Using black-box performance models to detect performance regressions under varying workloads: an empirical study
Abstract
Performance regressions in large-scale software systems often lead to both financial and reputational losses. To detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Performance analysis is then carried out to check whether a software version has regressed in performance relative to an earlier version. However, real workloads in the field change constantly, making it unrealistic to reproduce field workloads in predefined test suites. More importantly, performance testing is usually very expensive, as it requires extensive resources and runs for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can use our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities recorded in readily available logs. Our approaches then compare the black-box models derived from the current software version with those derived from an earlier version to detect performance regressions between the two versions. We performed empirical experiments on two open-source systems and applied our approaches to a large-scale industrial system. Our results show that such black-box models can detect both real and injected performance regressions effectively and in a timely manner, under varying workloads that were unseen when training the models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.
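The idea of modelling performance from logged runtime activities and comparing versions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that logs have already been reduced to per-window event counts, uses ordinary least squares as a stand-in for whichever black-box model is chosen, and compares prediction errors with Cliff's delta. The function names (`fit_linear`, `cliffs_delta`, `simulate`, `detect_regression`), the synthetic data, and the 0.474 cutoff (a commonly used threshold for a "large" effect) are all assumptions made for illustration.

```python
import random


def fit_linear(X, y):
    """Fit y ~ intercept + X by ordinary least squares (normal equations)."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs."""
    gt = sum(1 for x in a for y in b if x > y)
    lt = sum(1 for x in a for y in b if x < y)
    return (gt - lt) / (len(a) * len(b))


def simulate(n, cost_per_event, rng):
    """Synthetic data: per-window log event counts and the resulting CPU usage."""
    X, y = [], []
    for _ in range(n):
        counts = [rng.randint(0, 50), rng.randint(0, 50)]  # varying workload
        cpu = 5.0 + sum(k * c for k, c in zip(cost_per_event, counts))
        X.append(counts)
        y.append(cpu + rng.gauss(0, 1.0))
    return X, y


def detect_regression(old_train, old_holdout, new, threshold=0.474):
    """Train on the old version, then compare prediction errors on old vs. new data.

    The threshold is an assumed cutoff for a large effect size.
    """
    coef = fit_linear(*old_train)
    err_old = [abs(predict(coef, x) - t) for x, t in zip(*old_holdout)]
    err_new = [abs(predict(coef, x) - t) for x, t in zip(*new)]
    delta = cliffs_delta(err_new, err_old)
    return delta, delta >= threshold


if __name__ == "__main__":
    rng = random.Random(42)
    old_train = simulate(200, [0.5, 0.2], rng)
    old_holdout = simulate(200, [0.5, 0.2], rng)
    # Hypothetical new version: the first event type became more expensive.
    new = simulate(200, [0.8, 0.2], rng)
    print(detect_regression(old_train, old_holdout, new))
```

Because the model is conditioned on the logged activities rather than on a fixed test suite, the old and new versions do not need to run under the same workload. Only the relationship between activities and CPU usage is compared.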