How are distributed bugs diagnosed and fixed through system logs?

Information and Software Technology, Volume 119, Page 106234, 2020
Wei Yuan (1, 2, 3), Shan Lu (2), Hailong Sun (1, 3), Xudong Liu (1, 3)
(1) SKLSDE Lab, School of Computer Science and Engineering, Beihang University, Beijing, China 100191
(2) University of Chicago, Chicago, USA
(3) Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing, China 100191
