How are distributed bugs diagnosed and fixed through system logs?
References
Wong, 2009, A Survey of Software Fault Localization
Lloyd’s, 2018, Cloud Down - The impacts on the US economy, https://www.lloyds.com/clouddown
Bailis, 2017, Research for practice: tracing and debugging distributed systems; programming by examples, Commun. ACM, 60, 46, 10.1145/3052942
Zhang, 2017, Pensieve: non-intrusive failure reproduction for distributed systems using the event chaining approach, 19
Leesatapornwongsa, 2016, TaxDC: a taxonomy of non-deterministic concurrency bugs in datacenter distributed systems, 517
Beschastnikh, 2016, Debugging distributed systems: challenges and options for validation and debugging, Commun. ACM, 59, 32, 10.1145/2909480
Liu, 2008, D3S: debugging deployed distributed systems, vol. 8, 423
Zhao, 2014, lprof: a non-intrusive request flow profiler for distributed systems, vol. 14, 629
Zhao, 2016, Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle, 603
Fu, 2009, Execution anomaly detection in distributed systems through unstructured log analysis, 149
Yuan, 2010, SherLog: error diagnosis by connecting clues from run-time logs, vol. 38, 143
Nagaraj, 2012, Structured comparative analysis of systems logs to diagnose performance problems
Gunawi, 2014, What bugs live in the cloud? A study of 3000+ issues in cloud systems, 1
Yuan, 2014, Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems, 249
Lu, 2008, Learning from mistakes: a comprehensive study on real world concurrency bug characteristics, 329
Tan, 2014, Bug characteristics in open source software, Empir. Softw. Eng., 19, 1665, 10.1007/s10664-013-9258-8
Dai, 2018, Understanding real-world timeout problems in cloud server systems, 1
Laprie, 1995, Dependable computing: concepts, limits, challenges, 42
Suminto, 2015, Towards pre-deployment detection of performance failures in cloud distributed systems
Zhang, 2019, Understanding and statically detecting synchronization performance bugs in distributed cloud systems, IEEE Access
Gao, 2018, An empirical study on crash recovery bugs in large-scale distributed systems, 539
Alquraan, 2018, An analysis of network-partitioning failures in cloud systems
Chmiel, 2004, Debugging: from novice to expert, ACM SIGCSE Bull., 36, 17, 10.1145/1028174.971310
Dean, 2009, Designs, lessons and advice from building large distributed systems, vol. 1
Mesbahi, 2017, Cloud dependability analysis: characterizing google cluster infrastructure reliability, 56
Sinha, 2009, Fault localization and repair for java runtime exceptions, 153
Wong, 2014, Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis, 181
Wu, 2014, CrashLocator: locating crashing faults based on crash stacks, 204
Moreno, 2014, On the use of stack traces to improve text retrieval-based bug localization, 151
Wang, 2018, Understanding and auto-adjusting performance-sensitive configurations
Xu, 2016, Early detection of configuration errors to reduce failure damage, 619
He, 2016, Experience report: System log analysis for anomaly detection, 207
Chen, 2004, Failure diagnosis using decision trees, 36
Liang, 2007, Failure prediction in IBM BlueGene/L event logs, 583
Bodik, 2010, Fingerprinting the datacenter: automated classification of performance crises, 111
Xu, 2009, Detecting large-scale system problems by mining console logs, 117
Lou, 2010, Mining invariants from console logs for system problem detection
Lin, 2016, Log clustering based problem identification for online service systems, 102
Ding, 2015, Log2: a cost-aware logging mechanism for performance diagnosis, 139
Chhajed, 2015
Stearley, 2010, Bridging the gaps: joining information sources with Splunk
Shang, 2012, Bridging the divide between software developers and operators using logs, 1583