LADRA: Log-based abnormal task detection and root-cause analysis in big data processing with Spark

Future Generation Computer Systems - Volume 95 - Pages 392-403 - 2019
Siyang Lu1, Xiang Wei1,2, Bingbing Rao1, Byungchul Tak3, Long Wang4, Liqiang Wang1
1Department of Computer Science, University of Central Florida, Orlando, FL, USA
2School of Software Engineering, Beijing Jiaotong University, China
3Department of Computer Science and Engineering, Kyungpook National University, Republic of Korea
4IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
