Performance models of data parallel DAG workflows for large scale data analytics

Springer Science and Business Media LLC - Volume 41 - Pages 299-329 - 2023
Juwei Shi1, Jiaheng Lu2
1Microsoft STCA, Beijing, China
2University of Helsinki, Helsinki, Finland

Abstract

Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. Modeling the performance of a DAG on data-parallel frameworks (e.g., MapReduce) is a research challenge because the allocation of preemptable system resources among parallel jobs may vary dynamically during execution, which makes it difficult to estimate execution time accurately. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), which estimates the allocation of preemptable resources by identifying the bottleneck resource and uses it to predict task execution time. For a DAG workflow, we propose a state-based approach that iteratively applies the resource allocation property among stages to estimate the overall execution plan. Furthermore, to handle the skewness of various jobs, we refine the model with order statistics theory to improve estimation accuracy. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms state-of-the-art models by a factor of five for task execution time estimation. With the refined skew-aware model, the average prediction error is under 3% when estimating the execution time of 51 hybrid analytics (HiBench) and query (TPC-H) DAG workflows.
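To make the two modeling ideas in the abstract concrete, the following minimal Python sketch illustrates them under stated assumptions; it is not the paper's implementation, and the resource names, rates, and the Gaussian expected-maximum bound are introduced here purely for illustration. It estimates a task's time as the maximum of its per-resource service times (the bottleneck resource), then refines the completion time of a wave of parallel tasks with an order-statistics estimate of the slowest task.

import math

# Minimal sketch, not the paper's implementation. Resource names, rates, and the
# Gaussian expected-maximum bound below are illustrative assumptions.

def boe_task_time(demand, allocation):
    """Bottleneck-oriented estimate: a task is assumed to be limited by the
    resource with the largest demand-to-allocation ratio (CPU, disk, or network),
    so its service time is the maximum of the per-resource service times."""
    return max(demand[r] / allocation[r] for r in demand)

def skew_aware_wave_time(mean_time, std_time, tasks_per_wave):
    """Skew-aware refinement: a wave of parallel tasks finishes with its slowest
    task, so estimate the expected maximum of n task times. Here the Gaussian
    bound E[max] <= mu + sigma * sqrt(2 ln n) serves as a simple proxy."""
    n = max(tasks_per_wave, 1)
    return mean_time + std_time * math.sqrt(2.0 * math.log(n))

# Hypothetical map task: 0.8 s of CPU work, reads 128 MB, shuffles 32 MB,
# with per-slot allocations of 1 CPU, 100 MB/s disk, and 40 MB/s network.
demand = {"cpu_s": 0.8, "disk_mb": 128.0, "net_mb": 32.0}
allocation = {"cpu_s": 1.0, "disk_mb": 100.0, "net_mb": 40.0}

task_time = boe_task_time(demand, allocation)         # disk-bound: 1.28 s
wave_time = skew_aware_wave_time(task_time, 0.2, 64)  # slowest of 64 tasks
print(f"task ~{task_time:.2f}s, wave of 64 ~{wave_time:.2f}s")

Note that the sketch treats the per-slot allocations as fixed, whereas in the paper the allocation of preemptable resources itself changes across stages of the DAG, which is what the state-based estimation handles.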
