A Profile-Based AI-Assisted Dynamic Scheduling Approach for Heterogeneous Architectures

Tongsheng Geng1, Marcos Amaris2, Stéphane Zuckerman3, Alfredo Goldman4, Guang R. Gao5, Jean-Luc Gaudiot1
1University of California, Irvine, Irvine, USA
2Federal University of Pará, Tucuruí, Brazil
3Laboratoire ETIS, UMR 8051, CY Cergy Paris Universités, ENSEA, CNRS, Cergy, France
4University of São Paulo, São Paulo, Brazil
5University of Delaware, Newark, USA

Abstract

While heterogeneous architectures are increasingly popular in High-Performance Computing systems, their effectiveness depends on how efficiently the scheduler allocates workloads onto the appropriate computing devices and how well communication can be overlapped with computation. With different types of resources integrated into one system, the complexity of the scheduler increases correspondingly. Moreover, for applications with varying problem sizes running on different heterogeneous resources, the optimal scheduling approach may vary accordingly. Thus, we introduce PDAWL, a Profile-based AI-assisted Dynamic scheduling approach that dynamically and adaptively adjusts workloads to efficiently utilize heterogeneous resources. It combines online scheduling, application profile information, hardware mathematical modeling, and offline machine-learning estimation modeling to implement automatic, application- and device-specific scheduling for heterogeneous architectures. A hardware mathematical model provides coarse-grain computing-resource selection, the profile information and the offline machine-learning model together estimate the performance of fine-grain workloads, and an online scheduling approach dynamically and adaptively distributes the workload. Our scheduling approach is evaluated in an event-driven runtime system on control-regular applications, 2D and 3D Stencil kernels (based on a Jacobi algorithm), and on a data-irregular application, Sparse Matrix-Vector Multiplication. Experimental results show that PDAWL either matches or far outperforms the better of CPU-only and GPU-only execution.
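To make the described mechanism concrete, the following is a minimal Python sketch of the scheduling loop the abstract outlines: an offline-trained estimator seeds a coarse-grain CPU/GPU split, and an online loop adaptively rebalances the split from measured execution times. All names (estimate_time, run_chunk, pdawl_like_schedule) and the toy linear cost model are hypothetical illustrations, not the authors' actual implementation.

```python
# Hedged sketch of a PDAWL-style scheduling loop. The cost model and all
# helper names are assumptions made for illustration only.
import random

# Offline part: stand-in for the machine-learning estimator. Here it is a
# toy linear model mapping workload size to predicted seconds per device.
RATE = {"cpu": 2.0e6, "gpu": 1.2e7}  # rows/second, assumed constants


def estimate_time(device: str, rows: int) -> float:
    """Predicted execution time (seconds) for `rows` rows on `device`."""
    return rows / RATE[device]


def run_chunk(device: str, rows: int) -> float:
    """Stand-in for real execution. A real runtime would launch the GPU
    kernel (or CPU loop) and time it; here we perturb the prediction
    with noise to simulate a measured time."""
    return estimate_time(device, rows) * random.uniform(0.8, 1.2)


def pdawl_like_schedule(total_rows: int, iterations: int) -> None:
    # Coarse-grain seeding from the model: the faster device gets more work.
    t_cpu = estimate_time("cpu", total_rows)
    t_gpu = estimate_time("gpu", total_rows)
    cpu_share = t_gpu / (t_cpu + t_gpu)
    for it in range(iterations):
        cpu_rows = max(1, int(total_rows * cpu_share))
        gpu_rows = total_rows - cpu_rows
        m_cpu = run_chunk("cpu", cpu_rows)
        m_gpu = run_chunk("gpu", gpu_rows)
        # Fine-grain online adaptation: rebalance by measured throughput,
        # clamped so neither device is ever starved completely.
        tp_cpu = cpu_rows / m_cpu
        tp_gpu = gpu_rows / m_gpu
        cpu_share = min(0.95, max(0.05, tp_cpu / (tp_cpu + tp_gpu)))
        print(f"iter {it}: cpu={cpu_rows} gpu={gpu_rows} "
              f"times=({m_cpu:.4f}s, {m_gpu:.4f}s) "
              f"next cpu share={cpu_share:.3f}")


if __name__ == "__main__":
    pdawl_like_schedule(total_rows=10_000_000, iterations=5)
```

In this sketch the split converges toward the ratio of the devices' measured throughputs, which is the intuition behind combining an offline estimator (to start close to the optimum) with online adaptation (to absorb estimation error and runtime variability).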
