Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters

Jesús Cámara1, Javier Cuenca1, Domingo Giménez2
1Department of Engineering and Technology of Computers, University of Murcia, Murcia, Spain
2Department of Computing and Systems, University of Murcia, Murcia, Spain

Abstract

Keywords


References

Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P, Tomov S (2009) Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J Phys: Conf Ser 180(1):012037

Ansel J, Kamil S, Veeramachaneni K, Ragan-Kelley J, Bosboom J, O’Reilly U-M, Amarasinghe S (2014) OpenTuner: An extensible framework for program autotuning. In: 23rd International Conference on Parallel Architectures and Compilation Techniques. Edmonton, Canada, ACM, pp 303–316

Augonnet C, Thibault S, Namyst R, Wacrenier P-A (2011) StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr Comput: Pract Exp 23(2):187–198

Batory D (1992) The design and implementation of hierarchical software systems with reusable components. ACM Trans Softw Eng Methodol 1:355–398

Bernabé G, Cuenca J, García L-P, Giménez D (2015) Auto-tuning techniques for linear algebra routines on hybrid platforms. J Comput Sci 10:299–310

Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra JJ, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC (1997) ScaLAPACK user’s guide. Society for Industrial and Applied Mathematics, Philadelphia

Cámara J, Cuenca J, Giménez D (2019) Hierarchical automatic optimization of high and medium level linear algebra routines. In: 18th International Conference on Computational and Mathematical Methods in Science and Engineering

Chameleon: Dense linear algebra subroutines for heterogeneous and distributed architectures. https://gitlab.inria.fr/solverstack/chameleon. Accessed Sept 2019

cuBLAS. http://docs.nvidia.com/cuda/cublas/. Accessed Sept 2019

Cuenca J, García L-P, Giménez D, Herrera F-J (2017) Guided installation of basic linear algebra routines in a cluster with manycore components. Concurr Comput: Pract Exp 29(15):e4112

Dackland K, Kågström B (1996) A hierarchical approach for performance analysis of ScaLAPACK-based routines using the distributed linear algebra machine. In: Applied Parallel Computing, Industrial Computation and Optimization, Third International Workshop, PARA96. Lyngby, Denmark, pp 186–195

Fatica M (2009) Accelerating Linpack with CUDA on heterogenous clusters. In: 2nd Workshop on General Purpose Processing on Graphics Processing Units. ACM, New York, NY, USA, pp 46–51

Golub G, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore

Goto K, van de Geijn RA (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25

Hasanov K, Quintin J-N, Lastovetsky AL (2015) Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms. J Supercomput 71(11):3991–4014

Intel MKL. http://software.intel.com/en-us/intel-mkl/. Accessed Sept 2019

Ohshima S, Kise K, Katagiri T, Yuba T (2007) Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In: 7th International Conference on High Performance Computing for Computational Science. Springer-Verlag, pp 305–318

Pfaffe P, Grosser T, Tillmann M (2019) Efficient hierarchical online-autotuning: A case study on polyhedral accelerator mapping. In: Proceedings of the ACM International Conference on Supercomputing, ICS ’19, New York, USA, ACM, pp 354–366

PLASMA. http://icl.cs.utk.edu/plasma/. Accessed Sept 2019

Porterfield A, Bhalachandra S, Wang W, Fowler R (2016) Variability: a tuning headache. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 1069–1072

Stanisic L, Thibault S, Legrand A, Videau B, Méhaut J-F (2015) Faithful performance prediction of a dynamic task-based runtime system for heterogeneous multi-core architectures. Concurr Comput: Pract Exp 27(16):4075–4090

Williams S, Oliker L, Carter J, Shalf J (2011) Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, New York, USA, ACM, pp 1–12

Yokota R, Barba L (2012) Hierarchical N-body simulations with autotuning for heterogeneous systems. Comput Sci Eng 14(3):30–39

Zhong Z, Rychkov V, Lastovetsky AL (2015) Data partitioning on multicore and multi-GPU platforms using functional performance models. IEEE Trans Comput 64(9):2506–2518