Low occupancy high performance elemental products in assembly free FEM on GPU

Engineering with Computers - Volume 38 - Pages 2189-2204 - 2021
Nileshchandra K. Pikle¹, Shailesh R. Sathe², Arvind Y. Vyavahare³
¹ School of Computer Science and Engineering, Vellore Institute of Technology, Amravati, India
² Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
³ Department of Applied Mechanics, Visvesvaraya National Institute of Technology, Nagpur, India

Abstract

Assembly-free FEM bypasses the global assembly step and solves the system of linear equations at the element level using a Conjugate Gradient (CG) type iterative solver. The small dense matrix-vector products (MvPs) are encapsulated within the CG solver and are computed either at the element level or at the degree-of-freedom (DoF) level. Both strategies exploit the computing power of the GPU effectively, but their performance lags because of uncoalesced global memory access. This paper proposes an improved MvP strategy for assembly-free FEM that improves performance through coalesced global memory access, achieved by using the faster on-chip shared memory, and through use of the texture cache. Because a GPU has limited shared memory (a few KB), the proposed technique suffers from a problem known as low occupancy. Despite the low occupancy, the proposed strategy outperforms both the element-based and the DoF-based MvP strategies on the GPU. In numerical experiments, the GPU implementation of the proposed MvP outperforms the element-level and DoF-level strategies by factors of approximately 7 and 1.5, respectively.
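
To make the idea of an assembly-free solve concrete, the following is a minimal CPU-side sketch in pure Python, not the GPU kernel described in the paper. It assumes a 1D chain of two-node bar elements, and every function name in it (`element_stiffness`, `ebe_matvec`, `constrained_matvec`, `conjugate_gradient`) is hypothetical. It shows the element-level variant only: the product y = Kx is accumulated from small dense element products, and CG sees the operator only through that product. The assembled version is included purely as a reference for comparison.

```python
# Sketch only: CPU-side, pure Python, 1D two-node bar elements (assumed model).
# It illustrates the element-level MvP inside CG, not the paper's GPU strategy.

def element_stiffness(k=1.0):
    """2x2 stiffness matrix of one two-node bar element with stiffness k."""
    return [[k, -k], [-k, k]]


def ebe_matvec(connectivity, k_elems, x):
    """Return y = K x without ever forming the global matrix K."""
    y = [0.0] * len(x)
    for nodes, ke in zip(connectivity, k_elems):
        xe = [x[n] for n in nodes]            # gather this element's DoFs
        for i, row in enumerate(ke):          # small dense element MvP
            # scatter-add the element result into the global vector
            y[nodes[i]] += sum(kij * xj for kij, xj in zip(row, xe))
    return y


def assembled_matvec(connectivity, k_elems, x):
    """Reference path: assemble the global K explicitly, then multiply."""
    n = len(x)
    K = [[0.0] * n for _ in range(n)]
    for nodes, ke in zip(connectivity, k_elems):
        for i, a in enumerate(nodes):
            for j, b in enumerate(nodes):
                K[a][b] += ke[i][j]
    return [sum(K[a][b] * x[b] for b in range(n)) for a in range(n)]


def constrained_matvec(connectivity, k_elems, x, fixed):
    """Element-level product restricted to free DoFs (fixed DoFs masked to 0)."""
    xm = [0.0 if i in fixed else v for i, v in enumerate(x)]
    y = ebe_matvec(connectivity, k_elems, xm)
    return [0.0 if i in fixed else v for i, v in enumerate(y)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(matvec, b, tol=1e-12, max_iter=100):
    """Plain (unpreconditioned) CG; the operator is supplied only as matvec."""
    x = [0.0] * len(b)
    r = list(b)                               # residual for the zero initial guess
    p = list(b)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)                        # the only place K is "applied"
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Four unit-stiffness elements in a chain, node 0 fixed, unit load at node 4.
connectivity = [(e, e + 1) for e in range(4)]
k_elems = [element_stiffness(1.0) for _ in connectivity]
load = [0.0, 0.0, 0.0, 0.0, 1.0]
u = conjugate_gradient(
    lambda v: constrained_matvec(connectivity, k_elems, v, fixed={0}), load
)
```

For this chain the displacement at node i should come out as approximately i, and `ebe_matvec` should agree with `assembled_matvec` for any input vector. On a GPU the loop over elements would be distributed across threads and the scatter-add would need care to avoid write conflicts. The sketch does not model the shared-memory staging, texture-cache use, or occupancy trade-off that the abstract describes.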
