Towards dense linear algebra for hybrid GPU accelerated manycore systems
References
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK Users' Guide, third ed., SIAM, 1999.
M. Baboulin, J. Dongarra, S. Tomov, Some issues in dense linear algebra for multicore and special purpose architectures, Technical Report UT-CS-08-615, University of Tennessee, 2008, LAPACK Working Note 200.
G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing communication in linear algebra, Technical Report, LAPACK Working Note 218, May 2009.
S. Barrachina, M. Castillo, F. Igual, R. Mayo, E. Quintana-Ortí, Solving dense linear systems on graphics processors, Technical Report ICC 02-02-2008, Universidad Jaime I, February, 2008.
A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, S. Tomov, The impact of multicore on math software, PARA 2006, in: B. Kågström et al. (Eds.), Lecture Notes in Computer Science, vol. 4699, Springer, 2007, pp. 1–10.
A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, S. Tomov, Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy, ACM Trans. Math. Software 34 (4) (2008). doi:10.1145/1377596.1377597.
A. Buttari, J. Langou, J. Kurzak, J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Technical Report UT-CS-07-600, University of Tennessee, 2007, LAPACK Working Note 191.
J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, E. Riedy, C. Vömel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, S. Tomov, Prospectus for the next LAPACK and ScaLAPACK libraries, in: PARA’06: State-of-the-Art in Scientific and Parallel Computing (Umeå, Sweden), High Performance Computing Center North (HPC2N) and the Department of Computing Science, Umeå University, Springer, June 2006.
J. Demmel, L. Grigori, M. Hoemmen, J. Langou, Communication-avoiding parallel and sequential QR factorizations, CoRR abs/0806.2159, 2008.
J. Dongarra, P. Luszczek, A. Petitet, The LINPACK benchmark: past, present, and future, Concurrency and Computation: Practice and Experience 15 (2003) 803–820. doi:10.1002/cpe.728.
J. Dongarra, S. Moore, G. Peterson, S. Tomov, J. Allred, V. Natoli, D. Richie, Exploring new architectures in accelerating CFD for air force applications, in: Proceedings of HPCMP Users Group Conference 2008, July 14–17, 2008. <http://www.cs.utk.edu/~tomov/ugc2008_final.pdf>.
K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix–matrix multiplication, in: HWWS ’04: Proceedings of the ACM SIGGRAPH/Eurographics Conference on Graphics Hardware (New York, NY, USA), ACM, 2004, pp. 133–137.
M. Fatica, Accelerating LINPACK with CUDA on heterogeneous clusters, in: GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (New York, NY, USA), ACM, 2009, pp. 46–51.
N. Galoppo, N. Govindaraju, M. Henson, D. Manocha, LU-GPU: efficient algorithms for solving dense linear systems on graphics hardware, in: SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (Washington, DC, USA), IEEE Computer Society, 2005, p. 3.
L. Grigori, J. Demmel, H. Xiang, Communication avoiding Gaussian elimination, Technical Report 6523, INRIA, 2008.
W. Gruener, Larrabee, CUDA and the quest for the free lunch, TGDaily, August 2008. <http://www.tgdaily.com/content/view/38750/113/>.
N.J. Higham, Accuracy and Stability of Numerical Algorithms, second ed., SIAM, Philadelphia, PA, 2002.
J. Hruska, AMD Fusion now pushed back to 2011, Ars Technica, 2008.
B. Kågström, P. Ling, C. Van Loan, GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, ACM Trans. Math. Software 24 (3) (1998) 268–302. doi:10.1145/292395.292412.
Julie Langou, Julien Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Dongarra, Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), in: SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (New York, NY, USA), ACM, 2006, p. 113.
Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, Technical Report, LAPACK Working Note 212, January 2009.
NVIDIA, NVIDIA Tesla doubles the performance for CUDA developers, Computer Graphics World, June 30, 2008.
NVIDIA, NVIDIA CUDA Programming Guide, Version 2.0, June 7, 2008.
J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, J.C. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879–899. doi:10.1109/JPROC.2008.917757.
J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum 26 (1) (2007) 80–113. doi:10.1111/j.1467-8659.2007.01012.x.
D. Parker, Random butterfly transformations with applications in computational linear algebra, Technical Report CSD-950023, Computer Science Department, UCLA, 1995.
D. Parker, B. Pierce, The randomizing FFT: an alternative to pivoting in Gaussian elimination, Technical Report CSD-950037, Computer Science Department, UCLA, 1995.
M. Pharr, R. Fernando (Eds.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Addison-Wesley, 2005.
G. Quintana-Ortí, F. Igual, E. Quintana-Ortí, R. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, in: PPoPP ’09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA), ACM, 2009, pp. 121–130.
G. Quintana-Ortí, E. Quintana-Ortí, E. Chan, F. van Zee, R. van de Geijn, Programming algorithms-by-blocks for matrix computations on multithreaded architectures, Technical Report TR-08-04, University of Texas at Austin, 2008, FLAME Working Note 29.
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph. 27 (3) (2008). doi:10.1145/1360612.1360617.
S. Tomov, M. Baboulin, J. Dongarra, S. Moore, V. Natoli, G. Peterson, D. Richie, Special-purpose hardware and algorithms for accelerating dense linear algebra, in: Parallel Processing for Scientific Computing, Atlanta, March 12–14, 2008. <http://www.cs.utk.edu/~tomov/PP08_Tomov.pdf>.
S. Tomov, J. Dongarra, Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing, Technical Report, LAPACK Working Note 219, May 2009.
V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Piscataway, NJ, USA), IEEE Press, 2008, pp. 1–11.
V. Volkov, J. Demmel, LU, QR and Cholesky factorizations using vector capabilities of GPUs, Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.
V. Volkov, Using GPUs to accelerate linear algebra routines, Poster at PAR Lab Winter Retreat, January 9, 2008. <http://www.eecs.berkeley.edu/~volkov/volkov08-parlab.pdf>.
General-purpose computation using graphics hardware, <http://www.gpgpu.org>.
NVIDIA CUDA Zone, NVIDIA. <http://www.nvidia.com/object/cuda_home.html>.