Parallelizing dense and banded linear algebra libraries using SMPSs

Concurrency Computation Practice and Experience - Tập 21 Số 18 - Trang 2438-2456 - 2009
Rosa M. Badía1,2, José R. Herrero3, Jesús Labarta1, J. M. Pérez1, Enrique S. Quintana–Ort́ı4, Gregorio Quintana‐Ortí4
1Barcelona Supercomputing Center, Centro Nacional de Supercomputación (BSC-CNS) and Universitat Politècnica de Catalunya, Nexus II Building, C. Jordi Girona 29, 08034 Barcelona, Spain
2Consejo Superior de Investigaciones Científicas (CSIC), Spain
3Departmento de Arquitectura de Computadores, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
4Departmento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071 Castellón, Spain

Tóm tắt

AbstractThe promise of future many‐core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP‐like pragmas and a run‐time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS‐level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column‐major order, as employed by this library, needs to be abandoned in benefit of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run‐time. The parallelization of FLAME routines using SMPSs is simpler as this library includes blocked algorithms (or algorithms‐by‐blocks in the FLAME argot) for most operations and storage‐by‐blocks (or block data layout) is already in place. Copyright © 2009 John Wiley & Sons, Ltd.

Từ khóa


Tài liệu tham khảo

Anderson E, 1992, LAPACK Users' Guide

10.1002/cpe.1301

10.1016/j.parco.2008.10.002

ChanE Quintana‐OrtíES Quintana‐OrtíG van de GeijnR.Super matrix out‐of‐order scheduling of matrix operations for SMP and multi‐core architectures. Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2007) San Diego CA U.S.A. 9–11 June 2007;116–125.

ChanE Van ZeeFG Quintana‐OrtíES Quintana‐OrtíG van de GeijnR.Satisfying your dependencies with super matrix. Proceedings of IEEE Cluster Computing 2007 September 2007;91–99.

Quintana‐OrtíG Quintana‐OrtíES ChanE van de GeijnR Van ZeeFG.Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. Workshop on Multithreaded Architectures and Applications—MTAAP 2008 2008. CD‐ROM.

Quintana‐OrtíG Quintana‐OrtíES ChanE Van ZeeFG van de GeijnRA.Scheduling of QR factorization algorithms on SMP and multi‐core architectures. In 16th Euromicro International Conference on Parallel Distributed and Network‐based Processing—PDP 2008 El Baz FSD Bourgeois J (eds.). 2008;301–310.

ChanE Van ZeeFG BientinesiP Quintana‐OrtíES Quintana‐OrtíG van de GeijnR.Super matrix: A multithreaded runtime scheduling system for algorithms‐by‐blocks. ACM SIGPLAN 2008 Symposium on Principles and Practices of Parallel Programming ( PPoPP'08) 2008;123–132.

Quintana‐OrtíG Quintana‐OrtíES RemónA van de GeijnR.Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR‐07‐51 The University of Texas at Austin Department of Computer Sciences September2007.

10.1006/jpdc.1996.0107

10.1145/1188455.1188546

Pérez JM, 2007, CellSs programming the Cell/B.E. made easier, IBM Journal of Research and Development, 51

PérezJM BadiaRM LabartaJ.A flexible and portable programming model for SMP and multi‐cores. Technical Report 03/2007 Barcelona Supercomputing Center—Centro Nacional de Supercomputación Barcelona Spain 2007.

PérezJM BadiaRM LabartaJ.A dependency‐aware task‐based programming environment for multi‐core architectures. Proceedings of the 2008 IEEE International Conference on Cluster Computing Causal Productions (ed.). September 2008;142–151. IEEE Catalog Number CFP08235‐CDR.

Golub GH, 1996, Matrix Computations

10.1145/504210.504213

10.1145/1055531.1055532

10.1145/1055531.1055533

StrazdinsP.A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR‐CS‐98‐07 Department of Computer Science The Australian National University Canberra 0200 ACT Australia 1998.

10.1145/292395.292412

10.1145/292395.292426

10.1145/1055531.1055534

Joffrain T, 2005, Proceedings of PARA 2004, 413

10.1145/1377612.1377615

Gustavson FG, 2000, The Architecture of Scientific Software, 211

10.1109/TPDS.2003.1214317

HerreroJR.A framework for efficient execution of matrix computations. PhD Thesis Polytechnic University of Catalonia Spain 2006.

LowTM van de GejinR.An API for manipulating matrices stored by blocks. Technical Report TR‐2004‐15 Department of Computer Sciences The University of Texas at Austin May2004.