Parallelization of Control Recurrences for ILP Processors

Michael Schlansker1, Vinod Kathail1, Sadun Anik1
1Computer Research Center, Hewlett-Packard Laboratories, Palo Alto, USA

Tóm tắt

The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. The utility of these techniques can be demonstrated in an increasingly broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper used in combination with prior work on reducing the height of data dependences provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.

Tài liệu tham khảo

J. Ferrante, K. Ottenstein, and J. Warren, The program dependence graph and its use in optimization, ACM Trans. on Programming Language Systems 9(3):319–349 (1987). A. Nicolau and J. A. Fisher, Measuring the parallelism available for very long instruction word architectures, IEEE Trans. on Computers C-33(11):968–976 (1984). N. P. Jouppi and D. Wall, Available instruction level parallelism for superscalar and superpiplined machines, Proc. of the Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, pp. 272–282 (1989). D. W. Wall, Limits of instruction-level paralleiism, Proc. of the Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, pp. 176–188 (1991). M. S. Lam and R. P. Wilson, Limits of control flow on parallelism, Proc. of the 19th Int. Symp. on Computer Architecture, Gold Coast, Australia, pp. 46–57 (1992). D. J. Kuck, The Structure of Computers and Computations, Wiley, New York, 1978. M. Schlansker and V. Kathail, Acceleration of first and higher order recurrences on processors with instruction level parallelism, In Sixth Int. Workshop on Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., Springer-Verlag, pp. 406–429 (1993). J. A. Fisher, Trace scheduling: A technique for global microcode compaction, IEEE Trans. on Computers C-30, 7:478–490 (1981). P. Tirumalai, M. Lee, and M. S. Schlansker, Parallelization of loops with exits on pipelined architectures, Proc. of the Super Computing ’90, New York, pp. 200–212 (1990). W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellete, R. E. Hank, T. Kiyohara, G. E. Habb, J. G. Holm, and D. M. Lavery, The Superblock: An effective technique for VLIW and superscalar compilation, The Journal of Super Computing 7(1/2):229–248 (1993). S. A. Mahlke, W. Y. Chen, R. A. Bringmann, R. E. Hank, W. W. Hwu, B. R. Rau, and M. S. Schlansker, Sentinel scheduling: A model for compiler-controlled speculative execution. ACM Trans. on Computer Systems 11(4):376–408 (1993). J. A. Fisher, Global code generation for instruction-level parallelism: trace scheduling-2, Technical Report HPL-93-43, Hewlett-Packard Laboratories, Palo Alto, California, 1993. A. Nicolau, Parallelism, memory-anti-aliasing and correctness for trace scheduling compilers, Ph. D. Thesis. Yale University, 1984. J. R. Ellis, Bulldog: A Compiler for VLIW Architectures MIT Press, Cambridge, Massachussetts (1985). S.-M. Moon and K. Ebcioglu, An efficient resource-constrained global scheduling technique for superscalar and VLIW processors, Proc. of the 25th Ann. Int. Symp. on Microarchitecture, Portland, Oregon (1992). G. Lowney, S. Freudenberger, T. Karzes, W. D. Lichtenstein, R. Nix, J. O’Donnell, and J. Ruttenberg, The Multiflow Trace Scheduling Compiler, The Journal of Super Computing 7(1/2):51–142 (1993). P. Y. T. Hsu and E. S. Davidson, Highly concurrent scalar processing, Proc. of the Thirteenth Ann. Int. Symp. on Computer Architecture, Tokyo, Japan, pp. 386–395 (1986). B. R. Rau, Cydra 5 directed dataflow architecture, Proc. of the COMPCON ’88, San Francisco, California, pp. 106–113 (1988). B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle, The Cydra 5 departmental supercomputer: Design philosophies, decisions and trade-offs, Computer 22(1): 12–35 (1989). J. C. H. Park and M. S. Schlansker, On predicated execution, Technical Report HPL-91-58, Hewlett-Packard Laboratories, Palo Alto California, 1991. V. Kathail, M. S. Schlansker, and B. R. Rau, HPL PlayDoh architecture specification: Version 1.0, Technical Report HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, California (1993). B. R. Rau, Data flow and dependence analysis for instruction level paralleiism, In Fourth Int. Workshop on Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., Springer-Verlag, pp. 236–250 (1992). B. R. Rau and C. D. Glaeser, Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific Computing, Proc. of the Fourteenth Ann. Workshop on Microprogramming, pp. 183–198 (1981). J. C. Dehnert, P. Y. T. Hsu, and J. P. Bratt, Overlapped Loop Support on the Cydra 5, Proc. of the Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, pp. 26–38 (1989). J. C. Dehnert and R. A. Towle, Compiling for the Cydra 5, The Journal of Supercomputing 7(1/2):181–228 (1993). N. J. Warter, S. A. Mahlke, W. W. Hwu, and B. R. Rau, Reverse if-conversion, Proc. of the SIGPLAN ’93 Conf. on Programming Language Design and Implementation, Albuquerque, New Mexico, pp. 290–299 (1993). M. Schlansker, V. Kathail, and S. Anik, Height reduction of control recurrences for ILP processors, Proc. of the 27th Ann. Int. Symp. on Microarchitecture, San Jose, California, pp. 40–51 (1994). B. R. Rau, M. S. Schlansker, and P. P. Tirumalai, Code generation Schemas for modulo scheduled DO-loops and WHILE-loops, Technical Report HPL-92-47, Hewlett-Packard Laboratories, Palo Alto California (1992). F. H. McMahon, The Livermore Fortran Kernels: A computer test of the numerical performance range, Technical Report UCRL-53745, Lawrence Livermore National Laboratory (1986). H.-P. Company, PA-RISC 1.1 Architecture and Instruction Set Reference Manual, Second Edition Hewlett-Packard (1992). S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, Effective compiler support for predicated execution using the hyperblock, Proc. of the 25th Ann. Int. Symp. on Microarchitecture, Portland, Oregon, pp. 45–54 (1992).