PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling
Tóm tắt
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel applications’ performance running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract LLVM basic block labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application’s lifetime. The model uses the memory trace and other parameters from an instrumented sequentially executed binary. We use probabilistic and computationally efficient reuse profiles to predict the cache hit rates and runtimes of OpenMP programs’ parallel sections. We model Intel’s Broadwell, Haswell, and AMD’s Zen2 architectures and validate our framework using different applications from PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.
Tài liệu tham khảo
7-CPU: 7-Zip LZMA Benchmark. https://www.7-cpu.com (2021). [Online; accessed 4-Dec-2020]
Aarno D, Engblom J (2014) Software and system development using virtual platforms: full-system simulation with Wind river simics. Morgan Kaufmann, Burlington
Alexandrov A, Ionescu MF, Schauser KE, Scheiman C (1997) LogGP: incorporating long messages into the LogP model for parallel computation. J Parallel Distrib Comput 44(1):71–79
Arafa Y, Badawy AA, Chennupati G, Santhi N, Eidenbenz S (2019) Ppt-gpu: scalable gpu performance modeling. IEEE Comput Archit Lett 18(1):55–58
Arafa Y, Badawy AH, Chennupati G, Barai A, Santhi N, Eidenbenz S (2020) Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles. In: Proceedings of the 34th ACM International Conference on Supercomputing, ICS ’20. Association for Computing Machinery, New York, NY, USA
Arafa Y, Chennupati G, Barai A, Badawy AHA, Santhi N, Eidenbenz S, (2019) GPUs cache performance estimation using reuse distance analysis. In: 2019 IEEE 38th International Performance of Computing and Communications Conference (IPCCC), Piscataway, NJ, USA, pp 1–8. IEEE
Austin T, Larson E, Ernst D (2002) SimpleScalar: an infrastructure for computer system modeling. Computer 35(2):59–67
Badamo M, Casarona J, Zhao M, Yeung D (2016) Identifying power-efficient multicore cache hierarchies via reuse distance analysis. ACM Trans Comput Syst 34(1):1–30
Badawy AA, Yeung D (2017) Guiding locality optimizations for graph computations via reuse distance analysis. IEEE Comput Archit Lett 16(2):119–122
Badawy AA, Yeung D (2017) Optimizing locality in graph computations using reuse distance profiles. In: 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pp 1–8
Barai A, Chennupati G, Santhi N, Badawy AH, Arafa Y, Eidenbenz S (2020) PPT-SASMM: scalable analytical shared memory model: predicting the performance of multicore caches from a single-threaded execution trace. The International symposium on memory systems. MEMSYS. Association for Computing Machinery. NY, USA, New York, pp 341–351
Berg E, Hagersten E (2004) StatCache: a probabilistic approach to efficient and accurate data locality analysis. In: 2004 IEEE International Symposium on Performance Analysis of Systems and Software—ISPASS , Piscataway, NJ, USA, pp 20–27. IEEE
Berg E, Zeffer H, Hagersten E (2006) A statistical multiprocessor cache model. In: 2006 IEEE International Symposium on Performance Analysis of Systems and Software, Piscataway, NJ, USA, pp 89–99. IEEE
Beyls K, D’Hollander EH (2001) Reuse distance as a metric for cache behavior. In: Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems, Piscataway, NJ, USA pp. 617–622. IEEE
Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. thesis, Princeton University
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S et al (2011) The gem5 simulator. ACM SIGARCH Comput Archit News 39(2):1–7
Brehob M, Enbody R (1999) An analytical model of locality and caching. Technical Report MSU-CSE-99-31
Carlson TE, Heirman W, Eyerman S, Hur I, Eeckhout L (2014) An evaluation of high-level mechanistic core models. ACM Trans Archit Code Optim (TACO) 11(3):1–25
Carothers CD, Meredith JS, Blanco MP, Vetter JS, Mubarak M, LaPre J, Moore S (2017) Durango: scalable synthetic workload generation for extreme-scale application performance modeling and simulation. In: Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’17, pp 97–108. Association for Computing Machinery, New York, NY, USA
Cascaval C, Padua DA (2003) Estimating cache misses and locality using stack distances. In: Proceedings of the 17th Annual International Conference on Supercomputing, ICS ’03, pp 150–159. ACM, New York, NY, USA
Ceballos G, Hagersten E, Black-Schaffer D (2016) Formalizing data locality in task parallel applications. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 43–61
Chennupati G, Santhi N, Bird R, Thulasidasan S, Badawy AHA, Misra S, Eidenbenz S (2018) A scalable analytical memory model for CPU performance prediction. In: Jarvis S, Wright S, Hammond S (eds) High performance computing systems. Performance modeling, benchmarking, and simulation. Springer, Cham, pp 114–135
Chennupati G, Santhi N, Eidenbenz S (2019) Scalable prformance prediction of codes with memory hierarchy and pipelines. In: Proceedings of the 2019 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’19, pp 13–24. Association for Computing Machinery, New York, NY, USA
Chennupati G, Santhi N, Eidenbenz S, Thulasidasan S (2017) An analytical memory hierarchy model for performance prediction. In: Proceedings of the 2017 Winter Simulation Conference, WSC ’17. IEEE Press, Piscataway, NJ, USA
Chennupati G, Santhi N, Eidenbenz S, Zerr RJ, Rosa M, Zamora RJ, Park EJ, Nadiga BT, Liu J, Ahmed K, Obaida MA (2017c) Performance prediction toolkit (PPT). Los Alamos National Laboratory (LANL) . https://github.com/lanl/PPT
Collange S, Daumas M, Defour D, Parello D (2010) Barra: a parallel functional simulator for GPGPU. In: 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp 351–360. IEEE
Cope J, Liu N, Lang S, Carns P, Carothers C, Ross R (2011) CODES: enabling co-design of multilayer exascale storage architectures. In: Proceedings of the Workshop on Emerging Supercomputing Technologies
Culler D, Karp R, Patterson D, Sahay A, Schauser KE, Santos E, Subramonian R, von Eicken T (1993) LogP: towards a realistic model of parallel computation. Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming 28(7):1–12
Dagum L, Menon R (1998) OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng 5(1):46–55
Das S, Aamodt TM, Dally WJ (2015) Reuse distance-based probabilistic cache replacement. ACM Trans Archit Code Optim 12(4):1–22
Davis JD, Laudon J, Olukotun K (2005) Maximizing CMP throughput with mediocre cores. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05, pp 51–62. IEEE Computer Society, USA
De Pestel S, Steen SVd, Akram S, Eeckhout L (2018) RPPM: rapid performance prediction of multithreaded applications on multicore hardware. IEEE Comput Archit Lett 17(2):183–186
Ding C, Chilimbi T (2009) A composable model for analyzing locality of multi-threaded programs. Technical Report, MSR-TR-2009-107, Microoft
Ding C, Xiang X, Bao B, Luo H, Luo YW, Wang XL (2014) Performance metrics and models for shared cache. J Comput Sci Technol 29(4):692–712
Ding C, Zhong Y (2001) Reuse distance analysis. University of Rochester, Rochester, NY, USA, Technical Report
Ding C, Zhong Y (2003) Predicting whole-program locality through reuse distance analysis. Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation ( 38(5):245–257
Dubach C, Jones T, O’Boyle M (2007) Microarchitectural design space exploration using an architecture-centric approach. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 262–271
Duong N, Zhao D, Kim T, Cammarota R, Valero M, Veidenbaum AV (2012) Improving cache management policies using dynamic reuse distances. In: Proceedings of IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pp. 389–400. IEEE, Piscataway, NJ, USA
Eeckhout L (2010) Computer architecture performance evaluation methods. Synth Lect Comput Archit 5(1):1–145
Fog A (2016) Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
Grass T, Allande C, Armejach A, Rico A, Ayguadé E, Labarta J, Valero M, Casas M, Moreto M (2016) MUSA: a multi-level simulation approach for next-generation HPC machines. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16. IEEE Press
Grauer-Gray S, Xu L, Searles R, Ayalasomayajula S, Cavazos J, (2012) Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). IEEE, Piscataway, NJ, USA, pp 1–10
Hardavellas N, Somogyi S, Wenisch TF, Wunderlich RE, Chen S, Kim J, Falsafi B, Hoe JC, Nowatzyk AG (2004) SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of Server Architecture. SIGMETRICS Perform Eval Rev 31(4):31–34
Heywood RIP, Howel F (1995) HASE: a flexible toolset for computer architects. Comput J 38(10):764–775
Hill MD, Marty MR (2008) Amdahl’s law in the multicore era. IEEE Comput 41:33–38
Hughes CJ, Pai VS, Ranganathan P, Adve SV (2002) Rsim: simulating shared-memory multiprocessors with ILP processors. Computer 35(2):40–49
Huh J, Burger D, Keckler SW (2001) Exploring the design space of future CMPs. In: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT ’01, pp 199–210. IEEE Computer Society, USA
\(\ddot{{\rm I}}\)pek E, McKee SA, Caruana R, de Supinski BR, Schulz M (2006) Efficiently exploring architectural design spaces via predictive modeling. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp 195–206. Association for Computing Machinery, New York, NY, USA
Jiang Y, Zhang EZ, Tian K, Shen X (2010) Is reuse distance applicable to data locality analysis on chip multiprocessors? In: Proceedings of the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, CC’10/ETAPS’10, pp 264–282. Springer
Kaxiras S, Young C (2000) Coherence communication prediction in shared-memory multiprocessors. In: IEEE Proceedings Sixth International Symposium on High-performance Computer Architecture. HPCA-6 (Cat. No. PR00550), pp 156–167 Piscataway, NJ, USA. IEEE
Keramidas G, Petoumenos P, Kaxiras S (2007) Cache replacement based on reuse-distance prediction. In: 2007 25th International Conference on Computer Design. IEEE, NY, USA, pp 245–250
Kise K, Katagiri T, Honda H, Yuba, T (2004) The simCore/alpha functional simulator. In: Proceedings of the 2004 workshop on computer architecture education: held in conjunction with the 31st International symposium on computer architecture, pp. 24–es
Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis and transformation. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’04, pp 75–86. IEEE Computer Society, Washington, DC, USA
Lee BC, Collins J, Wang H, Brooks D (2008) CPR: composable performance regression for scalable multiprocessor models. In: 2008 41st IEEE/ACM International Symposium on Microarchitecture, pp 270–281. IEEE
Lee S, Meredith JS, Vetter JS (2015) COMPASS: a framework for automated performance modeling and prediction. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pp 405–414. Association for Computing Machinery, New York, NY, USA
Liao C, Quinlan DJ, Panas T, de Supinski BR (2010) A ROSE-based openMP 3.0 research compiler supporting multiple runtime libraries. In: Proceedings of the 6th International Conference on Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, IWOMP’10, pp 15–28. Springer-Verlag, Berlin, Heidelberg
Maeda RKV, Cai Q, Xu J, Wang Z, Tian Z, (2017) Fast and accurate exploration of multi-level caches using hierarchical reuse distance. In: 2017 IEEE International symposium on high performance computer architecture (HPCA). IEEE, Piscataway, NJ, USA, pp 145–156
Malakar P, Balaprakash P, Vishwanath V, Morozov V, Kumaran K (2018) Benchmarking machine learning methods for performance modeling of scientific applications. In: 2018 IEEE/ACM Performance modeling, benchmarking and simulation of high performance computer systems (PMBS), pp 33–44
Mattson RL, Gecsei J, Slutz DR, Traiger IL (1970) Evaluation techniques for storage hierarchies. IBM Syst J 9(2):78–117
McCurdy C, Fischer C (2005) Using pin as a memory reference generator for multiprocessor simulation. SIGARCH Comput Archit News 33(5):39–44
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices 42(6):89–100
Niu Q, Dinan J, Lu Q, Sadayappan P (2012) PARDA: a fast parallel reuse distance analysis algorithm. In: Proceedings of the 2012 IEEE 26th International parallel and distributed processing symposium, IPDPS ’12, pp 1284–1294. IEEE Computer Society, USA
Obaida MA, Liu J, Chennupati G, Santhi N, Eidenbenz S (2018) Parallel application performance prediction using analysis based models and HPC simulations. In: Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’18, pp 49–59. Association for Computing Machinery, New York, NY, USA
Pakin S, McCormick P (2013) Hardware-independent application characterization. In: 2013 IEEE International Symposium on Workload Characterization (IISWC), pp 111–112
Patel A, Afram F, Chen S, Ghose K (2011) MARSS: a full system simulator for multicore X86 CPUs. In: Proceedings of the 48th Design Automation Conference, DAC ’11, pp 1050–1055. Association for Computing Machinery, New York, NY, USA
Payer M, Kravina E, Gross TR, (2013) Lightweight memory tracing. In: 2013 USENIX Annual Technical Conference (USENIX ATC 13). USENIX Association, San Jose, CA, pp 115–126
Pouchet LN (2012) Polybench: the polyhedral benchmark suite. URL: http://www.cs.ucla.edu/pouchet/software/polybench
Reddi VJ, Settle A, Connors DA, Cohn RS (2004) PIN: a binary instrumentation tool for computer architecture research and education. In: Proceedings of the 2004 Workshop on computer architecture education: Held in Conjunction with the 31st International Symposium on Computer Architecture, WCAE ’04, pp 22–es. Association for Computing Machinery, New York, NY, USA
Rodrigues AF, Hemmert KS, Barrett BW, Kersey C, Oldfield R, Weston M, Risen R, Cook J, Rosenfeld P, Cooper-Balis E, Jacob B (2011) The structural simulation toolkit. SIGMETRICS Perform Eval Rev 38(4):37–42
Sabarimuthu JM, Venkatesh TG (2019) Analytical derivation of concurrent reuse distance profile for multi-threaded application running on chip multi-processor. IEEE Trans Parallel Distrib Syst 30(8):1704–1721
Sanchez D, Kozyrakis C (2013) ZSim: fast and accurate microarchitectural simulation of thousand-core systems. SIGARCH Comput Archit News 41(3):475–486
Santhi N, Eidenbenz S, Liu J (2015) The Simian concept: parallel discrete event simulation with interpreted languages and just-in-time compilation. In: 2015 Winter Simulation Conference (WSC), pp. 3013–3024
Schuff DL, Kulkarni M, Pai VS (2010) Accelerating multicore reuse distance analysis with sampling and parallelization. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, p. 53–64. ACM
Schuff DL, Parsons BS, Pai VS (2010) Multicore-aware reuse distance analysis. In: 2010 IEEE International symposium on parallel and distributed processing. Workshops and Phd Forum (IPDPSW). IEEE, IEEE, Piscataway, NJ, USA, pp 1–8
Sen R, Wood DA (2013) Reuse-based Online Models for Caches. In: Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, pp. 279–292. ACM, New York, NY, USA
Shalf J, Dosanjh S, Morrison J (2011) Exascale computing technology challenges. In: Palma JMLM, Daydé M, Marques O, Lopes JC (eds) High Perform Comput Comput Sci–VECPAR 2010. Springer, Berlin, pp 1–25
Sharkey J, Ponomarev D, Ghose K (2005) Abstract M-SIM: a flexible, multithreaded architectural simulation environment. Technical report, Department of Computer Science, State University of New York at Binghamton
Shen X, Shaw J, Meeker B, Ding C (2007) Locality Approximation Using Time. In: Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL ’07, pp. 55–61. ACM, New York, NY, USA
Shi X, Su F, Peir JK, Xia Y, Yang Z (2009) Modeling and stack simulation of CMP cache capacity and accessibility. IEEE Trans Parallel Distrib Syst 20(12):1752–1763
Spafford KL, Vetter JS (2012) Aspen: A domain specific language for performance modeling. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11
Van den Steen S, Eyerman S, De Pestel S, Mechri M, Carlson T, Black-Schaffer D, Hagersten E, Eeckhout L (2016) Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans Comput 65(12):3537–3551
den Steen SV, Eeckhout L (2018) Modeling superscalar processor memory-level parallelism. IEEE Comput. Archit. Lett 17(1):9–12
Sun G, Hughes CJ, Kim C, Zhao J, Xu C, Xie Y, Chen YK (2011) Moguls: a model to explore the memory hierarchy for bandwidth improvements. SIGARCH Comput Archit News 39(3):377–388
Terpstra D, Jagode H, You H, Dongarra J (2010) Collecting performance data with PAPI-C. Tools for high performance computing 2009. Springer, Berlin, pp 157–173
Thazhuthaveetil M, Vaswani K, Joseph P (2006) Construction and use of linear regression models for processor performance analysis. In: Twelfth international symposium on high-performance computer architecture. IEEE Computer Society, Los Alamitos, CA, USA, pp. 99–108
Unat D, Chan C, Zhang W, Williams S, Bachan J, Bell J, Shalf J (2015) ExaSAT: an exascale co-design tool for performance modeling. Int J High Perform Comput Appl 29(2):209–232
Wu MJ, Yeung D (2012) Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis. In: Proceedings of the 2012 ACM SIGPLAN workshop on memory systems performance and correctness, MSPC ’12, pp. 2–11. Association for Computing Machinery, New York, NY, USA
Wu MJ, Yeung D (2013) Efficient reuse distance analysis of multicore scaling for loop-based parallel programs. ACM Trans Comput Syst 31(1):1–37
Wu MJ, Zhao M, Yeung D (2013) Studying multicore processor scaling via reuse distance analysis. In: Proceedings of the 40th Annual international symposium on computer architecture, ISCA ’13, p. 499–510. Association for Computing Machinery, New York, NY, USA
Wu X, Taylor V (2013) Performance modeling of hybrid MPI/openMP scientific applications on large-scale multicore Supercomputers. J Comput Syst Sci 79(8):1256–1268
Zhao M, Yeung D (2017) Using multicore reuse distance to study coherence directories. ACM Trans Comput Syst (TOCS) 35(2):1–49
Zhong Y, Dropsho SG, Shen X, Studer A, Ding C (2007) Miss rate prediction across program inputs and cache configurations. IEEE Trans Comput 56(3):328–343
Zhong Y, Shen X, Ding C (2009) Program locality analysis using reuse distance. ACM Trans Program Lang Syst 31(6):1–39