PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling

Springer Science and Business Media LLC - Tập 78 - Trang 2354-2385 - 2021

Atanu Barai¹, Yehia Arafa¹, Abdel-Hameed Badawy¹, Gopinath Chennupati², Nandakishore Santhi², Stephan Eidenbenz²

¹New Mexico State University, Las Cruces, USA

²Los Alamos National Laboratory, Los Alamos, USA

Tóm tắt

We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel applications’ performance running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract LLVM basic block labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application’s lifetime. The model uses the memory trace and other parameters from an instrumented sequentially executed binary. We use probabilistic and computationally efficient reuse profiles to predict the cache hit rates and runtimes of OpenMP programs’ parallel sections. We model Intel’s Broadwell, Haswell, and AMD’s Zen2 architectures and validate our framework using different applications from PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.

Tài liệu tham khảo

7-CPU: 7-Zip LZMA Benchmark. https://www.7-cpu.com (2021). [Online; accessed 4-Dec-2020] Aarno D, Engblom J (2014) Software and system development using virtual platforms: full-system simulation with Wind river simics. Morgan Kaufmann, Burlington Alexandrov A, Ionescu MF, Schauser KE, Scheiman C (1997) LogGP: incorporating long messages into the LogP model for parallel computation. J Parallel Distrib Comput 44(1):71–79 Arafa Y, Badawy AA, Chennupati G, Santhi N, Eidenbenz S (2019) Ppt-gpu: scalable gpu performance modeling. IEEE Comput Archit Lett 18(1):55–58 Arafa Y, Badawy AH, Chennupati G, Barai A, Santhi N, Eidenbenz S (2020) Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles. In: Proceedings of the 34th ACM International Conference on Supercomputing, ICS ’20. Association for Computing Machinery, New York, NY, USA Arafa Y, Chennupati G, Barai A, Badawy AHA, Santhi N, Eidenbenz S, (2019) GPUs cache performance estimation using reuse distance analysis. In: 2019 IEEE 38th International Performance of Computing and Communications Conference (IPCCC), Piscataway, NJ, USA, pp 1–8. IEEE Austin T, Larson E, Ernst D (2002) SimpleScalar: an infrastructure for computer system modeling. Computer 35(2):59–67 Badamo M, Casarona J, Zhao M, Yeung D (2016) Identifying power-efficient multicore cache hierarchies via reuse distance analysis. ACM Trans Comput Syst 34(1):1–30 Badawy AA, Yeung D (2017) Guiding locality optimizations for graph computations via reuse distance analysis. IEEE Comput Archit Lett 16(2):119–122 Badawy AA, Yeung D (2017) Optimizing locality in graph computations using reuse distance profiles. In: 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pp 1–8 Barai A, Chennupati G, Santhi N, Badawy AH, Arafa Y, Eidenbenz S (2020) PPT-SASMM: scalable analytical shared memory model: predicting the performance of multicore caches from a single-threaded execution trace. The International symposium on memory systems. MEMSYS. Association for Computing Machinery. NY, USA, New York, pp 341–351 Berg E, Hagersten E (2004) StatCache: a probabilistic approach to efficient and accurate data locality analysis. In: 2004 IEEE International Symposium on Performance Analysis of Systems and Software—ISPASS , Piscataway, NJ, USA, pp 20–27. IEEE Berg E, Zeffer H, Hagersten E (2006) A statistical multiprocessor cache model. In: 2006 IEEE International Symposium on Performance Analysis of Systems and Software, Piscataway, NJ, USA, pp 89–99. IEEE Beyls K, D’Hollander EH (2001) Reuse distance as a metric for cache behavior. In: Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems, Piscataway, NJ, USA pp. 617–622. IEEE Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. thesis, Princeton University Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S et al (2011) The gem5 simulator. ACM SIGARCH Comput Archit News 39(2):1–7 Brehob M, Enbody R (1999) An analytical model of locality and caching. Technical Report MSU-CSE-99-31 Carlson TE, Heirman W, Eyerman S, Hur I, Eeckhout L (2014) An evaluation of high-level mechanistic core models. ACM Trans Archit Code Optim (TACO) 11(3):1–25 Carothers CD, Meredith JS, Blanco MP, Vetter JS, Mubarak M, LaPre J, Moore S (2017) Durango: scalable synthetic workload generation for extreme-scale application performance modeling and simulation. In: Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’17, pp 97–108. Association for Computing Machinery, New York, NY, USA Cascaval C, Padua DA (2003) Estimating cache misses and locality using stack distances. In: Proceedings of the 17th Annual International Conference on Supercomputing, ICS ’03, pp 150–159. ACM, New York, NY, USA Ceballos G, Hagersten E, Black-Schaffer D (2016) Formalizing data locality in task parallel applications. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 43–61 Chennupati G, Santhi N, Bird R, Thulasidasan S, Badawy AHA, Misra S, Eidenbenz S (2018) A scalable analytical memory model for CPU performance prediction. In: Jarvis S, Wright S, Hammond S (eds) High performance computing systems. Performance modeling, benchmarking, and simulation. Springer, Cham, pp 114–135 Chennupati G, Santhi N, Eidenbenz S (2019) Scalable prformance prediction of codes with memory hierarchy and pipelines. In: Proceedings of the 2019 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’19, pp 13–24. Association for Computing Machinery, New York, NY, USA Chennupati G, Santhi N, Eidenbenz S, Thulasidasan S (2017) An analytical memory hierarchy model for performance prediction. In: Proceedings of the 2017 Winter Simulation Conference, WSC ’17. IEEE Press, Piscataway, NJ, USA Chennupati G, Santhi N, Eidenbenz S, Zerr RJ, Rosa M, Zamora RJ, Park EJ, Nadiga BT, Liu J, Ahmed K, Obaida MA (2017c) Performance prediction toolkit (PPT). Los Alamos National Laboratory (LANL) . https://github.com/lanl/PPT Collange S, Daumas M, Defour D, Parello D (2010) Barra: a parallel functional simulator for GPGPU. In: 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp 351–360. IEEE Cope J, Liu N, Lang S, Carns P, Carothers C, Ross R (2011) CODES: enabling co-design of multilayer exascale storage architectures. In: Proceedings of the Workshop on Emerging Supercomputing Technologies Culler D, Karp R, Patterson D, Sahay A, Schauser KE, Santos E, Subramonian R, von Eicken T (1993) LogP: towards a realistic model of parallel computation. Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming 28(7):1–12 Dagum L, Menon R (1998) OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng 5(1):46–55 Das S, Aamodt TM, Dally WJ (2015) Reuse distance-based probabilistic cache replacement. ACM Trans Archit Code Optim 12(4):1–22 Davis JD, Laudon J, Olukotun K (2005) Maximizing CMP throughput with mediocre cores. In: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT ’05, pp 51–62. IEEE Computer Society, USA De Pestel S, Steen SVd, Akram S, Eeckhout L (2018) RPPM: rapid performance prediction of multithreaded applications on multicore hardware. IEEE Comput Archit Lett 17(2):183–186 Ding C, Chilimbi T (2009) A composable model for analyzing locality of multi-threaded programs. Technical Report, MSR-TR-2009-107, Microoft Ding C, Xiang X, Bao B, Luo H, Luo YW, Wang XL (2014) Performance metrics and models for shared cache. J Comput Sci Technol 29(4):692–712 Ding C, Zhong Y (2001) Reuse distance analysis. University of Rochester, Rochester, NY, USA, Technical Report Ding C, Zhong Y (2003) Predicting whole-program locality through reuse distance analysis. Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation ( 38(5):245–257 Dubach C, Jones T, O’Boyle M (2007) Microarchitectural design space exploration using an architecture-centric approach. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 262–271 Duong N, Zhao D, Kim T, Cammarota R, Valero M, Veidenbaum AV (2012) Improving cache management policies using dynamic reuse distances. In: Proceedings of IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pp. 389–400. IEEE, Piscataway, NJ, USA Eeckhout L (2010) Computer architecture performance evaluation methods. Synth Lect Comput Archit 5(1):1–145 Fog A (2016) Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs Grass T, Allande C, Armejach A, Rico A, Ayguadé E, Labarta J, Valero M, Casas M, Moreto M (2016) MUSA: a multi-level simulation approach for next-generation HPC machines. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16. IEEE Press Grauer-Gray S, Xu L, Searles R, Ayalasomayajula S, Cavazos J, (2012) Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). IEEE, Piscataway, NJ, USA, pp 1–10 Hardavellas N, Somogyi S, Wenisch TF, Wunderlich RE, Chen S, Kim J, Falsafi B, Hoe JC, Nowatzyk AG (2004) SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of Server Architecture. SIGMETRICS Perform Eval Rev 31(4):31–34 Heywood RIP, Howel F (1995) HASE: a flexible toolset for computer architects. Comput J 38(10):764–775 Hill MD, Marty MR (2008) Amdahl’s law in the multicore era. IEEE Comput 41:33–38 Hughes CJ, Pai VS, Ranganathan P, Adve SV (2002) Rsim: simulating shared-memory multiprocessors with ILP processors. Computer 35(2):40–49 Huh J, Burger D, Keckler SW (2001) Exploring the design space of future CMPs. In: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT ’01, pp 199–210. IEEE Computer Society, USA \(\ddot{{\rm I}}\)pek E, McKee SA, Caruana R, de Supinski BR, Schulz M (2006) Efficiently exploring architectural design spaces via predictive modeling. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp 195–206. Association for Computing Machinery, New York, NY, USA Jiang Y, Zhang EZ, Tian K, Shen X (2010) Is reuse distance applicable to data locality analysis on chip multiprocessors? In: Proceedings of the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, CC’10/ETAPS’10, pp 264–282. Springer Kaxiras S, Young C (2000) Coherence communication prediction in shared-memory multiprocessors. In: IEEE Proceedings Sixth International Symposium on High-performance Computer Architecture. HPCA-6 (Cat. No. PR00550), pp 156–167 Piscataway, NJ, USA. IEEE Keramidas G, Petoumenos P, Kaxiras S (2007) Cache replacement based on reuse-distance prediction. In: 2007 25th International Conference on Computer Design. IEEE, NY, USA, pp 245–250 Kise K, Katagiri T, Honda H, Yuba, T (2004) The simCore/alpha functional simulator. In: Proceedings of the 2004 workshop on computer architecture education: held in conjunction with the 31st International symposium on computer architecture, pp. 24–es Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis and transformation. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’04, pp 75–86. IEEE Computer Society, Washington, DC, USA Lee BC, Collins J, Wang H, Brooks D (2008) CPR: composable performance regression for scalable multiprocessor models. In: 2008 41st IEEE/ACM International Symposium on Microarchitecture, pp 270–281. IEEE Lee S, Meredith JS, Vetter JS (2015) COMPASS: a framework for automated performance modeling and prediction. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pp 405–414. Association for Computing Machinery, New York, NY, USA Liao C, Quinlan DJ, Panas T, de Supinski BR (2010) A ROSE-based openMP 3.0 research compiler supporting multiple runtime libraries. In: Proceedings of the 6th International Conference on Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, IWOMP’10, pp 15–28. Springer-Verlag, Berlin, Heidelberg Maeda RKV, Cai Q, Xu J, Wang Z, Tian Z, (2017) Fast and accurate exploration of multi-level caches using hierarchical reuse distance. In: 2017 IEEE International symposium on high performance computer architecture (HPCA). IEEE, Piscataway, NJ, USA, pp 145–156 Malakar P, Balaprakash P, Vishwanath V, Morozov V, Kumaran K (2018) Benchmarking machine learning methods for performance modeling of scientific applications. In: 2018 IEEE/ACM Performance modeling, benchmarking and simulation of high performance computer systems (PMBS), pp 33–44 Mattson RL, Gecsei J, Slutz DR, Traiger IL (1970) Evaluation techniques for storage hierarchies. IBM Syst J 9(2):78–117 McCurdy C, Fischer C (2005) Using pin as a memory reference generator for multiprocessor simulation. SIGARCH Comput Archit News 33(5):39–44 Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices 42(6):89–100 Niu Q, Dinan J, Lu Q, Sadayappan P (2012) PARDA: a fast parallel reuse distance analysis algorithm. In: Proceedings of the 2012 IEEE 26th International parallel and distributed processing symposium, IPDPS ’12, pp 1284–1294. IEEE Computer Society, USA Obaida MA, Liu J, Chennupati G, Santhi N, Eidenbenz S (2018) Parallel application performance prediction using analysis based models and HPC simulations. In: Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’18, pp 49–59. Association for Computing Machinery, New York, NY, USA Pakin S, McCormick P (2013) Hardware-independent application characterization. In: 2013 IEEE International Symposium on Workload Characterization (IISWC), pp 111–112 Patel A, Afram F, Chen S, Ghose K (2011) MARSS: a full system simulator for multicore X86 CPUs. In: Proceedings of the 48th Design Automation Conference, DAC ’11, pp 1050–1055. Association for Computing Machinery, New York, NY, USA Payer M, Kravina E, Gross TR, (2013) Lightweight memory tracing. In: 2013 USENIX Annual Technical Conference (USENIX ATC 13). USENIX Association, San Jose, CA, pp 115–126 Pouchet LN (2012) Polybench: the polyhedral benchmark suite. URL: http://www.cs.ucla.edu/pouchet/software/polybench Reddi VJ, Settle A, Connors DA, Cohn RS (2004) PIN: a binary instrumentation tool for computer architecture research and education. In: Proceedings of the 2004 Workshop on computer architecture education: Held in Conjunction with the 31st International Symposium on Computer Architecture, WCAE ’04, pp 22–es. Association for Computing Machinery, New York, NY, USA Rodrigues AF, Hemmert KS, Barrett BW, Kersey C, Oldfield R, Weston M, Risen R, Cook J, Rosenfeld P, Cooper-Balis E, Jacob B (2011) The structural simulation toolkit. SIGMETRICS Perform Eval Rev 38(4):37–42 Sabarimuthu JM, Venkatesh TG (2019) Analytical derivation of concurrent reuse distance profile for multi-threaded application running on chip multi-processor. IEEE Trans Parallel Distrib Syst 30(8):1704–1721 Sanchez D, Kozyrakis C (2013) ZSim: fast and accurate microarchitectural simulation of thousand-core systems. SIGARCH Comput Archit News 41(3):475–486 Santhi N, Eidenbenz S, Liu J (2015) The Simian concept: parallel discrete event simulation with interpreted languages and just-in-time compilation. In: 2015 Winter Simulation Conference (WSC), pp. 3013–3024 Schuff DL, Kulkarni M, Pai VS (2010) Accelerating multicore reuse distance analysis with sampling and parallelization. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, p. 53–64. ACM Schuff DL, Parsons BS, Pai VS (2010) Multicore-aware reuse distance analysis. In: 2010 IEEE International symposium on parallel and distributed processing. Workshops and Phd Forum (IPDPSW). IEEE, IEEE, Piscataway, NJ, USA, pp 1–8 Sen R, Wood DA (2013) Reuse-based Online Models for Caches. In: Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, pp. 279–292. ACM, New York, NY, USA Shalf J, Dosanjh S, Morrison J (2011) Exascale computing technology challenges. In: Palma JMLM, Daydé M, Marques O, Lopes JC (eds) High Perform Comput Comput Sci–VECPAR 2010. Springer, Berlin, pp 1–25 Sharkey J, Ponomarev D, Ghose K (2005) Abstract M-SIM: a flexible, multithreaded architectural simulation environment. Technical report, Department of Computer Science, State University of New York at Binghamton Shen X, Shaw J, Meeker B, Ding C (2007) Locality Approximation Using Time. In: Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL ’07, pp. 55–61. ACM, New York, NY, USA Shi X, Su F, Peir JK, Xia Y, Yang Z (2009) Modeling and stack simulation of CMP cache capacity and accessibility. IEEE Trans Parallel Distrib Syst 20(12):1752–1763 Spafford KL, Vetter JS (2012) Aspen: A domain specific language for performance modeling. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11 Van den Steen S, Eyerman S, De Pestel S, Mechri M, Carlson T, Black-Schaffer D, Hagersten E, Eeckhout L (2016) Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans Comput 65(12):3537–3551 den Steen SV, Eeckhout L (2018) Modeling superscalar processor memory-level parallelism. IEEE Comput. Archit. Lett 17(1):9–12 Sun G, Hughes CJ, Kim C, Zhao J, Xu C, Xie Y, Chen YK (2011) Moguls: a model to explore the memory hierarchy for bandwidth improvements. SIGARCH Comput Archit News 39(3):377–388 Terpstra D, Jagode H, You H, Dongarra J (2010) Collecting performance data with PAPI-C. Tools for high performance computing 2009. Springer, Berlin, pp 157–173 Thazhuthaveetil M, Vaswani K, Joseph P (2006) Construction and use of linear regression models for processor performance analysis. In: Twelfth international symposium on high-performance computer architecture. IEEE Computer Society, Los Alamitos, CA, USA, pp. 99–108 Unat D, Chan C, Zhang W, Williams S, Bachan J, Bell J, Shalf J (2015) ExaSAT: an exascale co-design tool for performance modeling. Int J High Perform Comput Appl 29(2):209–232 Wu MJ, Yeung D (2012) Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis. In: Proceedings of the 2012 ACM SIGPLAN workshop on memory systems performance and correctness, MSPC ’12, pp. 2–11. Association for Computing Machinery, New York, NY, USA Wu MJ, Yeung D (2013) Efficient reuse distance analysis of multicore scaling for loop-based parallel programs. ACM Trans Comput Syst 31(1):1–37 Wu MJ, Zhao M, Yeung D (2013) Studying multicore processor scaling via reuse distance analysis. In: Proceedings of the 40th Annual international symposium on computer architecture, ISCA ’13, p. 499–510. Association for Computing Machinery, New York, NY, USA Wu X, Taylor V (2013) Performance modeling of hybrid MPI/openMP scientific applications on large-scale multicore Supercomputers. J Comput Syst Sci 79(8):1256–1268 Zhao M, Yeung D (2017) Using multicore reuse distance to study coherence directories. ACM Trans Comput Syst (TOCS) 35(2):1–49 Zhong Y, Dropsho SG, Shen X, Studer A, Ding C (2007) Miss rate prediction across program inputs and cache configurations. IEEE Trans Comput 56(3):328–343 Zhong Y, Shen X, Ding C (2009) Program locality analysis using reuse distance. ACM Trans Program Lang Syst 31(6):1–39

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA