A power-performance balanced network-on-chip for mixed CPU-GPU systems

Advances in Computers - Tập 124 - Trang 45-80 - 2022
Amirhossein Mirhosseini1, Mohammad Sadrosadati2, Behnaz Soltani3, Hamid Sarbazi-Azad2,3
1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, United States
2School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
3Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Tài liệu tham khảo

Mirhosseini, 2017, Binochs: bimodal network-on-chip for cpu-gpu heterogeneous systems, 1 Sadrosadati, 2018, Ltrf: enabling highcapacity register files for gpus via hardware/software cooperative register prefetching, 489 Lee, 2013, Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures, ACM Trans. Des. Autom. Electron. Syst. (TODAES), 18, 1 Mirhosseini, 2016, Quantifying the difference in resource demand among classic and modern noc workloads, 404 Zhan, 2016, Oscar: Orchestrating stt-ram cache traffic for heterogeneous cpu-gpu architectures, 1 Mirhosseini, 2019, Baran: bimodal adaptive reconfigurable-allocator network-on-chip, ACM Trans. Parallel Comput., 5, 10.1145/3294049 Jang, 2015, Bandwidth-efficient on-chip interconnect designs for gpgpus, 1 Bogdan, 2010, Workload characterization and its impact on multicore platform design, 231 Vijaykumar, 2015, A case for core-assisted bottleneck acceleration in gpus: enabling flexible data compression with assist warps, 41 Kayiran, 2014, Managing gpu concurrency in heterogeneous architectures, 114 A. Mirhosseini, M. Sadrosadati, A. Fakhrzadehgan, M. Modarressi, and H. Sarbazi-Azad, “An energy-efficient virtual channel power-gating mechanism for on-chip networks,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 1527–1532. Mehrvarzy, 2016, Power-and performance-efficient cluster-based network-on-chip with reconfigurable topology, Microprocess. Microsyst., 46, 122, 10.1016/j.micpro.2016.03.004 Jalili, 2016, Power-efficient partially-adaptive routing in on-chip mesh networks, 65 Teimouri, 2013, Power and performance efficient partial circuits in packet-switched networks-on-chip, 509 Rahmati, 2012, Power-efficient deterministic and adaptive routing in torus networks-on-chip, Microprocess. Microsyst., 36, 571, 10.1016/j.micpro.2011.05.009 Sabbaghi-Nadooshan, 2010, The 2d sem: a novel high-performance and low-power mesh-based topology for networks-on-chip, Int. J. Parallel Emerg. Distrib. Syst., 25, 331, 10.1080/17445760902894712 Arjomand, 2010, Power-performance analysis of networks-on-chip with arbitrary buffer allocation schemes, IEEE Trans. Comput. Aid. Des. Integ. Circuits Syst., 29, 1558, 10.1109/TCAD.2010.2061171 Arjomand, 2010, Voltage-frequency planning for thermal-aware, low-power design of regular 3-d nocs, 57 Arjomand, 2009, A comprehensive power-performance model for nocs with multi-flit channel buffers, 470 Modarressi, 2009, Performance and power efficient on-chip communication using adaptive virtual point-to-point connections, 203 Sabbaghi-Nadooshan, 2008, A novel high-performance and low-power mesh-based noc, 1 Modarressi, 2007, Power-aware mapping for reconfigurable noc architectures, 417 Sadrosadati, 2015, An efficient dvs scheme for on-chip networks using reconfigurable virtual channel allocators, 249 Yin, 2012, Energy-efficient non-minimal path on-chip interconnection network for heterogeneous systems, 57 Matsutani, 2012, A multi-vdd dynamic variable-pipeline on-chip router for cmps, 407 Dreslinski, 2010, Near-threshold computing: reclaiming moore's law through energy efficient integrated circuits, Proc. IEEE, 98, 253, 10.1109/JPROC.2009.2034764 Mishra, 2013, A heterogeneous multiple network-on-chip design: an application-aware approach, 1 Che, 2009, Rodinia: a benchmark suite for heterogeneous computing, 44 Gopireddy, 2016, Scalcore: designing a core for voltage scalability, 681 Jan, 2012, A 22nm soc platform technology featuring 3-d tri-gate and high-k/metal gate, optimized for ultra low power, high performance and high density soc applications, 1 Dimitrakopoulos, 2008, Fast arbiters for on-chip network switches, 664 Evripidou, 2012, Virtualizing virtual channels for increased network-on-chip robustness and upgradeability, 21 Gratz, 2008, Regional congestion awareness for load balance in networks-on-chip, 203 Chiu, 2000, The odd-even turn model for adaptive routing, IEEE Trans. Parallel Distrib. Syst., 11, 729, 10.1109/71.877831 Dally, 1993, Deadlock-free adaptive routing in multicomputer networks using virtual channels, IEEE Trans. Parallel Distrib. Syst., 4, 466, 10.1109/71.219761 Valiant, 1982, A scheme for fast parallel communication, SIAM J. Comput., 11, 350, 10.1137/0211027 Leng, 2013, Gpuwattch: enabling energy optimizations in gpgpus, 487 Sethia, 2014, Equalizer: dynamic tuning of gpu resources for efficient execution, 647 Kim, 2008, System level analysis of fast, per-core dvfs using on-chip switching regulators, 123 Godycki, 2014, Enabling realistic fine-grain voltage scaling with reconfigurable power distribution networks, 381 Ausavarungnirun, 2012, Staged memory scheduling: achieving high performance and scalability in heterogeneous systems, 416 Bakhoda, 2009, Analyzing cuda workloads using a detailed gpu simulator, 163 Jiang, 2013, A detailed and flexible cycle-accurate network-on-chip simulator, 86 Sadrosadati, 2017, Effective cache bank placement for gpus, 2017, 31 Abts, 2009, Achieving predictable performance through better memory controller placement in many-core CMPs, ACM SIGARCH Comput. Archit. News, 37, 451, 10.1145/1555815.1555810 Lee, 2012, Tap: A tlp-aware cache management policy for a cpu-gpu heterogeneous architecture, 1 Jeong, 2012, A qos-aware memory controller for dynamically balancing gpu and cpu bandwidth use in an mpsoc, 850 Jog, 2014, Application-aware memory system for fair and efficient execution of concurrent gpgpu applications, 1 Jog, 2013, Orchestrated scheduling and prefetching for gpgpus, 332 A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, “Owl: cooperative thread array aware scheduling techniques for improving gpgpu performance,” in ACM SIGPLAN Not., vol. 48, no. 4. ACM, 2013, pp. 395–406. Kayiran, 2013, Neither more nor less: optimizing thread-level parallelism for gpgpus, 157 Rogers, 2012, Cache-conscious wavefront scheduling, 72 Das, 2010, Aergia: exploiting packet latency slack in on-chip networks, 106 Kumar, 2007, Express virtual channels: towards the ideal interconnection fabric, ACM SIGARCH Comput. Archit. News, 35, 150, 10.1145/1273440.1250681 Nicopoulos, 2006, Vichar: a dynamic virtual channel regulator for network-on-chip routers, 333 Choi, 2016, Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms, 1 Abadal, 2016, Wisync: an architecture for fast synchronization through on-chip wireless communication, ACM SIGPLAN Not., 51, 3, 10.1145/2954679.2872396 Sadrosadati, 2019, Itap: idletime-aware power management for gpu execution units, ACM Trans. Archit. Code Optim., 16, 10.1145/3291606 Miller, 2012, Booster: reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips, 1 Dreslinski, 2007, An energy efficient parallel architecture using near threshold operation, 175 Aghilinasab, 2016, Reducing power consumption of gpgpus through instruction reordering, 356 Ruhl, 2013, Ia-32 Processor with a wide-voltage-operating range in 32-nm cmos, IEEE Micro, 33, 28, 10.1109/MM.2013.8