A survey on techniques for cooperative CPU-GPU computing
Abstract
Keywords
References
Lee, 2014, Boosting CUDA applications with CPU-GPU hybrid computing, Int. J. Parallel Program., 42, 384, 10.1007/s10766-013-0252-y
Pandit, 2014, Fluidic kernels: cooperative execution of OpenCL programs on multiple heterogeneous devices, Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, 273, 10.1145/2544137.2544163
Lee, 2015, SKMD: single kernel on multiple devices for transparent CPU-GPU collaboration, ACM Trans. Comput. Syst. (TOCS), 33, 9, 10.1145/2798725
Piao, 2015, JAWS: a JavaScript framework for adaptive CPU-GPU work sharing, ACM SIGPLAN Notices, 50, 251, 10.1145/2858788.2688525
Tomov, 2010, Dense linear algebra solvers for multicore with GPU accelerators, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 1
Agulleiro, 2012, Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction, Ultramicroscopy, 115, 109, 10.1016/j.ultramic.2012.02.003
Song, 2012, Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems, Proceedings of the 26th ACM International Conference on Supercomputing, 365, 10.1145/2304576.2304625
Xu, 2012, Discrete particle simulation of gas-solid two-phase flows with multi-scale CPU-GPU hybrid computation, Chem. Eng. J., 207, 746, 10.1016/j.cej.2012.07.049
Teodoro, 2013, Efficient irregular wavefront propagation algorithms on hybrid CPU–GPU machines, Parallel Comput., 39, 189, 10.1016/j.parco.2013.03.001
Papadrakakis, 2011, A new era in scientific computing: domain decomposition methods in hybrid CPU–GPU architectures, Comput. Methods Appl. Mech. Eng., 200, 1490, 10.1016/j.cma.2011.01.013
Chakroun, 2013, Combining multi-core and GPU computing for solving combinatorial optimization problems, J. Parallel Distrib. Comput., 73, 1563, 10.1016/j.jpdc.2013.07.023
Chen, 2014, Adaptive block size for dense QR factorization in hybrid CPU-GPU systems via statistical modeling, Parallel Comput., 40, 70, 10.1016/j.parco.2014.03.001
Zhang, 2015, Accelerating aerial image simulation using improved CPU/GPU collaborative computing, Comput. Electr. Eng., 46, 176, 10.1016/j.compeleceng.2015.05.018
Wan, 2016, Efficient CPU-GPU cooperative computing for solving the subset-sum problem, Concurr. Comput. Pract. Exp., 28, 492, 10.1002/cpe.3629
Yao, 2016, STEM image simulation with hybrid CPU/GPU programming, Ultramicroscopy, 166, 1, 10.1016/j.ultramic.2016.04.001
Liu, 2016, Hybrid CPU-GPU scheduling and execution of tree traversals, Proceedings of the 2016 International Conference on Supercomputing, 2
Antoniadis, 2017, A hybrid CPU-GPU parallelization scheme of variable neighborhood search for inventory optimization problems, Electron. Notes Discret. Math., 58, 47, 10.1016/j.endm.2017.03.007
Wende, 2012, On improving the performance of multi-threaded CUDA applications with concurrent kernel execution by kernel reordering, 2012 Symposium on Application Accelerators in High Performance Computing (SAAHPC), 74, 10.1109/SAAHPC.2012.12
Auerbach, 2012, A compiler and runtime for heterogeneous computing, Proceedings of the 49th Annual Design Automation Conference, 271, 10.1145/2228360.2228411
Robson, 2016, Runtime coordinated heterogeneous tasks in Charm++, Proceedings of the Second International Workshop on Extreme Scale Programming Models and Middleware, 40
Huang, 2012, A CPU-GPGPU scheduler based on data transmission bandwidth of workload, 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 610
Boyer, 2013, Improving GPU performance prediction with data transfer modeling, 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 1097, 10.1109/IPDPSW.2013.236
Mokhtari, 2014, BigKernel: high performance CPU-GPU communication pipelining for big data-style applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 819, 10.1109/IPDPS.2014.89
Sunitha, 2017, Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead, 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT), 211, 10.1109/ICICCT.2017.7975190
Lázaro-Muñoz, 2017, A tasks reordering model to reduce transfers overhead on GPUs, J. Parallel Distrib. Comput., 109, 258, 10.1016/j.jpdc.2017.06.015
Stratton, 2008, MCUDA: an efficient implementation of CUDA kernels for multi-core CPUs, LCPC, 2008, 16
Diamos, 2008, Harmony: an execution model and runtime for heterogeneous many core systems, Proceedings of the 17th International Symposium on High Performance Distributed Computing, 197, 10.1145/1383422.1383447
Papakonstantinou, 2009, FCUDA: enabling efficient compilation of CUDA kernels onto FPGAs, 2009 IEEE 7th Symposium on Application Specific Processors, SASP’09, 35, 10.1109/SASP.2009.5226333
Diamos, 2010, Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 353, 10.1145/1854273.1854318
Gummaraju, 2010, Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors, Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 205, 10.1145/1854273.1854302
Hong, 2010, MapCG: writing parallel program portable between CPU and GPU, Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 217, 10.1145/1854273.1854303
Augonnet, 2011, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurr. Comput. Pract. Exp., 23, 187, 10.1002/cpe.1631
Wang, 2008, Task scheduling of parallel processing in CPU-GPU collaborative environment, International Conference on Computer Science and Information Technology, 2008. ICCSIT'08, 228, 10.1109/ICCSIT.2008.27
Luk, 2009, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 45, 10.1145/1669112.1669121
Jiménez, 2009, Predictive runtime code scheduling for heterogeneous architectures, HiPEAC, 9, 19
Gregg, 2012, Fine-grained resource sharing for concurrent GPGPU kernels, Proceedings of the 4th USENIX Conference on Hot Topics in Parallelism (HotPar'12)
Zhong, 2012, Data partitioning on heterogeneous multicore and multi-GPU systems using functional performance models of data-parallel applications, 2012 IEEE International Conference on Cluster Computing (CLUSTER), 191, 10.1109/CLUSTER.2012.34
Sun, 2012, Enabling task-level scheduling on heterogeneous platforms, Proceedings of the 5th Annual Workshop on General Purpose Processing With Graphics Processing Units, 84, 10.1145/2159430.2159440
Grasso, 2013, Automatic problem size sensitive task partitioning on heterogeneous parallel systems, ACM SIGPLAN Notices, 48, 281, 10.1145/2517327.2442545
Zhong, 2014, Kernelet: high-throughput GPU kernel executions with dynamic slicing and scheduling, IEEE Trans. Parallel Distrib. Syst., 25, 1522, 10.1109/TPDS.2013.257
Yao, 2013, Partition strategies for C source programs to support CPU+GPU coordination computing, International Conference on Information Science and Cloud Computing, 39
Aciu, 2013, Algorithm for cooperative CPU-GPU computing, 15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 352
Wen, 2014, Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms, 21st International Conference on High Performance Computing (HiPC), 1
Li, 2014, Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations, Proceedings of the 11th ACM Conference on Computing Frontiers, 36
Vilches, 2015, Adaptive partitioning for irregular applications on heterogeneous CPU-GPU chips, Procedia Comput. Sci., 51, 140, 10.1016/j.procs.2015.05.213
Wang, 2016, Performance optimization for CPU-GPU heterogeneous parallel system, 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 1259, 10.1109/FSKD.2016.7603359
Wang, 2016, A user mode CPU-GPU scheduling framework for hybrid workloads, Future Gener. Comput. Syst., 63, 25, 10.1016/j.future.2016.03.011
Wang, 2010, Power-efficient work distribution method for CPU-GPU heterogeneous system, 2010 International Symposium on Parallel and Distributed Processing With Applications (ISPA), 122, 10.1109/ISPA.2010.22
Ge, 2014, PEACH: a model for performance and energy aware cooperative hybrid computing, Proceedings of the 11th ACM Conference on Computing Frontiers, 24
Lang, 2014, An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration, J. Parallel Distrib. Comput., 74, 2884, 10.1016/j.jpdc.2014.06.001
Ma, 2016, Energy conservation for GPU-CPU architectures with dynamic workload division and frequency scaling, Sustain. Comput. Inform. Syst., 12, 21
Siehl, 2016, Power-aware heterogeneous computing through CPU-GPU hybridization, Energy, 20, 60
Chau, 2017, Energy efficient job scheduling with DVFS for CPU-GPU heterogeneous systems, Proceedings of the Eighth International Conference on Future Energy Systems, 1
Gong, 2017, Cooperative DVFS for energy-efficient HEVC decoding on embedded CPU-GPU architecture, Proceedings of the 54th Annual Design Automation Conference, 42
Kale, 1993, CHARM++: a portable concurrent object oriented system based on C++, ACM SIGPLAN Notices, 28, 91, 10.1145/167962.165874
Lattner, 2004, LLVM: a compilation framework for lifelong program analysis & transformation, Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, 75, 10.1109/CGO.2004.1281665
Stafford, 2017, To distribute or not to distribute: the question of load balancing for performance or energy, European Conference on Parallel Processing, 710
Daga, 2011, On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing, 2011 Symposium on Application Accelerators in High-Performance Computing (SAAHPC), 141, 10.1109/SAAHPC.2011.29
Lee, 2013, Performance characterization of data-intensive kernels on AMD Fusion architectures, Computer Science - Research and Development, 28, 175, 10.1007/s00450-012-0209-1
Spafford, 2012, The tradeoffs of fused memory hierarchies in heterogeneous computing architectures, Proceedings of the 9th Conference on Computing Frontiers, 103, 10.1145/2212908.2212924
Said, 2016, On the efficiency of the accelerated processing unit for scientific computing, Proceedings of the 24th High Performance Computing Symposium, 25
Dashti, 2017, Analyzing memory management methods on integrated CPU-GPU systems, Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, 59, 10.1145/3092255.3092256
Daga, 2012, Exploiting coarse-grained parallelism in B+ tree searches on an APU, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, 240, 10.1109/SC.Companion.2012.40
Gu, 2014, Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems, Proceedings of 5th Asia-Pacific Workshop on Systems, 12
Wyrzykowski, 2013, Efficient execution of erasure codes on AMD APU architecture, International Conference on Parallel Processing and Applied Mathematics, 613
Delorme, 2013, Parallel radix sort on the AMD Fusion accelerated processing unit, 2013 42nd International Conference on Parallel Processing (ICPP), 339, 10.1109/ICPP.2013.43
He, 2013, Revisiting co-processing for hash joins on the coupled CPU-GPU architecture, Proceedings of the VLDB Endowment, 6, 889, 10.14778/2536206.2536216
He, 2014, In-cache query co-processing on coupled CPU-GPU architectures, Proceedings of the VLDB Endowment, 8, 329, 10.14778/2735496.2735497
Eberhart, 2014, Hybrid strategy for stencil computations on the APU, Proceedings of the 1st International Workshop on High-Performance Stencil Computations, 43
Cheng, 2015, Energy-efficient query processing on embedded CPU-GPU architectures, Proceedings of the 11th International Workshop on Data Management on New Hardware, 10
Zhang, 2017, Understanding co-running behaviors on integrated CPU/GPU architectures, IEEE Trans. Parallel Distrib. Syst., 28, 905, 10.1109/TPDS.2016.2586074
Lupescu, 2017, Using the integrated GPU to improve CPU sort performance, 2017 46th International Conference on Parallel Processing Workshops (ICPPW), 39, 10.1109/ICPPW.2017.19
Zhang, 2017, FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 27, 10.1109/CGO.2017.7863726
Zhu, 2017, Co-run scheduling with power cap on integrated CPU-GPU systems, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 967, 10.1109/IPDPS.2017.124
Fang, 2017, Understanding data partition for applications on CPU-GPU integrated processors, International Conference on Mobile Ad-Hoc and Sensor Networks, 426
Mittal, 2015, A survey of CPU-GPU heterogeneous computing techniques, ACM Computing Surveys (CSUR), 47, 69, 10.1145/2788396
Insieme compiler and runtime infrastructure. Distributed and Parallel Systems Group, 2012. University of Innsbruck. URL http://insieme-compiler.org.
Web Worker. URL http://www.w3.org/TR/workers.
WebCL Standard. URL www.khronos.org/webcl/.
CUDA C Programming Guide, Version 8.0, Nvidia Corporation (2017). URL www.nvidia.com.
OpenCL Programming User Guide, rev 1.0, Advanced Micro Devices, Inc. (2013). URL www.amd.com.
OpenMP Application Program Interface, Version 4.0 (2013). URL www.openmp.org.