FlexPDA: A Flexible Programming Framework for Deep Learning Accelerators

Springer Science and Business Media LLC - Volume 37 - Pages 1200-1220 - 2022
Lei Liu¹,², Xiu Ma¹,², Hua-Xiao Liu¹,², Guang-Li Li³,⁴
¹College of Computer Science and Technology, Jilin University, Changchun, China
²Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
³State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
⁴University of Chinese Academy of Sciences, Beijing, China

Abstract

A wide variety of intelligence accelerators with promising performance and energy efficiency have been deployed in a broad range of applications such as computer vision and speech recognition. However, poor programming productivity hinders the deployment of deep learning accelerators. The low-level library invoked by high-level deep learning frameworks, which supports end-to-end execution of a given model, is designed to reduce the programming burden on intelligence accelerators. Unfortunately, this approach is inflexible for developers: a network model must be built for every deep learning application, which likely leads to unnecessary repetitive implementation. In this paper, we propose FlexPDA, a flexible and efficient programming framework for deep learning accelerators, which provides more optimization opportunities than the low-level library and enables quick porting of applications to intelligence accelerators for fast upgrades. We evaluate FlexPDA using 10 representative operators selected from deep learning algorithms and an end-to-end network. The experimental results validate the effectiveness of FlexPDA, which achieves an end-to-end performance improvement of 1.620x over the low-level library.
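To make the contrast in the abstract concrete, the toy Python sketch below mimics the two programming styles it describes. It is purely illustrative: the `LowLevelLibrary` class, the `flex_kernel` decorator, and the operator names are hypothetical stand-ins invented for this example, not the actual FlexPDA or low-level-library interfaces.

```python
import numpy as np

# --- Style 1: model-centric low-level library (hypothetical API) -------
# The library only runs complete network models, so even a small change
# forces the developer to assemble and execute a whole model.
class LowLevelLibrary:  # hypothetical stand-in, not a real library
    def __init__(self):
        self.layers = []

    def add_layer(self, fn):
        self.layers.append(fn)

    def run_model(self, x):
        # End-to-end execution only: every layer runs, no partial invocation.
        for fn in self.layers:
            x = fn(x)
        return x

# --- Style 2: operator-level flexible framework (hypothetical) ---------
# A FlexPDA-like framework lets developers write and call single operators
# directly, leaving room for per-operator optimization and reuse.
def flex_kernel(op):
    """Register a standalone operator that can be invoked on its own."""
    def wrapper(*tensors):
        return op(*tensors)
    return wrapper

@flex_kernel
def relu(x):
    return np.maximum(x, 0.0)

@flex_kernel
def scale_add(x, w, b):
    return x * w + b

if __name__ == "__main__":
    x = np.array([-1.0, 2.0, -3.0])

    # Model-centric style: everything must be wrapped in a "model".
    lib = LowLevelLibrary()
    lib.add_layer(lambda t: scale_add(t, 2.0, 1.0))
    lib.add_layer(relu)
    print(lib.run_model(x))               # [0. 5. 0.]

    # Operator-level style: call exactly the kernel you need.
    print(relu(scale_add(x, 2.0, 1.0)))   # [0. 5. 0.]
```

The point of the sketch is the calling convention, not the arithmetic: in the model-centric style every new application means rebuilding a model, whereas the operator-level style exposes each kernel for direct reuse and targeted optimization, which is the flexibility gap the paper addresses.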
