A Parametrizable High-Level Synthesis Library for Accelerating Neural Networks on FPGAs

Lester Kalms1, Pedram Amini Rad1, Muhammad Asad Ali1, Arsany Iskander2, Diana Göhringer1
1Technische Universität Dresden, Dresden, Germany
2German University in Cairo, New Cairo, Egypt

Abstract

In recent years, Convolutional Neural Networks (CNNs) have been incorporated into a large number of applications, including multimedia retrieval and image classification. However, CNN-based algorithms are computationally and resource intensive and therefore difficult to use in embedded systems. FPGA-based accelerators are becoming increasingly popular in research and industry due to their flexibility and energy efficiency. However, the available resources and the size of the on-chip memory can limit the performance of an FPGA accelerator for CNNs. This work proposes a High-Level Synthesis (HLS) library for CNN algorithms. It contains seven different streaming-capable CNN functions (plus two conversion functions) for creating large neural networks with deep pipelines. The functions offer many parameter settings (e.g. for resolution, feature maps, data types, kernel size, parallelization and accuracy), which also enable compile-time optimizations. Our functions are integrated into HiFlipVX, an open-source HLS FPGA library for image processing and object detection. This makes it possible to implement different types of computer vision applications with one library. Due to the various configuration and parallelization possibilities of the library functions, a high-performance, scalable and resource-efficient system can be implemented, as our evaluation of the MobileNets algorithm shows.
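To illustrate the kind of compile-time parameterization the abstract describes, the following is a minimal sketch of a streaming-style 1x1 (pointwise) convolution whose data type, feature-map counts, pixel count and parallelization degree are fixed via C++ template parameters, so an HLS tool can unroll and pipeline the inner loops accordingly. This is a hypothetical illustration, not the actual HiFlipVX API; the function name, parameter names and layout (pixel-major, channel-interleaved) are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a compile-time parameterized HLS-style layer
// (not the real HiFlipVX interface). Template parameters:
//   T      - data type (e.g. int16_t for fixed-point inference)
//   IFM    - number of input feature maps
//   OFM    - number of output feature maps
//   PIXELS - number of pixels streamed through the function
//   PAR    - parallelization degree over output feature maps
template <typename T, int IFM, int OFM, int PIXELS, int PAR>
void PointwiseConv(const T *src, const T (&weights)[OFM][IFM], T *dst) {
    static_assert(OFM % PAR == 0, "PAR must divide the output feature maps");
    for (int p = 0; p < PIXELS; ++p) {          // stream over the image pixels
        for (int o = 0; o < OFM; o += PAR) {    // groups of output feature maps
            for (int v = 0; v < PAR; ++v) {     // candidate for loop unrolling
                T acc = 0;
                for (int i = 0; i < IFM; ++i)   // dot product across input maps
                    acc += src[p * IFM + i] * weights[o + v][i];
                dst[p * OFM + o + v] = acc;
            }
        }
    }
}
```

Because every loop bound is a compile-time constant, an HLS compiler can fully unroll the `v` loop (e.g. via an unroll pragma) and pipeline the pixel loop, trading resources for throughput exactly as the library's parallelization parameters suggest.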

Keywords


References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283). https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md.

Akgün, G., Kalms, L., Göhringer, D. (2020). Resource efficient dynamic voltage and frequency scaling on xilinx fpgas. In International symposium on applied reconfigurable computing (ARC) (pp. 178–192).

Chen, Y., He, J., Zhang, X., Hao, C., Chen, D. (2019). Cloud-dnn: an open framework for mapping dnn models to cloud fpgas. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 73–82). https://doi.org/10.1145/3289602.3293915.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., Temam, O. (2014). Dadiannao: a machine-learning supercomputer. In 47th annual IEEE/ACM international symposium on microarchitecture (pp. 609–622).

Giduthuri, R., & Pulli, K. (2016). Openvx: A framework for accelerating computer vision. In SIGGRAPH ASIA 2016 Courses (pp. 14:1–14:50). https://doi.org/10.1145/2988458.2988513.

Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., Cong, J. (2017). Fp-dnn: An automated framework for mapping deep neural networks onto fpgas with rtl-hls hybrid templates. In 25th annual international symposium on field-programmable custom computing machines (FCCM) (pp. 152–159).

Guo, K., Sui, L., Qiu, J., Yu, J., Wang, J., Yao, S., Han, S., Wang, Y., Yang, H. (2018). Angel-eye: A complete design flow for mapping cnn onto embedded fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(1), 35–47.

Hassan, R., & Mostafa, H. (2020). Implementation of deep neural networks on fpga-cpu platform using xilinx sdsoc. Analog Integrated Circuits and Signal Processing. https://doi.org/10.1007/s10470-020-01638-5.

Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.

Intel. (2020). Intel FPGA SDK for OpenCL Pro Edition: Programming Guide 19.4.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

Ji, S., Xu, W., Yang, M., Yu, K. (2013). 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on multimedia (pp. 675–678).

Kalms, L., & Göhringer, D. (2017). Exploration of opencl for fpgas using sdaccel and comparison to gpus and multicore cpus. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–4). https://doi.org/10.23919/FPL.2017.8056847.

Kalms, L., & Göhringer, D. (2020). Accelerated high-level synthesis feature detection for FPGAs using HiFlipVX, chap. 7, (pp. 115–135). New York: Springer.

Kalms, L., & Göhringer, D. (2020). Hiflipvx: Open source high-level synthesis fpga library for image processing. https://github.com/TUD-ADS/HiFlipVX.

Kalms, L., Podlubne, A., Göhringer, D. (2019). Hiflipvx: An open source high-level synthesis fpga library for image processing. In Applied reconfigurable computing (pp. 149–164).

Krizhevsky, A., Sutskever, I., Hinton, G.E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386.

Liu, B., Zou, D., Feng, L., Feng, S., Fu, P., Li, J. (2019). An fpga-based cnn accelerator integrating depthwise separable convolution. Electronics, 8, 281.

Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J. (2019). A uniform architecture design for accelerating 2d and 3d cnns on fpgas. Electronics, 8, 65.

Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on computer vision and pattern recognition (CVPR) (pp. 3431–3440).

Omidian, H., & Lemieux, G.G.F. (2018). Janus: A compilation system for balancing parallelism and performance in openvx. Journal of Physics: Conference Series (JPCS), 012011. https://doi.org/10.1088/1742-6596/1004/1/012011.

Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P., Shyu, M.L., Chen, S.C., Iyengar, S.S. (2018). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys, 51(5). https://doi.org/10.1145/3234150.

Qasaimeh, M., Denolf, K., Lo, J., Vissers, K., Zambreno, J., Jones, P.H. (2019). Comparing energy efficiency of cpu, gpu and fpga implementations for vision kernels. In International conference on embedded software and systems (ICESS) (pp. 1–8).

Ren, S., He, K., Girshick, R., Sun, J. (2017). Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

Sekar, C., & Hemasunder. (2017). Tutorial t7: Designing with xilinx sdsoc. In 30th international conference on VLSI design and 16th international conference on embedded systems (VLSID) (pp. xl–xli). https://doi.org/10.1109/VLSID.2017.97.

Song, L., Wang, Y., Han, Y., Zhao, X., Liu, B., Li, X. (2016). C-brain: A deep learning accelerator that tames the diversity of cnns through adaptive data-level parallelization. In Proceedings of the 53rd Annual Design Automation Conference (DAC). https://doi.org/10.1145/2897937.2897995.

Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula, S., Seo, J.S., Cao, Y. (2016). Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 16–25). https://doi.org/10.1145/2847263.2847276.

Taheri, S., Behnam, P., Bozorgzadeh, E., Veidenbaum, A., Nicolau, A. (2019). Affix: Automatic acceleration framework for fpga implementation of openvx vision algorithms. In International symposium on field-programmable gate arrays (FPGA) (pp. 252–261). https://doi.org/10.1145/3289602.3293907.

Tapiador Morales, R., Rios-Navarro, A., Linares-Barranco, A., Kim, M., Kadetotad, D., Seo, J.S. (2016). Comprehensive evaluation of opencl-based convolutional neural network accelerators in xilinx and altera fpgas. CoRR.

Tensorflow. (2020). Ssd mobilenet v1. https://tensorflow.org/lite/models/object_detection/overview.

Venieris, S.I., & Bouganis, C. (2017). Latency-driven design for fpga-based convolutional neural networks. In 27th international conference on field programmable logic and applications (FPL) (pp. 1–8).

Wang, Y., Xu, J., Han, Y., Li, H., Li, X. (2016). Deepburning: Automatic generation of fpga-based learning accelerators for the neural network family. In 53rd design automation conference (DAC) (pp. 1–6).

Winterstein, F., Bayliss, S., Constantinides, G.A. (2013). High-level synthesis of dynamic data structures: A case study using vivado hls. In International conference on field-programmable technology (FPT) (pp. 362–365). https://doi.org/10.1109/FPT.2013.6718388.

Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P. (2020). Integer quantization for deep learning inference: Principles and empirical evaluation.

Xilinx. (2019). xfopencv. https://github.com/Xilinx/xfopencv.

Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J. (2015). Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 161–170). https://doi.org/10.1145/2684746.2689060.

Zhang, C., Sun, G., Fang, Z., Zhou, P., Pan, P., Cong, J. (2018). Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

Zhang, J., & Li, J. (2017). Improving the performance of opencl-based fpga accelerator for convolutional neural network. In Proceedings of the international symposium on field-programmable gate arrays (FPGA) (pp. 25–34). https://doi.org/10.1145/3020078.3021698.