JMFEEL-Net: a joint multi-scale feature enhancement and lightweight transformer network for crowd counting

Mingtao Wang1, Xin Zhou1, Yuanyuan Chen1
1College of Computer Science, Sichuan University, Chengdu, China

Tóm tắt

Crowd counting based on convolutional neural networks (CNNs) has made significant progress in recent years. However, the limited receptive field of CNNs makes it challenging to capture global features for comprehensive contextual modeling, resulting in insufficient accuracy in count estimation. In comparison, vision transformer (ViT)-based counting networks have demonstrated remarkable performance by exploiting their powerful global contextual modeling capabilities. However, ViT models are associated with higher computational costs and training difficulty. In this paper, we propose a novel network named JMFEEL-Net, which utilizes joint multi-scale feature enhancement and lightweight transformer to improve crowd counting accuracy. Specifically, we use a high-resolution CNN as the backbone network to generate high-resolution feature maps. In the backend network, we propose a multi-scale feature enhancement module to address the problem of low recognition accuracy caused by multi-scale variations, especially when counting small-scale objects in dense scenes. Furthermore, we introduce an improved lightweight ViT encoder to effectively model complex global contexts. We also adopt a multi-density map supervision strategy to learn crowd distribution features from feature maps of different resolutions, thereby improving the quality and training efficiency of the density maps. To validate the effectiveness of the proposed method, we conduct extensive experiments on four challenging datasets, namely ShanghaiTech Part A/B, UCF-QNRF, and JHU-Crowd++, achieving very competitive counting performance.

Tài liệu tham khảo

Chan AB, Liang Z-SJ, Vasconcelos N (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In: 2008 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–7 Sindagi VA, Patel VM (2018) A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recogn Lett 107:3–16 Liu Z, Wang Q, Meng F (2022) A benchmark for multi-class object counting and size estimation using deep convolutional neural networks. Eng Appl Artif Intell 116:105449 Ko T (2008) A survey on behavior analysis in video surveillance for homeland security applications. In: 2008 37th IEEE applied imagery pattern recognition workshop. IEEE, pp 1–8 Zhang Y, Zhou D, Chen S, Gao S, Ma Y (2016) Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 589–597 Babu Sam D, Surya S, Venkatesh Babu R (2017) Switching convolutional neural network for crowd counting. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5744–5752 Li Y, Zhang X, Chen D (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1091–1100 Liu W, Salzmann M, Fua P (2019) Context-aware crowd counting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5099–5108 Basalamah S, Khan SD, Ullah H (2019) Scale driven convolutional neural network model for people counting and localization in crowd scenes. IEEE Access 7:71576–71584 Gao J, Wang Q, Yuan Y (2019) Scar: spatial-/channel-wise attention regression networks for crowd counting. Neurocomputing 363:1–8 Jiang X, Zhang L, Xu M, Zhang T, Lv P, Zhou B, Yang X, Pang Y (2020) Attention scaling for crowd counting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4706–4715 Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 Liang D, Chen X, Xu W, Zhou Y, Bai X (2022) Transcrowd: weakly-supervised crowd counting with transformers. Sci China Inf Sci 65(6):160104 Lin H, Ma Z, Ji R, Wang Y, Hong X (2022) Boosting crowd counting via multifaceted attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19628–19637 Tian Y, Chu X, Wang H (2021) CCTrans: simplifying and improving crowd counting with transformer. arXiv:2109.14483 Qian Y, Zhang L, Hong X, Donovan C, Arandjelovic O, Fife U, Harbin P (2022) Segmentation assisted u-shaped multi-scale transformer for crowd counting. In: 2022 British machine vision conference. The British Machine Vision Association (BMVA) Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X et al (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 43(10):3349–3364 Sam DB, Sajjan NN, Babu RV, Srinivasan M (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing CNN. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3618–3626 Cao X, Wang Z, Zhao Y, Su F (2018) Scale aggregation network for accurate and efficient crowd counting. In: Proceedings of the European conference on computer vision (ECCV), pp 734–750 Sindagi VA, Patel VM (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In: Proceedings of the IEEE international conference on computer vision, pp 1861–1870 Liu L, Qiu Z, Li G, Liu S, Ouyang W, Lin L (2019) Crowd counting with deep structured scale integration network. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1774–1783 Guo D, Li K, Zha Z-J, Wang M (2019) DADNet: dilated-attention-deformable convnet for crowd counting. In: Proceedings of the 27th ACM international conference on multimedia, pp 1823–1832 Liu N, Long Y, Zou C, Niu Q, Pan L, Wu H (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3225–3234 Zou Z, Cheng Y, Qu X, Ji S, Guo X, Zhou P (2019) Attend to count: crowd counting with adaptive capacity multi-scale CNNs. Neurocomputing 367:75–83 Zhang A, Shen J, Xiao Z, Zhu F, Zhen X, Cao X, Shao L (2019) Relational attention network for crowd counting. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6788–6797 Xie J, Pang C, Zheng Y, Li L, Lyu C, Lyu L, Liu H (2022) Multi-scale attention recalibration network for crowd counting. Appl Soft Comput 117:108457 Mehta S, Rastegari M (2021) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv:2110.02178 Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 764–773 Idrees H, Tayyab M, Athrey K, Zhang D, Al-Maadeed S, Rajpoot N, Shah M (2018) Composition loss for counting, density map estimation and localization in dense crowds. In: Proceedings of the European conference on computer vision (ECCV), pp 532–546 Sindagi VA, Yasarla R, Patel VM (2020) JHU-Crowd++: large-scale crowd counting dataset and a benchmark method. IEEE Trans Pattern Anal Mach Intell 44(5):2594–2609 Liang D, Xu W, Zhu Y, Zhou Y (2022) Focal inverse distance transform maps for crowd localization. IEEE Transactions on Multimedia Liang D, Xu W, Bai X (2022) An end-to-end transformer model for crowd localization. In: European conference on computer vision. Springer, pp 38–54 Dai M, Huang Z, Gao J, Shan H, Zhang J (2023) Cross-head supervision for crowd counting with noisy annotations. In: ICASSP 2023-2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1–5 Wang Q, Breckon TP (2022) Crowd counting via segmentation guided attention networks and curriculum loss. IEEE Trans Intell Transp Syst 23(9):15233–15243 Gao X, Xie J, Chen Z, Liu A-A, Sun Z, Lyu L (2023) Dilated convolution-based feature refinement network for crowd localization. ACM Trans Multimed Comput Commun Appl 19(6):1–16 Tian Y, Lei Y, Zhang J, Wang JZ (2019) Padnet: pan-density crowd counting. IEEE Trans Image Process 29:2714–2727 Liu X, Yang J, Ding W, Wang T, Wang Z, Xiong J (2020) Adaptive mixture regression network with local counting map for crowd counting. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, pp 241–257 Wei B, Yuan Y, Wang Q (2020) MSPNet: multi-supervised parallel network for crowd counting. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2418–2422 Wan J, Chan A (2020) Modeling noisy annotations for crowd counting. Adv Neural Inf Process Syst 33:3386–3396 Khan SD, Basalamah S (2021) Sparse to dense scale prediction for crowd couting in high density crowds. Arab J Sci Eng 46(4):3051–3065 Xu C, Liang D, Xu Y, Bai S, Zhan W, Bai X, Tomizuka M (2022) AutoScale: learning to scale for crowd counting. Int J Comput Vision 130(2):405–434 Khan SD, Basalamah S (2021) Scale and density invariant head detection deep model for crowd counting in pedestrian crowds. Vis Comput 37(8):2127–2137 Wan J, Liu Z, Chan AB (2021) A generalized loss function for crowd counting and localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1974–1983 Khan SD, Salih Y, Zafar B, Noorwali A (2021) A deep-fusion network for crowd counting in high-density crowded scenes. Int J Comput Intell Syst 14(1):168 Meng Y, Bridge J, Wei M, Zhao Y, Qiao Y, Yang X, Huang X, Zheng Y (2022) Counting with adaptive auxiliary learning. arXiv:2203.04061