JMFEEL-Net: a joint multi-scale feature enhancement and lightweight transformer network for crowd counting
Knowledge and Information Systems - Pages 1-21 - 2024
Abstract
Crowd counting based on convolutional neural networks (CNNs) has made significant progress in recent years. However, the limited receptive field of CNNs makes it difficult to capture the global features needed for comprehensive contextual modeling, which limits the accuracy of count estimation. In comparison, vision transformer (ViT)-based counting networks have demonstrated remarkable performance by exploiting their powerful global contextual modeling capabilities, but ViT models come with higher computational costs and are harder to train. In this paper, we propose a novel network named JMFEEL-Net, which combines joint multi-scale feature enhancement with a lightweight transformer to improve crowd counting accuracy. Specifically, we use a high-resolution CNN as the backbone network to generate high-resolution feature maps. In the backend network, we propose a multi-scale feature enhancement module to address the low recognition accuracy caused by multi-scale variations, especially when counting small-scale objects in dense scenes. Furthermore, we introduce an improved lightweight ViT encoder to effectively model complex global contexts. We also adopt a multi-density map supervision strategy that learns crowd distribution features from feature maps of different resolutions, thereby improving the quality of the density maps and the training efficiency. To validate the effectiveness of the proposed method, we conduct extensive experiments on four challenging datasets, namely ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF, and JHU-Crowd++, achieving very competitive counting performance.
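As a rough illustration of supervising density maps at several resolutions, the sketch below builds count-preserving targets by sum-pooling a ground-truth density map and combines per-scale mean squared errors into one loss. It is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `downsample_density` and `multi_scale_loss`, the use of sum-pooling, the MSE loss, the scale factors, and the weights are all hypothetical choices made for this example.

```python
import numpy as np


def downsample_density(density, factor):
    """Sum-pool a 2D density map by `factor` (hypothetical helper).

    Summing (rather than averaging) each block keeps the total count,
    i.e. the integral of the density map, unchanged.
    """
    h, w = density.shape
    h2, w2 = h // factor, w // factor
    cropped = density[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor).sum(axis=(1, 3))


def multi_scale_loss(predictions, gt_density, factors, weights):
    """Weighted sum of per-scale MSE terms (assumed loss, for illustration)."""
    total = 0.0
    for pred, factor, weight in zip(predictions, factors, weights):
        target = downsample_density(gt_density, factor)
        total += weight * float(np.mean((pred - target) ** 2))
    return total


# Toy ground truth: three annotated heads as unit impulses on an 8x8 grid.
gt = np.zeros((8, 8))
gt[1, 1] = gt[4, 5] = gt[6, 2] = 1.0

factors = [1, 2, 4]          # assumed output strides of three prediction heads
weights = [1.0, 0.5, 0.25]   # assumed per-scale loss weights

targets = [downsample_density(gt, f) for f in factors]
counts = [float(t.sum()) for t in targets]

# A set of predictions identical to the targets gives zero loss.
perfect_loss = multi_scale_loss(targets, gt, factors, weights)
```

In this toy setup every target sums to the same count regardless of resolution, which is the property that lets a coarse density map still be read as a count estimate.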