Self-attention Guidance Based Crowd Localization and Counting
Abstract
Most existing studies on crowd analysis stop at counting and therefore cannot provide the exact location of each individual. This paper proposes a self-attention guidance based crowd localization and counting network (SA-CLCN), which locates and counts crowds simultaneously. We frame the task as object detection and train the network using the original point annotations of crowd datasets as supervision. The network predicts the center point coordinates of each head together with the total number of people. Specifically, to cope with the spatial and positional variations of the crowd, the proposed method combines a transformer with a convolutional structure to construct a global-local feature extractor (GLFE). The GLFE establishes near-to-far dependencies between elements so that the global context and local detail features of the crowd image are extracted simultaneously. A pyramid feature fusion module (PFFM) then fuses the global and local information from high level to low level to obtain a multiscale feature representation. For the downstream tasks, a simple regression head and a simple classification head predict candidate point offsets and confidence scores, respectively. In addition, the Hungarian algorithm matches the predicted point set to the labelled point set so that the losses can be calculated. The proposed network avoids the errors or higher costs associated with traditional density maps or bounding box annotations. Extensive experiments on several crowd datasets show that the proposed method produces competitive results in both counting and localization.
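The matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pairwise cost equal to the Euclidean distance between a labelled point and a candidate point, minus a weighted confidence score; the abstract does not state the cost actually used by SA-CLCN, so this cost, the weight `score_weight`, and the function names `hungarian` and `match_points` are hypothetical. The sketch also assumes there are at least as many candidate points as labelled points.

```python
import math

def hungarian(cost):
    """Minimum-cost one-to-one assignment of every row to a distinct column.

    cost is a list of n rows with m columns each, with n <= m.
    Returns (row, column) index pairs sorted by row.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("this sketch needs rows <= columns")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the assignments along the augmenting path back to column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)

def match_points(labelled, predicted, scores, score_weight=1.0):
    """Match each labelled point to one predicted point.

    Assumed cost (not taken from the paper): distance - score_weight * score.
    """
    cost = [
        [math.dist(g, q) - score_weight * s for q, s in zip(predicted, scores)]
        for g in labelled
    ]
    return hungarian(cost)

labelled = [(10.0, 10.0), (50.0, 50.0)]
predicted = [(49.0, 51.0), (12.0, 9.0), (100.0, 100.0)]
scores = [0.9, 0.8, 0.3]
pairs = match_points(labelled, predicted, scores)
print(pairs)  # [(0, 1), (1, 0)]
```

Each labelled head is paired with exactly one candidate, and the unmatched candidate (index 2 here) would be treated as a negative when the classification loss is computed. In practice a library routine such as `scipy.optimize.linear_sum_assignment` solves the same assignment problem.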