Interactive guidance network for object detection based on radar-camera fusion

Jiapeng Wang1,2, Linhua Kong1,2, Dongxia Chang1,2, Zisen Kong1,2, Yao Zhao1,2
1Institute of Information Science, Beijing Jiaotong University, Beijing, China
2Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, Beijing, China

Abstract

In recent years, the performance of image-based object detection algorithms has improved significantly, especially in the field of autonomous driving. Camera sensors, however, are susceptible to adverse weather conditions, which can significantly degrade their performance. In contrast, millimeter-wave radar is robust to such conditions. As a result, the fusion of millimeter-wave radar and camera sensors has gained considerable attention as a promising approach for object detection. Existing methods, however, rarely take into account the correlation between the two modalities, leaving detection results vulnerable to radar noise, visual blur, and other confounding factors. To address this challenge, we propose an interactive guidance network that leverages a cross-modal attention mechanism, enabling radar and camera sensors to guide each other and learn the underlying correlation between the two modalities. Our approach aims to achieve complementary fusion of features, using information from both radar and camera sensors to enhance detection results. Moreover, a bi-directional fusion Feature Pyramid Network (FPN) structure is introduced, which generates feature maps with enhanced semantic and texture information. To assess the effectiveness of the proposed method, we conducted experiments on the nuScenes dataset. The results demonstrate that our approach outperforms existing state-of-the-art methods in terms of object detection accuracy.
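To make the idea of mutual guidance concrete, the following is a minimal NumPy sketch of generic two-way cross-modal attention, in which radar features query camera features and camera features query radar features. It is not the network described in the paper: the function names (`cross_attend`, `interactive_guidance`), the scaled dot-product form, the residual connection, the random projection weights, and the feature sizes are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats, w_q, w_k, w_v):
    """Let one modality (query) attend to the other (context).

    query_feats:   (N, C) flattened feature map of the guided modality
    context_feats: (M, C) flattened feature map of the guiding modality
    Returns the guided features (N, C) and the attention weights (N, M).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    # Residual connection (an assumption): keep the original features
    # and add what was gathered from the other modality.
    return query_feats + weights @ v, weights

def interactive_guidance(radar, camera, params):
    # Hypothetical helper: apply attention in both directions.
    radar_out, _ = cross_attend(radar, camera, *params["radar_from_camera"])
    camera_out, _ = cross_attend(camera, radar, *params["camera_from_radar"])
    return radar_out, camera_out

# Toy inputs: two 4x4 feature maps with 8 channels, flattened to (16, 8).
rng = np.random.default_rng(0)
C = 8
radar = rng.normal(size=(16, C))
camera = rng.normal(size=(16, C))

# Random projections stand in for learned weights (illustrative only).
params = {
    name: [rng.normal(scale=0.1, size=(C, C)) for _ in range(3)]
    for name in ("radar_from_camera", "camera_from_radar")
}

radar_out, camera_out = interactive_guidance(radar, camera, params)
fused = np.concatenate([radar_out, camera_out], axis=-1)
print(fused.shape)  # (16, 16)
```

In this sketch each modality keeps its own features and adds a weighted summary of the other, so a noisy radar return or a blurred image region can be down-weighted by the attention scores instead of being passed through unchanged. Concatenation is only one possible way to combine the two guided feature maps.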
