Learning rich feature representation and aggregation for accurate visual tracking

Springer Science and Business Media LLC - Volume 53 - Pages 28114-28132 - 2023
Yijin Yang1, Xiaodong Gu1
1Department of Electronic Engineering, Fudan University, Shanghai, China

Abstract

Visual tracking is a key component of computer vision with a wide range of practical applications. Recently, the tracking-by-segmentation framework has been widely adopted because of its impressive accuracy: it borrows from video object segmentation to achieve precise tracking. Although segmentation-based trackers are effective for target scale estimation, the need for pixel-level segmentation places high demands on the extracted target features. Therefore, in this article, we propose a novel feature representation and aggregation network and introduce it into the tracking-by-segmentation framework to extract and integrate rich features for accurate and robust segmentation tracking. Specifically, the proposed approach first models three complementary feature representations, namely contextual semantic, local position, and structural patch features, through cross-attention, cross-correlation, and dilated involution mechanisms, respectively. Second, these features are fused by a simple feature aggregation network. Third, the fused features are fed into the segmentation network to obtain an accurate estimate of the target state. In addition, to adapt the segmentation network to appearance changes and partial occlusion, we introduce a template update strategy and a bounding-box refinement module for robust segmentation and tracking. Extensive experiments on twelve challenging tracking benchmarks show that the proposed tracker outperforms most state-of-the-art trackers and achieves very promising performance on the OTB100 and VOT2018 benchmarks.
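To make the pipeline described above concrete, the following is a minimal NumPy sketch of two of its ingredients: a naive cross-correlation between a template feature map and a search-region feature map (the matching step used to localize the target), and a simple weighted aggregation of same-shaped feature maps (standing in for the feature aggregation network). All function names, shapes, and the uniform-weight fusion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cross_correlation(search, template):
    """Naive cross-correlation: slide the template over the search
    feature map and sum elementwise products at each offset.
    Shapes: search (C, H, W), template (C, h, w) with h <= H, w <= W.
    Returns a response map of shape (H - h + 1, W - w + 1)."""
    C, H, W = search.shape
    _, h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return out

def aggregate(features, weights):
    """Toy feature aggregation: normalized weighted sum of feature
    maps that share the same shape (a stand-in for a learned
    aggregation network)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * f for w, f in zip(weights, features))
```

In a real tracker the response map would be produced in a learned embedding space and the aggregation weights would be learned end-to-end; this sketch only shows the data flow from complementary feature maps to a fused representation.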
