Learning rich feature representation and aggregation for accurate visual tracking
Abstract
Visual tracking is a key component of computer vision with a wide range of practical applications. Recently, the tracking-by-segmentation framework, which borrows from video object segmentation to achieve accurate tracking, has been widely adopted because of its high accuracy. Although segmentation-based trackers are effective for target scale estimation, pixel-level segmentation places high demands on the quality of the extracted target features. In this article, we therefore propose a novel feature representation and aggregation network and introduce it into the tracking-by-segmentation framework to extract and integrate rich features for accurate and robust segmentation tracking. Specifically, the proposed approach first models three complementary feature representations: contextual semantic, local position, and structural patch features, obtained through cross-attention, cross-correlation, and dilated involution mechanisms, respectively. Second, these features are fused by a simple feature aggregation network. Third, the fused features are fed into the segmentation network to obtain an accurate target state estimate. In addition, to adapt the segmentation network to appearance changes and partial occlusion, we introduce a template update strategy and a bounding box refinement module for robust segmentation and tracking. Extensive experiments on twelve challenging tracking benchmarks show that the proposed tracker outperforms most state-of-the-art trackers and achieves very promising performance on the OTB100 and VOT2018 benchmarks.
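To make two of the mechanisms named in the abstract concrete, the following is a minimal NumPy sketch of cross-attention and depth-wise cross-correlation between template and search features, followed by a simple channel-concatenation fusion. It is an illustration under stated assumptions, not the architecture of the paper: the function names, tensor shapes, zero padding, and the randomly initialised 1x1 fusion weight are all hypothetical, and the dilated involution branch, the segmentation network, the template update, and the box refinement are omitted.

```python
import numpy as np


def cross_attention(search, template):
    """Scaled dot-product attention: search tokens (Ns, C) attend to template tokens (Nt, C)."""
    scale = 1.0 / np.sqrt(search.shape[1])
    logits = (search @ template.T) * scale          # (Ns, Nt) similarity scores
    logits -= logits.max(axis=1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over template tokens
    return weights @ template                       # (Ns, C)


def depthwise_xcorr(search, template):
    """Slide a (C, h, w) template over a (C, H, W) search map, one channel at a time."""
    channels, height, width = search.shape
    _, h, w = template.shape
    out = np.zeros((channels, height - h + 1, width - w + 1))
    for i in range(height - h + 1):
        for j in range(width - w + 1):
            window = search[:, i:i + h, j:j + w]
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out


def fuse(features, weight):
    """Concatenate feature maps along channels, then mix them with a 1x1 projection."""
    stacked = np.concatenate(features, axis=0)      # (sum of channels, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, h, w = 4, 8, 8, 3, 3
search = rng.standard_normal((C, H, W))
template = rng.standard_normal((C, h, w))

# Contextual branch: flatten both maps to token sequences, attend, reshape back.
attn = cross_attention(search.reshape(C, -1).T, template.reshape(C, -1).T)
attn = attn.T.reshape(C, H, W)

# Local-position branch: zero-pad so the response map keeps the search resolution.
padded = np.pad(search, ((0, 0), (h // 2, h // 2), (w // 2, w // 2)))
corr = depthwise_xcorr(padded, template)

# Aggregation: the random weight stands in for a learned 1x1 convolution.
fusion_weight = rng.standard_normal((C, 2 * C))
fused = fuse([attn, corr], fusion_weight)
print(fused.shape)
```

Padding the search map before correlation is a design choice made here only so that both branches share one spatial size and can be concatenated directly; a real tracker may align resolutions differently.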