VMSG: a video caption network based on multimodal semantic grouping and semantic attention

Xin Yang1, Xiangchen Wang1, Xiaohui Ye1, Tao Li1
1College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China

Abstract

Online video typically contains several kinds of information that a video captioning model can use to generate captions. Video captioning is generally divided into two steps: video information extraction and natural language generation. Existing models suffer from redundant information across consecutive frames during language generation, which degrades caption accuracy. This paper therefore proposes VMSG, a video captioning model based on multimodal semantic grouping and semantic attention. Unlike frame-by-frame decoding, VMSG uses a novel semantic grouping method that merges frames sharing the same semantics into a single semantic group before decoding and predicting the next word, thereby reducing the redundancy of consecutive video frames. Because the importance of each semantic group varies, we further design a semantic attention mechanism that weights the semantic groups, and we use a single-layer LSTM to keep the model simple. Experiments show that VMSG outperforms several state-of-the-art models in caption generation performance and alleviates the problem of redundant information in consecutive video frames.
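To make the decoding idea concrete, below is a minimal sketch (not the authors' implementation) of a single-layer LSTM decoder with attention over semantic groups. It assumes each semantic group has already been reduced to one fused multimodal feature vector of size `feat_dim`, and it uses additive attention for illustration; the names `SemanticAttentionDecoder`, `feat_dim`, and the attention form are assumptions, and the actual VMSG design may differ.

```python
# Illustrative sketch only: semantic attention over precomputed semantic-group
# features feeding a single-layer LSTM decoder, as described in the abstract.
import torch
import torch.nn as nn


class SemanticAttentionDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention scoring each semantic group against the decoder state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Single-layer LSTM, per the abstract's simplified decoder.
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, group_feats, prev_word, state):
        # group_feats: (batch, n_groups, feat_dim), one vector per semantic group
        # prev_word:   (batch,) indices of the previously generated word
        # state:       (h, c), each (batch, hidden_dim)
        h, c = state
        scores = self.att_score(torch.tanh(
            self.att_feat(group_feats) + self.att_hidden(h).unsqueeze(1)
        ))                                            # (batch, n_groups, 1)
        weights = torch.softmax(scores, dim=1)        # importance of each group
        context = (weights * group_feats).sum(dim=1)  # weighted group context
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=-1), (h, c))
        return self.out(h), (h, c)                    # logits for the next word
```

At each decoding step the attention weights re-score the semantic groups against the current hidden state, so groups that are more relevant to the next word contribute more to the context vector.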

Keywords

