VMSG: a video caption network based on multimodal semantic grouping and semantic attention
Abstract
Keywords