A New Attention-Based LSTM for Image Captioning

Springer Science and Business Media LLC - Volume 54 - Pages 3157-3171 - 2022
Fen Xiao1, Wenfeng Xue1, Yanqing Shen1, Xieping Gao2
1The MOE Key Laboratory of Intelligent Computing and Information Processing, Xiangtan University, Xiangtan, China
2Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China

Abstract

Image captioning aims to describe the content of an image with a complete and natural sentence. Recently, image captioning methods with an encoder-decoder architecture have made great progress, in which the LSTM has become the dominant decoder for generating word sequences. However, in the decoding stage, the input vector stays the same at every time step and is largely uncorrelated with the previously attended visual regions or generated words. In this paper, we propose an attentional LSTM (ALSTM) and show how to integrate it into state-of-the-art automatic image captioning frameworks. Instead of the traditional LSTM used in existing models, ALSTM learns to refine its input vector from the network's hidden states and sequential context information. ALSTM can thus attend to more relevant features, such as spatial regions and visual relations, and pay more attention to the most relevant context words. Moreover, ALSTM is employed as the decoder in several classical frameworks, demonstrating how to obtain effective visual/context attention for updating the input vector. Extensive quantitative and qualitative evaluations on the Flickr30K and MSCOCO image datasets with the modified networks illustrate the superiority of ALSTM. ALSTM-based methods can generate high-quality descriptions by combining sequence context and visual relations.
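
The sketch below illustrates the general idea described in the abstract: refining the decoder's input vector at each step by attending over visual features with the previous hidden state, instead of feeding a fixed input vector. It is a minimal PyTorch sketch under our own assumptions; the class name, dimensions, and exact attention formulation are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of an attention-refined LSTM decoder step (illustrative only).
# AttentionalLSTMDecoder, refine_input, and all dimensions are hypothetical names
# and values chosen for this example, not the paper's ALSTM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionalLSTMDecoder(nn.Module):
    def __init__(self, embed_dim, hidden_dim, feat_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Scores each image region against the current hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def refine_input(self, feats, h):
        # feats: (B, R, feat_dim) regional features; h: (B, hidden_dim)
        e = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)        # (B, R, 1) attention over regions
        return (alpha * feats).sum(dim=1)  # (B, feat_dim) attended visual vector

    def step(self, word_ids, feats, h, c):
        # Refine the input from the previous hidden state rather than using a fixed vector.
        ctx = self.refine_input(feats, h)
        x = torch.cat([self.embed(word_ids), ctx], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c


# Usage: a single decoding step with random tensors.
dec = AttentionalLSTMDecoder(embed_dim=256, hidden_dim=512, feat_dim=2048, vocab_size=10000)
feats = torch.randn(2, 36, 2048)           # e.g. 36 bottom-up region features per image
h = torch.zeros(2, 512)
c = torch.zeros(2, 512)
logits, h, c = dec.step(torch.tensor([1, 1]), feats, h, c)
print(logits.shape)                        # torch.Size([2, 10000])
```

In this sketch the refinement is a standard additive attention over region features; the paper's ALSTM also incorporates sequential context words, which could be handled analogously by attending over previously generated word embeddings.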
