A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues

Artificial Intelligence Review, Volume 56, Pages 13619–13661, 2023
Himanshu Sharma1, Devanand Padha1
1Department of Computer Science and Information Technology, Central University of Jammu, Jammu, India

Abstract

Image captioning is a relatively recent research area at the intersection of computer vision and natural language processing, and it is widely used in applications such as multi-modal search, robotics, security, remote sensing, medicine, and visual aids. Image captioning techniques have witnessed a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies, organized by our proposed taxonomy. The survey traces several eras of image captioning advances, including template-based, retrieval-based, and encoder-decoder-based models, and also explores captioning in languages other than English. We further provide a thorough review of benchmark image captioning datasets and evaluation metrics. The limited effectiveness of real-time image captioning remains a severe barrier to its use in sensitive applications such as visual aids, security, and medicine. Another observation from our research is the scarcity of personalized, domain-specific datasets, which limits adoption in more advanced applications. Despite influential contributions from many researchers, further effort is required to construct substantially more robust and reliable image captioning models.
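To make the encoder-decoder paradigm named above concrete, the following minimal PyTorch sketch pairs a CNN encoder with an LSTM decoder. This is an illustrative assumption on our part, not a model from the survey: the architecture sizes, the toy vocabulary, and the random inputs are placeholders.

```python
# Minimal sketch of the encoder-decoder captioning paradigm: a CNN encoder
# maps an image to a feature vector, and an LSTM decoder generates the
# caption one token at a time. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class CNNEncoder(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pretrained weights omitted for brevity
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (B, 512)
        return self.fc(feats)                     # (B, embed_size)

class RNNDecoder(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        tokens = self.embed(captions)                          # (B, T, E)
        inputs = torch.cat([img_feats.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                 # (B, T+1, V)

# Toy forward pass: 2 images, captions of length 5, vocabulary of 1000 words.
encoder, decoder = CNNEncoder(256), RNNDecoder(256, 512, 1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 5))
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 6, 1000])
```

In practice such a model is trained with teacher forcing, minimizing cross-entropy between the predicted logits and the ground-truth caption tokens.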
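As a taste of the evaluation metrics the survey reviews, BLEU (Papineni et al. 2002) scores a candidate caption by its n-gram overlap with reference captions. The snippet below uses NLTK's implementation on an invented example; smoothing is applied because short captions often share no 4-grams.

```python
# Score a toy candidate caption against one reference with BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "dog", "runs", "on", "the", "beach"]]          # list of references
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]   # model output

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```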
