An encoder-decoder based framework for Hindi image caption generation

Alok Singh1, Thoudam Doren Singh1, Sivaji Bandyopadhyay1
1Center for Natural Language Processing (CNLP), Department of Computer Science and Engineering, National Institute of Technology, Silchar, India

Abstract

Keywords


References

Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086

Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72

Dhir R, Mishra SK, Saha S, Bhattacharyya P (2019) A deep attention based framework for image caption generation in Hindi language. Computación y Sistemas 23(3)

Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010) Every picture tells a story: generating sentences from images. In: European conference on computer vision, pp 15–29. Springer

Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645. https://doi.org/10.1109/TPAMI.2009.167

Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: European conference on computer vision, pp 529–545. Springer

He X, Deng L (2017) Deep learning for image-to-text generation: a technical overview. IEEE Signal Proc Mag 34(6):109–116

Hironobu YM, Takahashi H, Oka R (1999) Image-to-word transformation based on dividing and vector quantizing images with words. In: Boltzmann machines, neural networks, pp 405–409

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. J Artif Intell Res 47:853–899

Hossain M, Sohel F, Shiratuddin MF, Laga H (2019) A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51(6), Article 118

Isozaki H, Hirao T, Duh K, Sudoh K, Tsukada H (2010) Automatic evaluation of translation quality for distant language pairs. In: Proceedings of the 2010 conference on empirical methods in natural language processing, pp 944–952. Association for Computational Linguistics, Cambridge. https://www.aclweb.org/anthology/D10-1092

Jaffe A (2017) Generating image descriptions using multilingual data. In: Proceedings of the second conference on machine translation, pp 458–464

Jia X, Gavves E, Fernando B, Tuytelaars T (2015) Guiding the long-short term memory model for image caption generation. In: Proceedings of the IEEE international conference on computer vision, pp 2407–2415

Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. In: Proceedings of the 31st international conference on international conference on machine learning - vol 32, ICML’14, pp II–595–II–603. JMLR.org. http://dl.acm.org/citation.cfm?id=3044805.3044959

Laskar SR, Singh RP, Pakray P, Bandyopadhyay S (2019) English to Hindi multi-modal neural machine translation and Hindi image captioning. In: Proceedings of the 6th workshop on asian translation. Association for Computational Linguistics, Hong Kong, pp 62–67. https://doi.org/10.18653/v1/D19-5205. https://www.aclweb.org/anthology/D19-5205

Liu M, Hu H, Li L, Yu Y, Guan W (2020) Chinese image caption generation via visual attention and topic modeling. IEEE Transactions on Cybernetics

Ma S, Han Y (2016) Describing images by feeding LSTM with structural words. In: 2016 IEEE international conference on multimedia and expo (ICME), pp 1–6. https://doi.org/10.1109/ICME.2016.7552883

Ordonez V, Kulkarni G, Berg TL (2011) Im2Text: describing images using 1 million captioned photographs. In: Advances in neural information processing systems, pp 1143–1151

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, ACL ’02, pp 311–318. Association for Computational Linguistics, Stroudsburg. https://doi.org/10.3115/1073083.1073135

Parida S, Bojar O, Dash SR (2019) Hindi Visual Genome: a dataset for multimodal English-to-Hindi machine translation. Computación y Sistemas. In print. Presented at CICLing 2019, La Rochelle, France

Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (eds) Advances in neural information processing systems, vol 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf

Ren Z, Wang X, Zhang N, Lv X, Li LJ (2017) Deep reinforcement learning-based image captioning with embedding reward. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 290–298

Sanayai Meetei L, Singh TD, Bandyopadhyay S (2019) WAT2019: English-Hindi translation on Hindi visual genome dataset. In: Proceedings of the 6th workshop on asian translation. https://doi.org/10.18653/v1/D19-5224. https://www.aclweb.org/anthology/D19-5224. Association for Computational Linguistics, Hong Kong, pp 181–188

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

Singh A, Meetei LS, Singh TD, Bandyopadhyay S (2021) Generation and evaluation of Hindi image captions of Visual Genome. In: Maji AK, Saha G, Das S, Basu S, Tavares JMRS (eds) Proceedings of the international conference on computing and communication systems. Springer Singapore, Singapore, pp 65–73. https://doi.org/10.1007/978-981-33-4084-8_7

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112

Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

Wang C, Yang H, Meinel C (2018) Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Trans Multimedia Comput Commun Appl 14(2s):40:1–40:20. https://doi.org/10.1145/3115432

Wang M, Li S, Yang X, Luo C (2016) A parallel-fusion RNN-LSTM architecture for image caption generation. In: 2016 IEEE international conference on image processing (ICIP), pp 4448–4452

Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

Yang Y, Teo CL, Daumé H III, Aloimonos Y (2011) Corpus-guided sentence generation of natural images. In: Proceedings of the conference on empirical methods in natural language processing, pp 444–454. Association for Computational Linguistics

Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2:67–78