Boosting convolutional image captioning with semantic content and visual relationship

Displays - Volume 70 - Page 102069 - 2021
Cong Bai1, Anqi Zheng1, Yuan Huang1, Xiang Pan1, Nan Chen2
1Zhejiang University of Technology, Hangzhou 310000, China
2Qilu Normal University, Jinan 250013, China

References

LeCun, 1998, Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278, 10.1109/5.726791
I. Sutskever, J. Martens, G.E. Hinton, Generating text with recurrent neural networks, in: ICML, 2011.
Hochreiter, 1997, Long short-term memory, Neural Comput., 9, 1735, 10.1162/neco.1997.9.8.1735
Hossain, 2019, A comprehensive survey of deep learning for image captioning, ACM Comput. Surv., 51, 1, 10.1145/3295748
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015, pp. 2048–2057.
I. Schwartz, A. Schwing, T. Hazan, High-order attention models for visual question answering, in: Advances in Neural Information Processing Systems, 2017, pp. 3664–3674.
Anderson, 2018, Bottom-up and top-down attention for image captioning and visual question answering, 6077
J. Gehring, M. Auli, D. Grangier, D. Yarats, Y.N. Dauphin, Convolutional sequence to sequence learning, arXiv preprint arXiv:1705.03122 (2017).
Gu, 2017, An empirical study of language CNN for image captioning, 1222
Aneja, 2018, Convolutional image captioning, 5561
Q. Wang, A.B. Chan, CNN+CNN: Convolutional decoders for image captioning, arXiv preprint arXiv:1805.09019 (2018).
Dauphin, 2017, Language modeling with gated convolutional networks, 933
J.H. Kim, G.S. Hong, B.G. Kim, D.P. Dogra, deepGesture: Deep learning-based gesture recognition scheme using motion sensors, Displays 55 (2018) 38–45. Advances in Smart Content-Oriented Display Technology.
A.K. Dash, S.K. Behera, D.P. Dogra, P.P. Roy, Designing of marker-based augmented reality learning environment for kids using convolutional neural network architecture, Displays 55 (2018) 46–54. Advances in Smart Content-Oriented Display Technology.
J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, Deep captioning with multimodal recurrent neural networks (m-RNN), in: 3rd International Conference on Learning Representations, ICLR 2015, 2015.
Karpathy, 2015, Deep visual-semantic alignments for generating image descriptions, 3128
Jia, 2015, Guiding the long-short term memory model for image caption generation
Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).
A. Conneau, H. Schwenk, L. Barrault, Y. Lecun, Very deep convolutional networks for text classification, arXiv preprint arXiv:1606.01781 (2016).
Divvala, 2014, Learning everything about anything: Webly-supervised visual concept learning
D. Teney, L. Liu, A. Van Den Hengel, Graph-structured representations for visual question answering, in: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
Shi, 2019, Explainable and explicit visual reasoning over scene graphs, 8376
Liu, 2019, Learning to assemble neural module tree networks for visual grounding, 4673
Wu, 2016, What value do explicit high level concepts have in vision to language problems?
Yao, 2017, Boosting image captioning with attributes, 4894
Kipf, 2017, Semi-supervised classification with graph convolutional networks
Yao, 2018, Exploring visual relationship for image captioning, 684
Yang, 2019, Auto-encoding scene graphs for image captioning, 10685
T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
Krishna, 2017, Visual Genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vision, 123, 32, 10.1007/s11263-016-0981-7
Dai, 2017, Towards diverse and natural image descriptions via a conditional GAN, 2970
Papineni, 2002, BLEU: A method for automatic evaluation of machine translation, 311
Denkowski, 2014, Meteor universal: Language specific translation evaluation for any target language, 376
C.Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
Vedantam, 2015, CIDEr: Consensus-based image description evaluation, 4566
P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic propositional image caption evaluation, in: European Conference on Computer Vision, Springer, 2016, pp. 382–398.
X. Chen, H. Fang, T.Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C.L. Zitnick, Microsoft COCO captions: Data collection and evaluation server, arXiv preprint arXiv:1504.00325 (2015).