VQA: Visual Question Answering

Springer Science and Business Media LLC, Volume 123, Pages 4–31, 2016
Aishwarya Agrawal1, Jiasen Lu1, Stanislaw Antol1, Margaret Mitchell2, C. Lawrence Zitnick3, Devi Parikh4, Dhruv Batra4
1Virginia Tech, Blacksburg, USA
2Microsoft Research, Redmond, USA
3Facebook AI Research, Menlo Park, USA
4Georgia Institute of Technology, Atlanta, USA

Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
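The abstract notes that VQA is amenable to automatic evaluation because open-ended answers are typically short. As a rough illustration of what such an evaluation can look like, the sketch below implements the consensus-style accuracy associated with the VQA benchmark, min(#matching human answers / 3, 1). The function name, the simple lowercasing normalization, and the omission of the official code's answer preprocessing and subset averaging are assumptions made for brevity; this is a minimal sketch, not the official evaluation implementation.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one predicted answer against the ~10 human answers for a question.

    An answer earns credit in proportion to how many annotators gave the same
    answer, capped at 1.0 once at least three annotators agree.
    """
    predicted = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted)
    return min(matches / 3.0, 1.0)


# Example: 7 of 10 annotators typed "2", two typed "two", one typed "3".
humans = ["2"] * 7 + ["two"] * 2 + ["3"]
print(vqa_accuracy("2", humans))  # 1.0   (at least 3 humans agree)
print(vqa_accuracy("3", humans))  # ~0.33 (only 1 human agrees)
```

Because short answers are scored by string agreement rather than by caption-style similarity metrics, large-scale evaluation can be fully automated.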
