Spatio-Temporal Two-Stage Fusion for Video Question Answering
References
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. ViViT: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6836–6846.
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Ben-Younes, H., Cadene, R., Cord, M., Thome, N., 2017. MUTAN: Multimodal Tucker fusion for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2612–2620.
Ben-Younes, H., Cadene, R., Thome, N., Cord, M., 2019. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33 (01), pp. 8102–8109.
Cai, J., Yuan, C., Shi, C., Li, L., Cheng, Y., Shan, Y., 2021. Feature augmented memory with global attention network for VideoQA. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. pp. 998–1004.
Chen, D., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 190–200.
Dang, 2021
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., Huang, H., 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1999–2007.
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M., 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Gao, 2021. Generalized pyramid co-attention with learnable aggregation net for video question answering. Pattern Recognit. 120, 108145. https://doi.org/10.1016/j.patcog.2021.108145.
Gao, J., Ge, R., Chen, K., Nevatia, R., 2018. Motion-appearance co-memory networks for video question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6576–6585.
Gao, L., Zeng, P., Song, J., Li, Y.F., Liu, W., Mei, T., Shen, H.T., 2019. Structured two-stream attention network for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33 (01), pp. 6391–6398.
Girdhar, R., Carreira, J., Doersch, C., Zisserman, A., 2019. Video action transformer network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 244–253.
Gu, 2021. Graph-based multi-interaction network for video question answering. IEEE Trans. Image Process. 30, 2758. https://doi.org/10.1109/TIP.2021.3051756.
Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., Gan, C., 2020. Location-aware graph convolutional networks for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11021–11028.
Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G., 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2758–2766.
Jiang, J., Chen, Z., Lin, H., Zhao, X., Gao, Y., 2020. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11101–11108.
Jiang, P., Han, Y., 2020. Reasoning with heterogeneous graph alignment for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11109–11116.
Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., Zhuang, Y., 2019. Multi-interaction network with object relation for video question answering. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 1193–1201.
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N., 2021. MDETR: Modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790.
Kim, J., Ma, M., Pham, T., Kim, K., Yoo, C.D., 2020. Modality shifting attention network for multi-modal video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10106–10115.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L., 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7.
Le, T.M., Le, V., Venkatesh, S., Tran, T., 2020. Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9972–9981.
Lei, J., Yu, L., Bansal, M., Berg, T.L., 2018. TVQA: Localized, compositional video question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Lei, J., Yu, L., Berg, T.L., Bansal, M., 2019. TVQA+: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574.
Li, X., Gao, L., Wang, X., Liu, W., Xu, X., Shen, H.T., Song, J., 2019. Learnable aggregating net with diversity learning for video question answering. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 1166–1174.
Li, 2022. Complementary spatiotemporal network for video question answering. Multimedia Syst. 28, 161. https://doi.org/10.1007/s00530-021-00805-6.
Lin, T.Y., RoyChowdhury, A., Maji, S., 2015. Bilinear CNN models for fine-grained visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1449–1457.
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28.
Ren, M., Kiros, R., Zemel, R., 2015. Exploring models and data for image question answering. Adv. Neural Inf. Process. Syst. 28.
Simonyan, 2014
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497.
Tucker, L.R., 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311. https://doi.org/10.1007/BF02289464.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Wang, 2020. Long video question answering: A matching-guided attention model. Pattern Recognit. 102, 107248. https://doi.org/10.1016/j.patcog.2020.107248.
Woo, S., Park, J., Lee, J.Y., Kweon, I.S., 2018. CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 3–19.
Xiao, 2020. Hierarchical temporal fusion of multi-grained attention features for video question answering. Neural Process. Lett. 52, 993. https://doi.org/10.1007/s11063-019-10003-1.
Xu, J., Mei, T., Yao, T., Rui, Y., 2016. MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5288–5296.
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y., 2017. Video question answering via gradually refined attention over appearance and motion. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1645–1653.
Ye, 2020. Video question answering via grounded cross-attention network learning. Inf. Process. Manage. 57, 102265. https://doi.org/10.1016/j.ipm.2020.102265.
Ye, Y., Zhao, Z., Li, Y., Chen, L., Xiao, J., Zhuang, Y., 2017. Video question answering via attribute-augmented attention network learning. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 829–832.
Yu, Z., Yu, J., Fan, J., Tao, D., 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1821–1830.
Yu, 2020. Long-term video question answering via multimodal hierarchical memory attentive networks. IEEE Trans. Circuits Syst. Video Technol. 31, 931. https://doi.org/10.1109/TCSVT.2020.2995959.
Yu, 2019. Compositional attention networks with two-stream fusion for video question answering. IEEE Trans. Image Process. 29, 1204. https://doi.org/10.1109/TIP.2019.2940677.
Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.-P., 2018. Memory fusion network for multi-view sequential learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32 (1).
Zeng, K.H., Chen, T.H., Chuang, C.Y., Liao, Y.H., Niebles, J.C., Sun, M., 2017. Leveraging video descriptions to learn video question answering. In: Thirty-First AAAI Conference on Artificial Intelligence.
Zha, 2019. Spatiotemporal-textual co-attention network for video question answering. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 15, 1. https://doi.org/10.1145/3320061.
Zhang, 2022. Action-centric relation transformer network for video question answering. IEEE Trans. Circuits Syst. Video Technol. 32, 63. https://doi.org/10.1109/TCSVT.2020.3048440.
Zhao, Z., Lin, J., Jiang, X., Cai, D., He, X., Zhuang, Y., 2017. Video question answering via hierarchical dual-level attention network learning. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1050–1058.
