Spatio-Temporal Two-Stage Fusion for Video Question Answering
References
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. ViViT: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6836–6846.
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Ben-Younes, H., Cadene, R., Cord, M., Thome, N., 2017. MUTAN: Multimodal Tucker fusion for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2612–2620.
Ben-Younes, H., Cadene, R., Thome, N., Cord, M., 2019. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33 (01), pp. 8102–8109.
Cai, J., Yuan, C., Shi, C., Li, L., Cheng, Y., Shan, Y., 2021. Feature augmented memory with global attention network for VideoQA. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. pp. 998–1004.
Chen, D., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 190–200.
Dang, 2021
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., Huang, H., 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1999–2007.
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M., 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Gao, 2021. Generalized pyramid co-attention with learnable aggregation net for video question answering. Pattern Recognit. 120, 108145. https://doi.org/10.1016/j.patcog.2021.108145.
Gao, J., Ge, R., Chen, K., Nevatia, R., 2018. Motion-appearance co-memory networks for video question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6576–6585.
Gao, L., Zeng, P., Song, J., Li, Y.F., Liu, W., Mei, T., Shen, H.T., 2019. Structured two-stream attention network for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33 (01), pp. 6391–6398.
Girdhar, R., Carreira, J., Doersch, C., Zisserman, A., 2019. Video action transformer network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 244–253.
Gu, 2021. Graph-based multi-interaction network for video question answering. IEEE Trans. Image Process. 30, 2758. https://doi.org/10.1109/TIP.2021.3051756.
Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., Gan, C., 2020. Location-aware graph convolutional networks for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11021–11028.
Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G., 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2758–2766.
Jiang, J., Chen, Z., Lin, H., Zhao, X., Gao, Y., 2020. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11101–11108.
Jiang, P., Han, Y., 2020. Reasoning with heterogeneous graph alignment for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (07), pp. 11109–11116.
Jin, W., Zhao, Z., Gu, M., Yu, J., Xiao, J., Zhuang, Y., 2019. Multi-interaction network with object relation for video question answering. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 1193–1201.
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N., 2021. MDETR: Modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790.
Kim, J., Ma, M., Pham, T., Kim, K., Yoo, C.D., 2020. Modality shifting attention network for multi-modal video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10106–10115.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L., 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7.
Le, T.M., Le, V., Venkatesh, S., Tran, T., 2020. Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9972–9981.
Lei, J., Yu, L., Bansal, M., Berg, T.L., 2018. TVQA: Localized, compositional video question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Lei, J., Yu, L., Berg, T.L., Bansal, M., 2019. TVQA+: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574.
Li, X., Gao, L., Wang, X., Liu, W., Xu, X., Shen, H.T., Song, J., 2019. Learnable aggregating net with diversity learning for video question answering. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 1166–1174.
Li, 2022. Complementary spatiotemporal network for video question answering. Multimedia Syst. 28, 161. https://doi.org/10.1007/s00530-021-00805-6.
Lin, T.Y., RoyChowdhury, A., Maji, S., 2015. Bilinear CNN models for fine-grained visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1449–1457.
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28.
Ren, M., Kiros, R., Zemel, R., 2015. Exploring models and data for image question answering. Adv. Neural Inf. Process. Syst. 28.
Simonyan, 2014
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497.
Tucker, L.R., 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311. https://doi.org/10.1007/BF02289464.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Wang, 2020. Long video question answering: A matching-guided attention model. Pattern Recognit. 102, 107248. https://doi.org/10.1016/j.patcog.2020.107248.
Woo, S., Park, J., Lee, J.Y., Kweon, I.S., 2018. CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 3–19.
Xiao, 2020. Hierarchical temporal fusion of multi-grained attention features for video question answering. Neural Process. Lett. 52, 993. https://doi.org/10.1007/s11063-019-10003-1.
Xu, J., Mei, T., Yao, T., Rui, Y., 2016. MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5288–5296.
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y., 2017. Video question answering via gradually refined attention over appearance and motion. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1645–1653.
Ye, 2020. Video question answering via grounded cross-attention network learning. Inf. Process. Manage. 57, 102265. https://doi.org/10.1016/j.ipm.2020.102265.
Ye, Y., Zhao, Z., Li, Y., Chen, L., Xiao, J., Zhuang, Y., 2017. Video question answering via attribute-augmented attention network learning. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 829–832.
Yu, Z., Yu, J., Fan, J., Tao, D., 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1821–1830.
Yu, 2020. Long-term video question answering via multimodal hierarchical memory attentive networks. IEEE Trans. Circuits Syst. Video Technol. 31, 931. https://doi.org/10.1109/TCSVT.2020.2995959.
Yu, 2019. Compositional attention networks with two-stream fusion for video question answering. IEEE Trans. Image Process. 29, 1204. https://doi.org/10.1109/TIP.2019.2940677.
Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.-P., 2018. Memory fusion network for multi-view sequential learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32 (1).
Zeng, K.H., Chen, T.H., Chuang, C.Y., Liao, Y.H., Niebles, J.C., Sun, M., 2017. Leveraging video descriptions to learn video question answering. In: Thirty-First AAAI Conference on Artificial Intelligence.
Zha, 2019. Spatiotemporal-textual co-attention network for video question answering. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 15, 1. https://doi.org/10.1145/3320061.
Zhang, 2022. Action-centric relation transformer network for video question answering. IEEE Trans. Circuits Syst. Video Technol. 32, 63. https://doi.org/10.1109/TCSVT.2020.3048440.
Zhao, Z., Lin, J., Jiang, X., Cai, D., He, X., Zhuang, Y., 2017. Video question answering via hierarchical dual-level attention network learning. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1050–1058.
