Video-based spatio-temporal scene graph generation with efficient self-supervised tasks
Abstract
Spatio-Temporal Scene Graph Generation (STSGG) aims to extract a sequence of graph-based semantic representations for high-level visual tasks. Existing works typically fail to exploit the strong temporal correlations and local feature details, and therefore cannot distinguish between dynamic relations (e.g., drinking) and static relations (e.g., holding). Moreover, owing to the long-tailed bias of the data, predictions suffer from misclassification of tail predicates. To address these problems, a Local-Aware Slow-Fast network (SFLA) is proposed for temporal modeling in STSGG. First, a dual-branch network is used to extract static and dynamic relation features, respectively. Second, a Local Relation Awareness (LRA) module is proposed to assign greater importance to the key elements in local relations. Third, three novel self-supervised tasks are proposed, namely spatial position, human attention state, and distance variation. These self-supervised tasks are trained jointly with the main model to alleviate the long-tailed bias problem and strengthen feature discrimination. Systematic experiments show that our method achieves the best performance on the recently proposed Action Genome (AG) dataset and the popular ImageNet Video dataset.
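The joint training scheme in the abstract (a main scene-graph loss plus three self-supervised auxiliary losses) can be sketched as a weighted loss sum. This is a minimal illustration, not the paper's implementation: the task names, loss values, and weights below are all hypothetical.

```python
# Minimal sketch of joint training with auxiliary self-supervised losses.
# All task names, loss values, and weights here are hypothetical.

def total_loss(main_loss, aux_losses, weights):
    """Weighted sum of the main loss and the self-supervised auxiliary losses."""
    return main_loss + sum(weights[name] * loss for name, loss in aux_losses.items())

# Hypothetical per-task losses for the three self-supervised tasks.
aux = {
    "spatial_position": 0.40,  # predicting the subject/object spatial position
    "attention_state":  0.25,  # classifying the human attention state
    "distance_change":  0.30,  # regressing the subject-object distance variation
}
weights = {"spatial_position": 0.5, "attention_state": 0.5, "distance_change": 0.5}

loss = total_loss(1.20, aux, weights)
print(round(loss, 4))  # 1.675
```

In practice each auxiliary head would produce a differentiable loss and the weighted sum would be backpropagated through the shared backbone, so the self-supervised signals regularize the relation features.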
Keywords
#Spatio-temporal scene graph generation #Local-Aware Slow-Fast network #Self-supervision #Action analysis #Temporal modeling