Video-based spatio-temporal scene graph generation with efficient self-supervised tasks

Multimedia Tools and Applications - Volume 82 - Pages 38947-38966 - 2023
Lianggangxu Chen1,2, Yiqing Cai1, Changhong Lu3, Changbo Wang1, Gaoqi He1,2
1Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China
2School of Computer Science and Technology, East China Normal University, Shanghai, China
3School of Mathematical Sciences, East China Normal University, Shanghai, China

Abstract

Spatio-Temporal Scene Graph Generation (STSGG) aims to extract a sequence of graph-based semantic representations for high-level visual tasks. Existing works often fail to exploit the strong temporal correlations and local feature details, and therefore cannot distinguish actions between dynamic relationships (e.g., drinking) and static relationships (e.g., holding). Moreover, owing to the long-tailed bias, the predicted results suffer from misclassification of tail predicates. To address these problems, a Slow-Fast Local-Aware network (SFLA) is proposed for temporal modeling in STSGG. First, a dual-branch network is used to extract static and dynamic relationship features, respectively. Second, a Local Relation-Aware (LRA) module is proposed to assign greater importance to the key elements in local relationships. Third, three novel self-supervised tasks are proposed, namely spatial location, human attention state, and distance change. These self-supervised tasks are trained jointly with the main model to alleviate the long-tailed bias problem and enhance feature discrimination. Systematic experiments show that our method achieves the best performance on the recently proposed Action Genome (AG) dataset and the popular ImageNet Video dataset.
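The two mechanisms summarized above can be sketched in a few lines: an importance-weighted pooling of local relationship features (in the spirit of the LRA module) and a joint objective adding the three auxiliary self-supervised losses to the main predicate loss. This is a minimal illustrative sketch; the function names and the 0.1 loss weights are assumptions, not the paper's exact formulation.

```python
import math

def attention_pool(scores, feats):
    """Softmax-weighted pooling: assign larger weights to the more important
    local relationship features (sketch of the LRA idea, not the exact module)."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]

def joint_loss(main_loss, aux_losses, weights=(0.1, 0.1, 0.1)):
    """Joint objective: main predicate-classification loss plus a weighted sum
    of the three self-supervised losses (spatial location, human attention
    state, distance change). The 0.1 weights are placeholder assumptions."""
    return main_loss + sum(w * l for w, l in zip(weights, aux_losses))
```

With equal importance scores, `attention_pool` reduces to a plain mean of the feature vectors; unequal scores bias the pooled relationship feature toward the key elements.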

Keywords

#Spatio-temporal scene graph generation #Slow-Fast Local-Aware network #Self-supervision #Action analysis #Temporal modeling
