JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

SN Computer Science - Volume 5 - Pages 1-16 - 2024
Shishir Muralidhara1, Sravan Kumar Jagadeesh1, René Schuster1,2, Didier Stricker1,2
1Augmented Vision, German Research Center for Artificial Intelligence-DFKI, Kaiserslautern, Germany
2Augmented Vision, University of Kaiserslautern-Landau, Kaiserslautern, Germany

Abstract

Part-aware panoptic segmentation is a computer vision problem that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our joint panoptic part fusion (JPPF), which effectively combines the three individual segmentations to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this. First, a unified model for the three problems is desired, since it allows for mutually improved and consistent representation learning. Second, the combination must be balanced so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of Part-aware Panoptic Quality (PartPQ) and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, show that its impact is most significant for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design on 5 additional datasets without fine-tuning.
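
For readers unfamiliar with PartPQ, the following is a minimal sketch of a per-class PartPQ-style score, assuming the commonly cited form from the part-aware panoptic segmentation literature (de Geus et al., 2021): the sum of part-level IoUs over matched segments, divided by |TP| + 0.5 |FP| + 0.5 |FN|. The function names `mean_part_iou` and `part_pq`, the pixel-set representation, and the handling of empty inputs are illustrative assumptions, not taken from the paper or its evaluation code.

```python
from typing import Dict, Sequence, Set


def mean_part_iou(pred_parts: Dict[str, Set[int]],
                  gt_parts: Dict[str, Set[int]]) -> float:
    """Mean IoU over part labels inside one matched segment pair.

    Each dict maps a part label to the set of pixel ids carrying that label.
    Simplifying assumption: every part label present in either input counts.
    """
    ious = []
    for label in set(pred_parts) | set(gt_parts):
        pred = pred_parts.get(label, set())
        gt = gt_parts.get(label, set())
        union = pred | gt
        if union:
            ious.append(len(pred & gt) / len(union))
    return sum(ious) / len(ious) if ious else 0.0


def part_pq(tp_part_ious: Sequence[float], num_fp: int, num_fn: int) -> float:
    """Per-class score: sum of IoUs over true positives / (TP + FP/2 + FN/2)."""
    denom = len(tp_part_ious) + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # assumption: a class with no segments at all scores 0
    return sum(tp_part_ious) / denom


# Toy example: one matched car segment with two part labels.
pred = {"window": {1, 2, 3}, "wheel": {4, 5}}
gt = {"window": {2, 3}, "wheel": {4, 5}}
iou_car = mean_part_iou(pred, gt)          # (2/3 + 1) / 2
score = part_pq([iou_car, 0.6], num_fp=1, num_fn=1)
```

For classes without part annotations, the per-segment value passed to `part_pq` would simply be the ordinary instance-level IoU, which reduces the score to standard panoptic quality for that class.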
