P2T: Pyramid Pooling Transformer for Scene Understanding

IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 11, pp. 12760-12771, 2023
Yu-Huan Wu1,2, Yun Liu3, Xin Zhan1, Ming-Ming Cheng2
1Alibaba DAMO Academy, Hangzhou, China
2TMCC, College of Computer Science, Nankai University, Tianjin, China
3Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore

Abstract

Keywords

