Are Vision Transformers Robust to Spurious Correlations?

Soumya Suvra Ghosal1, Yixuan Li1
1Department of Computer Sciences, University of Wisconsin-Madison, Madison, USA

Abstract

Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains unexplored how spurious correlations manifest in such architectures. In this paper, we systematically investigate the robustness of different transformer architectures to spurious correlations on three challenging benchmark datasets. Our study reveals that for transformers, larger models and more pre-training data significantly improve robustness to spurious correlations. Key to their success is the ability to generalize better from the examples where spurious correlations do not hold. Further, we perform extensive ablations and experiments to understand the role of the self-attention mechanism in providing robustness under spuriously correlated environments. We hope that our work will inspire future research on further understanding the robustness of ViT models to spurious correlations.
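
Robustness to spurious correlations is commonly measured by worst-group accuracy, where each group pairs a class label with a spurious attribute (e.g., bird type with background type), so that the weakest group reflects how well a model handles examples where the correlation breaks. The following is a minimal illustrative sketch of that evaluation, not the paper's code; the model, data loader, and attribute fields are assumptions for the example.

```python
# Minimal sketch: worst-group accuracy for spurious-correlation robustness.
# Assumes `loader` yields (image, label, spurious_attr) triples; each
# (label, spurious_attr) pair defines one group. Model and loader are
# placeholders, not the paper's implementation.
from collections import defaultdict

import torch


@torch.no_grad()
def worst_group_accuracy(model, loader, device="cpu"):
    """Return per-group accuracies and the worst-group accuracy."""
    model.eval()
    correct, total = defaultdict(int), defaultdict(int)
    for images, labels, attrs in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for pred, label, attr in zip(preds, labels, attrs):
            group = (int(label), int(attr))  # e.g., (waterbird, land background)
            correct[group] += int(pred == label)
            total[group] += 1
    per_group = {g: correct[g] / total[g] for g in total}
    return per_group, min(per_group.values())
```

A model that relies on the spurious attribute can score well on average accuracy yet poorly on the minority groups, which is exactly the gap this metric exposes.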
