Spatial–temporal transformer for end-to-end sign language recognition

Complex & Intelligent Systems - Volume 9 - Pages 4645-4656 - 2023
Zhenchao Cui1,2, Wenbo Zhang1,2,3, Zhaoxin Li3, Zhaoqi Wang3
1School of Cyber Security and Computer, Hebei University, Baoding, China
2Hebei Machine Vision Engineering Research Center, Hebei University, Baoding, China
3Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Abstract

Continuous sign language recognition (CSLR) is an essential task for communication between hearing-impaired people and those without hearing impairments; it aims at aligning high-density video sequences with low-density text sequences. Current CSLR methods are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, which makes it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network: the Spatial–Temporal Transformer Network (STTN). The model encodes and decodes the sign language video into a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we chunk the sign language video frames, i.e., "image to patch", which reduces the computational complexity. Second, global features of the sign language video are modeled at the start of the network, and the spatial action features of the current video frame and the semantic features of consecutive frames along the temporal dimension are extracted separately, so that visual features are fully extracted. Finally, the model uses a simple cross-entropy loss to align video and text. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014); the results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.
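The pipeline outlined above (frames chunked into patches, spatial attention within each frame, temporal attention across frames, and per-frame gloss predictions aligned to the text via cross-entropy) can be sketched in PyTorch. The sketch below is a minimal illustration under stated assumptions: the module names, dimensions, factorized attention ordering, and gloss vocabulary size are hypothetical, positional embeddings are omitted for brevity, and none of it is the authors' released implementation.

# A minimal PyTorch sketch of the spatial-temporal encoding idea in the abstract.
# Module names, dimensions, the factorized attention ordering, and the gloss
# vocabulary size are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Self-attention over patches within a frame, then over frames per patch."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape                              # (batch, frames, patches, dim)
        xs = x.reshape(b * t, p, d)                       # spatial: patches of one frame
        n = self.norm1(xs)
        xs = xs + self.spatial_attn(n, n, n)[0]
        x = xs.reshape(b, t, p, d)
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)   # temporal: one patch across frames
        n = self.norm2(xt)
        xt = xt + self.temporal_attn(n, n, n)[0]
        return xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

class STTNSketch(nn.Module):
    """Patch-embed frames, apply spatial-temporal blocks, predict per-frame glosses."""
    def __init__(self, patch: int = 16, dim: int = 256, vocab: int = 1296, depth: int = 2):
        super().__init__()
        # "Image to patch": a strided convolution turns each frame into a short
        # sequence of non-overlapping patch embeddings, cutting attention cost.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(SpatialTemporalBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, vocab)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape                       # (batch, frames, 3, H, W)
        x = self.patch_embed(video.reshape(b * t, c, h, w))
        x = x.flatten(2).transpose(1, 2)                  # (b*t, patches, dim)
        x = x.reshape(b, t, -1, x.size(-1))               # (batch, frames, patches, dim)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x.mean(dim=2))                   # (batch, frames, vocab) logits

logits = STTNSketch()(torch.randn(1, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 8, 1296])

In this sketch, the per-frame logits would be matched to a target gloss sequence with a plain cross-entropy loss, mirroring the simple alignment strategy the abstract describes rather than a CTC objective.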
