Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

Computer Vision and Image Understanding - Volume 176 - Pages 22-32 - 2018
Themos Stafylakis1, Muhammad Haris Khan1,2, Georgios Tzimiropoulos1
1Computer Vision Laboratory, University of Nottingham, UK
2Electrical Engineering Department, COMSATS Lahore Campus, Pakistan

References

Afouras, 2018, The conversation: Deep audio-visual speech enhancement
Afouras, 2018, Deep lip reading: a comparison of models and an online application
Anina, 2015, OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis, 1
Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N., LipNet: sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016
Audhkhasi, K., Ramabhadran, B., Saon, G., Picheny, M., Nahamoo, D., Direct acoustics-to-word models for English conversational speech recognition. arXiv preprint arXiv:1703.07754, 2017
Bear, 2014, Which phoneme-to-viseme maps best improve visual-only computer lip-reading?, 230
Black, 1993, A framework for the robust estimation of optical flow, 231
Chan, 2001, HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features, 9
Chan, 2016, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 4960
Cheng, 2017, An exploration of dropout with LSTMs
Chung, 2017, Lip reading sentences in the wild
Chung, 2016, Lip reading in the wild, 87
Chung, 2018, Learning to lip read words by watching videos, Comput. Vis. Image Underst., 10.1016/j.cviu.2018.02.001
Dalton, 1996, Automatic speechreading using dynamic contours, 373
Gal, 2016, A theoretically grounded application of dropout in recurrent neural networks, 1019
Ganin, 2015, Unsupervised domain adaptation by backpropagation, 1180
Graves, 2006, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 369
He, 2016, Identity mappings in deep residual networks, 630
Hori, 2017, Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM
Ioffe, 2015, Batch normalization: accelerating deep network training by reducing internal covariate shift, 448
Jha, 2018, Word spotting in silent lip videos
Ji, 2013, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell., 35, 221, 10.1109/TPAMI.2012.59
Katsaggelos, 2015, Audiovisual fusion: Challenges and new approaches, Proc. IEEE, 103, 1635, 10.1109/JPROC.2015.2459017
Kingma, 2014, Adam: A method for stochastic optimization
Koumparoulis, 2017, Exploring ROI size in deep learning based lipreading
Li, J., Ye, G., Das, A., Zhao, R., Gong, Y., Advancing acoustic-to-word CTC model. arXiv preprint arXiv:1803.05566, 2018
Li, J., Ye, G., Zhao, R., Droppo, J., Gong, Y., Acoustic-to-word model without OOV. arXiv preprint arXiv:1711.10136, 2017
Luettin, J., Thacker, N.A., Speechreading using probabilistic models. Technical Report, IDIAP, 1997
MacDonald, 1978, Visual influences on speech perception processes, Percept. Psychophys., 24, 253, 10.3758/BF03206096
Matthews, 2002, Extraction of visual features for lipreading, IEEE Trans. Pattern Anal. Mach. Intell., 24, 198, 10.1109/34.982900
McGurk, 1976, Hearing lips and seeing voices, Nature, 264, 746, 10.1038/264746a0
Mroueh, 2015, Deep multimodal learning for audio-visual speech recognition, 2130
Papandreou, 2009, Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition, IEEE Trans. Audio, Speech, Lang. Process., 17, 423, 10.1109/TASL.2008.2011515
Petridis, 2018, End-to-end audiovisual speech recognition
Petridis, 2017, End-to-end audiovisual fusion with LSTMs
Potamianos, 2003, Audio-visual speech recognition in challenging environments, 1293
Potamianos, 2003, Recent advances in the automatic recognition of audiovisual speech, Proc. IEEE, 91, 1306, 10.1109/JPROC.2003.817150
Sak, 2014, Long short-term memory recurrent neural network architectures for large scale acoustic modeling
Shillingford, B., Assael, Y., Hoffman, M.W., Paine, T., Hughes, C., Prabhu, U., Liao, H., Sak, H., Rao, K., Bennett, L., et al., Large-scale visual speech recognition. arXiv preprint arXiv:1807.05162, 2018
Shiraishi, 2015, Optical flow based lip reading using non-rectangular ROI and head motion reduction, 1
Simonyan, 2014, Two-stream convolutional networks for action recognition in videos, 568
Stafylakis, 2017, Combining residual networks with LSTMs for lipreading
Stafylakis, 2018, Deep word embeddings for visual speech recognition
Stafylakis, 2018, Zero-shot keyword spotting for visual speech recognition in-the-wild
Tao, 2018, Gating neural network for large vocabulary audiovisual speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 26, 1286, 10.1109/TASLP.2018.2815268
Taylor, 2012, Dynamic units of visual speech, 275
Thiemann, J., Ito, N., Vincent, E., DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments, 2013
Tran, 2015, Learning spatiotemporal features with 3D convolutional networks
Vaswani, 2017, Attention is all you need, 5998
Wand, 2016, Lipreading with long short-term memory, 6115
Wand, 2017, Improving speaker-independent lipreading with domain-adversarial training
Zhou, 2014, A review of recent advances in visual speech decoding, Image Vision Comput., 32, 590, 10.1016/j.imavis.2014.06.004