Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs
References
Afouras, 2018, The conversation: Deep audio-visual speech enhancement
Afouras, 2018, Deep lip reading: a comparison of models and an online application
Anina, 2015, OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis, 1
Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N., LipNet: sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
Audhkhasi, K., Ramabhadran, B., Saon, G., Picheny, M., Nahamoo, D., Direct acoustics-to-word models for English conversational speech recognition. arXiv preprint arXiv:1703.07754, 2017.
Bear, 2014, Which phoneme-to-viseme maps best improve visual-only computer lip-reading?, 230
Black, 1993, A framework for the robust estimation of optical flow, 231
Chan, 2001, HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features, 9
Chan, 2016, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 4960
Cheng, 2017, An exploration of dropout with LSTMs
Chung, 2017, Lip reading sentences in the wild
Chung, 2016, Lip reading in the wild, 87
Chung, 2018, Learning to lip read words by watching videos, Comput. Vis. Image Underst., 10.1016/j.cviu.2018.02.001
Dalton, 1996, Automatic speechreading using dynamic contours, 373
Gal, 2016, A theoretically grounded application of dropout in recurrent neural networks, 1019
Ganin, 2015, Unsupervised domain adaptation by backpropagation, 1180
Graves, 2006, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 369
He, 2016, Identity mappings in deep residual networks, 630
Hori, 2017, Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM
Ioffe, 2015, Batch normalization: accelerating deep network training by reducing internal covariate shift, 448
Jha, 2018, Word spotting in silent lip videos
Ji, 2013, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell., 35, 221, 10.1109/TPAMI.2012.59
Katsaggelos, 2015, Audiovisual fusion: Challenges and new approaches, Proc. IEEE, 103, 1635, 10.1109/JPROC.2015.2459017
Kingma, 2014, Adam: A method for stochastic optimization
Koumparoulis, 2017, Exploring ROI size in deep learning based lipreading
Li, J., Ye, G., Das, A., Zhao, R., Gong, Y., Advancing Acoustic-to-Word CTC Model. arXiv preprint arXiv:1803.05566, 2018.
Li, J., Ye, G., Zhao, R., Droppo, J., Gong, Y., Acoustic-to-word model without OOV. arXiv preprint arXiv:1711.10136, 2017.
Luettin, J., Thacker, N.A., Speechreading using probabilistic models. Technical Report, IDIAP, 1997.
MacDonald, 1978, Visual influences on speech perception processes, Percept. Psychophys., 24, 253, 10.3758/BF03206096
Matthews, 2002, Extraction of visual features for lipreading, IEEE Trans. Pattern Anal. Mach. Intell., 24, 198, 10.1109/34.982900
McGurk, 1976, Hearing lips and seeing voices, Nature, 264, 746, 10.1038/264746a0
Mroueh, 2015, Deep multimodal learning for audio-visual speech recognition, 2130
Papandreou, 2009, Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition, IEEE Trans. Audio, Speech, Lang. Process., 17, 423, 10.1109/TASL.2008.2011515
Petridis, 2018, End-to-end audiovisual speech recognition
Petridis, 2017, End-to-end audiovisual fusion with LSTMs
Potamianos, 2003, Audio-visual speech recognition in challenging environments, 1293
Potamianos, 2003, Recent advances in the automatic recognition of audiovisual speech, Proc. IEEE, 91, 1306, 10.1109/JPROC.2003.817150
Sak, 2014, Long short-term memory recurrent neural network architectures for large scale acoustic modeling
Shillingford, B., Assael, Y., Hoffman, M.W., Paine, T., Hughes, C., Prabhu, U., Liao, H., Sak, H., Rao, K., Bennett, L., et al., Large-Scale Visual Speech Recognition. arXiv preprint arXiv:1807.05162, 2018.
Shiraishi, 2015, Optical flow based lip reading using non-rectangular ROI and head motion reduction, 1
Simonyan, 2014, Two-stream convolutional networks for action recognition in videos, 568
Stafylakis, 2017, Combining residual networks with LSTMs for lipreading
Stafylakis, 2018, Deep word embeddings for visual speech recognition
Stafylakis, 2018, Zero-shot keyword spotting for visual speech recognition in-the-wild
Tao, 2018, Gating neural network for large vocabulary audiovisual speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 26, 1286, 10.1109/TASLP.2018.2815268
Taylor, 2012, Dynamic units of visual speech, 275
Thiemann, J., Ito, N., Vincent, E., DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments, 2013.
Tran, 2015, Learning spatiotemporal features with 3D convolutional networks
Vaswani, 2017, Attention is all you need, 5998
Wand, 2016, Lipreading with long short-term memory, 6115
Wand, 2017, Improving speaker-independent lipreading with domain-adversarial training
Zhou, 2014, A review of recent advances in visual speech decoding, Image Vision Comput., 32, 590, 10.1016/j.imavis.2014.06.004