Comparison between different feature extraction techniques for audio-visual speech recognition
Abstract
Keywords
References
L. J. M. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Comparison between different feature extraction techniques in lipreading applications”, in Specom’2006, SPIIRAS, St. Petersburg, 2006.
J. C. Wojdel and L. J. M. Rothkrantz, “Visually based speech onset/offset detection”, in Proceedings of the 5th Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Application (Euromedia 2000), (Antwerp, Belgium), pp. 156–160, 2000.
L. J. M. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Fusing Data Streams in Continuous Audio-Visual Speech Recognition”, in Text, Speech and Dialogue: 8th International Conference, TSD 2005, vol. 3658, (Karlovy Vary, Czech Republic), pp. 33–44, Springer Berlin/Heidelberg, September 2005.
H. McGurk and J. MacDonald, “Hearing lips and seeing voices”, Nature, vol. 264, pp. 746–748, December 1976.
K. P. Green, P. K. Kuhl, A. N. Meltzoff, and E. B. Stevens, “Integrating speech information across talkers, gender, and sensory modality: female faces and male voices in the McGurk effect”, Perception and Psychophysics, vol. 50, no. 6, pp. 524–536, 1991.
N. Li, S. Dettmer, and M. Shah, “Lipreading using eigen sequences”, in Proc. International Workshop on Automatic Face- and Gesture-Recognition, (Zurich, Switzerland), pp. 30–34, 1995.
N. Li, S. Dettmer, and M. Shah, “Visually recognizing speech using eigensequences”, Motion-Based Recognition, 1997.
X. Hong, H. Yao, Y. Wan, and R. Chen, “A PCA Based Visual DCT Feature Extraction Method for Lip-Reading”, IIH-MSP, pp. 321–326, 2006.
C. Bregler and Y. Konig, ““Eigenlips” for robust speech recognition”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), 1994.
P. Duchnowski, M. Hunke, D. Büsching, U. Meier, and A. Waibel, “Toward Movement-Invariant Automatic Lip-Reading and Speech Recognition”, in International Conference on Acoustics, Speech, and Signal Processing, 1995 (ICASSP-95), vol. 1, pp. 109–112, 1995.
I. A. Essa and A. Pentland, “A Vision System for Observing and Extracting Facial Action Parameters”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 76–83, IEEE, June 1994.
S. Tamura, K. Iwano, and S. Furui, “A Robust Multi-Modal Speech Recognition Method Using Optical-Flow Analysis”, in Extended Summary of IDS02, (Kloster Irsee, Germany), pp. 2–4, June 2002.
T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images”, in AVSP2003, pp. 117–120, September 2003.
T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio-Visual Speech Recognition Using New Lip Features Extracted from Side-Face Images”, in Robust 2004, August 2004.
K. Mase and A. Pentland, “Automatic Lipreading by Optical-Flow Analysis”, Systems and Computers in Japan, vol. 22, pp. 67–76, 1991.
K. Iwano, S. Tamura, and S. Furui, “Bimodal Speech Recognition Using Lip Movement Measured by Optical-Flow Analysis”, in HSC2001, 2001.
D. J. Fleet, M. J. Black, Y. Yacoob, and A. D. Jepson, “Design and Use of Linear Models for Image Motion Analysis”, International Journal of Computer Vision, vol. 36, no. 3, pp. 171–193, 2000.
A. Martin, “Lipreading by Optical Flow Correlation”, tech. rep., Computer Science Department, University of Central Florida, 1995.
S. Tamura, K. Iwano, and S. Furui, “Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images”, J. VLSI Signal Process. Syst., vol. 36, no. 2–3, pp. 117–124, 2004.
S. Furui, “Robust Methods in Automatic Speech Recognition and Understanding”, in EUROSPEECH 2003, Geneva, Switzerland, 2003.
B. K. P. Horn and B. G. Schunck, “Determining optical flow”, Artificial Intelligence, vol. 17, pp. 185–203, 1981.
B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision”, in Proc. Seventh International Joint Conference on Artificial Intelligence, pp. 674–679, 1981.
A. Bruhn and J. Weickert, “Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods”, International Journal of Computer Vision, vol. 61, no. 3, pp. 211–231, 2005.
S. Uras, F. Girosi, A. Verri, and V. Torre, “A computational approach to motion perception”, Biological Cybernetics, vol. 60, pp. 79–87, December 1988.
H.-H. Nagel, “On the estimation of optical flow: relations between different approaches and some new results”, Artificial Intelligence, vol. 33, no. 3, pp. 298–324, 1987.
P. Anandan, “A Computational Framework and an Algorithm for the Measurement of Visual Motion”, International Journal of Computer Vision, vol. 2, pp. 283–310, 1989.
A. Singh, Optic Flow Computation: A Unified Perspective. IEEE Computer Society Press, 1991.
D. J. Heeger, “Model for the extraction of image flow”, Journal of the Optical Society of America A, vol. 4, pp. 1455–1471, August 1987.
A. Waxman, J. Wu, and F. Bergholm, “Convected activation profiles and receptive fields for real time measurement of short range visual motion”, in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 717–723, 1988.
D. J. Fleet and A. D. Jepson, “Computation of Component Image Velocity from Local Phase Information”, International Journal of Computer Vision, vol. 5, pp. 77–104, August 1990.
J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow techniques”, International Journal of Computer Vision, vol. 12, pp. 43–77, February 1994.
B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills, “Recovering Motion Fields: An Evaluation of Eight Optical Flow Algorithms”, in Proceedings of the British Machine Vision Conference (BMVC) ’98, September 1998.
L. Rabiner and B. Juang, Fundamentals of Speech Recognition. NJ, USA: Prentice Hall, 1993.
R. P. Lippmann, “Review of neural networks for speech recognition”, Neural Comput., vol. 1, no. 1, pp. 1–38, 1990.
L. J. Rothkrantz and D. Nollen, “Automatic speech recognition using recurrent neural networks”, Neural Network World, vol. 10, pp. 445–453, July 2000.
A. Ganapathiraju, Support Vector Machines for Speech Recognition. PhD thesis, Mississippi State University, 2002. Major Professor: Joseph Picone.
T. S. Andersen, K. Tiippana, and M. Lampien, “Modeling of audio-visual speech perception in noise”, in Proceedings of AVSP 2001, (Aalborg, Denmark), September 2001.
P. Smeele, Perceiving Speech: Integrating Auditory and Visual Speech. PhD thesis, Delft University of Technology, 1995.
D. Massaro, “A fuzzy logical model of speech perception”, in Proceedings of XXIV International Congress of Psychology. Human Information Processing: Measures, Mechanisms and Models (D. Vickers and P. Smith, eds.), (Amsterdam, North Holland), pp. 367–379, 1989.
G. Meyer, J. Mulligan, and S. Wuerger, “Continuous audio-visual digit recognition using N-best decision fusion”, Information Fusion, vol. 5, pp. 91–101, 2004.
S. Dupont, H. Bourlard, and C. Ris, “Robust Speech Recognition Based on Multi-Stream Features”, in Proc. of ESCA/NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, (Pont-à-Mousson, France), pp. 95–98, 1997.
J. Luettin and S. Dupont, “Continuous Audio-Visual Speech Recognition”, IDIAP, Dalle Molle Institute for Perceptual Artificial Intelligence, 1998.
Z. Ghahramani and M. I. Jordan, “Factorial Hidden Markov Models”, in Proc. Conf. Advances in Neural Information Processing Systems, NIPS, 1997.
P. Viola and M. Jones, “Robust Real-time Object Detection”, in Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing, and Sampling, (Vancouver, Canada), July 2001.
T. Coianiz, L. Torresani, and B. Caprile, “2D deformable models for visual speech analysis”, in Speechreading by Humans and Machines: Models, Systems, and Applications (D. G. Stork and M. E. Hennecke, eds.), vol. 150 of NATO ASI Series F: Computer and Systems Sciences, Berlin and New York: Springer, 1996.
J. Millar, M. Wagner, and R. Goecke, “Aspects of Speaking-Face Data Corpus Design Methodology”, in Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP 2004), vol. II, (Jeju, Korea), pp. 1157–1160, October 2004.
J. R. Movellan, “Visual Speech Recognition with Stochastic Networks”, in Advances in Neural Information Processing Systems, vol. 7, (Cambridge), MIT Press, 1995.
N. A. Fox, Audio and Video Based Person Identification. PhD thesis, Department of Electronic and Electrical Engineering, Faculty of Engineering and Architecture, University College Dublin, 2005.
K. Messer, J. Matas, and J. Kittler, “Acquisition of a large database for biometric identity verification”, in BIOSIGNAL 98 (J. Jan, J. Kozumplík, and Z. Szabó, eds.), (Technical University Brno, Purkynova 188, 612 00 Brno, Czech Republic), pp. 70–72, Vutium Press, June 1998.
E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran, “The BANCA Database and Evaluation Protocol”, in Audio- and Video-Based Biometric Person Authentication, vol. 2688, pp. 625–638, Springer Berlin/Heidelberg, 2003.
E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.
S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4). 2005.
K. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.
J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1988.
A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, “Dynamic Bayesian Networks for Audio-Visual Speech Recognition”, EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1274–1288, 2002.