Audio segmentation-by-classification approach based on factor analysis in broadcast news domain

EURASIP Journal on Audio, Speech, and Music Processing - Volume 2014 - Pages 1-13 - 2014

Diego Castán¹, Alfonso Ortega¹, Antonio Miguel¹, Eduardo Lleida¹

¹Departamento Ingeniería Electrónica y Comunicaciones, Universidad de Zaragoza, Zaragoza, Spain

Abstract

This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates for within-class variability by using class-dependent factor loading matrices and obtains scores by computing the log-likelihood ratio between the class model and a non-class model over fixed-length windows. These scores are then smoothed by different back-end systems to yield longer contiguous segments of the same class. Unlike previous solutions, our proposal does not rely on specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audio from TV shows into five acoustic classes: speech, music, speech with music, speech with noise, and others. Compared with a hierarchical system that uses specific acoustic features, the technique achieves a significant error reduction.
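The scoring-then-smoothing pipeline in the abstract can be illustrated with a minimal sketch. It assumes the per-window log-likelihood ratios have already been computed for each class; the factor analysis models that would produce them are not shown. The moving-average smoother is only a stand-in, since the abstract does not specify the back-end systems, and the names `smooth`, `segment`, `window_s`, and `half_width` are hypothetical.

```python
from typing import Dict, List, Tuple


def smooth(scores: List[float], half_width: int) -> List[float]:
    """Moving average over a centred window, truncated at the edges.

    Assumption: a stand-in for the paper's back-end systems.
    """
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half_width)
        hi = min(len(scores), i + half_width + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def segment(
    llr: Dict[str, List[float]], window_s: float, half_width: int = 1
) -> List[Tuple[float, float, str]]:
    """Turn per-window, per-class LLR scores into (start, end, label) segments."""
    smoothed = {c: smooth(s, half_width) for c, s in llr.items()}
    n = len(next(iter(smoothed.values())))
    # Pick the class with the highest smoothed score in each window.
    labels = [max(smoothed, key=lambda c: smoothed[c][i]) for i in range(n)]
    # Merge runs of identical labels into contiguous segments.
    segments = []
    start = 0
    for i in range(1, n + 1):
        if i == n or labels[i] != labels[start]:
            segments.append((start * window_s, i * window_s, labels[start]))
            start = i
    return segments


# Toy scores (made up for illustration): one spurious dip at window 2.
speech = [3.0, 3.0, -1.0, 3.0, 3.0, -3.0, -3.0, -3.0]
llr = {"speech": speech, "music": [-s for s in speech]}

print(segment(llr, window_s=0.5, half_width=0))  # no smoothing
print(segment(llr, window_s=0.5, half_width=1))  # with smoothing
```

With `half_width=0` the isolated dip splits the speech region into several short segments, while `half_width=1` absorbs it into one longer speech segment followed by music. This is the effect the smoothing stage is meant to achieve.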

