Audio segmentation-by-classification approach based on factor analysis in broadcast news domain

Diego Castán, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
Departamento Ingeniería Electrónica y Comunicaciones, Universidad de Zaragoza, Zaragoza, Spain

Abstract

This paper presents a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates for within-class variability by using class-dependent factor loading matrices, and it obtains scores by computing the log-likelihood ratio of the class model to a non-class model over fixed-length windows. These scores are then smoothed by one of several back-end systems to yield longer contiguous segments of the same class. Unlike previous solutions, our proposal does not rely on specific acoustic features and does not need a hierarchical structure. The method is applied to segment and classify audio from TV shows into five acoustic classes: speech, music, speech with music, speech with noise, and others. Compared with a hierarchical system that uses specific acoustic features, the proposed technique achieves a significant error reduction.
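The scoring and smoothing steps described above can be illustrated with a minimal sketch. It is not the authors' system: the factor-analysis class models are replaced here by plain diagonal Gaussians, the back-end is a simple moving average, and every name (`gaussian_loglik`, `window_llr`, `smooth`, `segment`, `toy_models`) is hypothetical. Only the overall flow follows the abstract: a class versus non-class log-likelihood ratio per fixed-length window, then smoothing, then merging into contiguous segments.

```python
import math


def gaussian_loglik(frame, mean, var):
    """Log-likelihood of one feature vector under a diagonal Gaussian.

    Assumption: a stand-in for the factor-analysis class model.
    """
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def window_llr(frames, class_model, non_class_model, win):
    """Mean class vs. non-class LLR over consecutive fixed-length windows."""
    scores = []
    for start in range(0, len(frames) - win + 1, win):
        window = frames[start:start + win]
        llr = sum(
            gaussian_loglik(f, *class_model) - gaussian_loglik(f, *non_class_model)
            for f in window
        )
        scores.append(llr / win)
    return scores


def smooth(scores, half_width):
    """Moving average over neighbouring windows (assumed stand-in back-end)."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half_width), min(len(scores), i + half_width + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def segment(frames, models, win=10, half_width=1):
    """Pick the best smoothed class per window, then merge equal neighbours.

    `models` maps a class name to a (class_model, non_class_model) pair,
    where each model is a (mean, variance) tuple of lists.
    Returns (start_frame, end_frame, label) tuples.
    """
    smoothed = {
        name: smooth(window_llr(frames, cls, non, win), half_width)
        for name, (cls, non) in models.items()
    }
    n_windows = len(next(iter(smoothed.values())))
    labels = [max(smoothed, key=lambda c: smoothed[c][i]) for i in range(n_windows)]

    segments = []
    for i, label in enumerate(labels):
        if segments and segments[-1][2] == label:
            # Extend the current segment instead of opening a new one.
            segments[-1] = (segments[-1][0], (i + 1) * win, label)
        else:
            segments.append((i * win, (i + 1) * win, label))
    return segments


# Toy data: 1-D features, two classes, each using the other as its non-class model.
speech = ([0.0], [1.0])
music = ([5.0], [1.0])
toy_models = {"speech": (speech, music), "music": (music, speech)}
frames = [[0.0]] * 30 + [[5.0]] * 30

segments = segment(frames, toy_models, win=10, half_width=1)
print(segments)  # [(0, 30, 'speech'), (30, 60, 'music')]
```

In this toy setup the non-class model for each class is simply the other class; with five classes, a non-class model would more plausibly be trained on the pooled data of the remaining classes, but that choice is an assumption and not something the abstract specifies.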
