Structure of pauses in speech in the context of speaker verification and classification of speech type

EURASIP Journal on Audio, Speech, and Music Processing - Volume 2016 - Pages 1-16 - 2016

Magdalena Igras-Cybulska¹, Bartosz Ziółko¹,², Piotr Żelasko¹,², Marcin Witkowski¹,²

¹Department of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Kraków, Poland

²Techmo, Kraków, Poland

Abstract

This paper describes statistics of pauses in Polish speech as a potential source of biometric information for automatic speaker recognition. The use of three main types of acoustic pauses (silent, filled and breath pauses) and of syntactic pauses (punctuation marks in speech transcripts) was investigated quantitatively in three types of spontaneous speech (presentations, simultaneous interpretation and radio interviews) and in read speech (audio books). Selected pause parameters, extracted for each speaker separately or for groups of speakers, were examined statistically to verify how useful pause information is for speaker recognition and speaker profile estimation. The features that characterized speakers best were the quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntactic structure of the spoken language. An experiment using pauses in a speaker biometry system (based on a Universal Background Model and i-vectors) yielded a 30 % equal error rate. Adding pause-related features to the baseline Mel-frequency cepstral coefficient system did not significantly improve its performance. In an experiment on automatic recognition of the three types of spontaneous speech, a GMM classifier achieved 78 % accuracy. Silent pause-related features made it possible to distinguish read from spontaneous speech with 75 % accuracy using extreme gradient boosting.
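The equal error rate quoted in the abstract is the operating point at which the false-acceptance rate (impostors accepted) equals the false-rejection rate (genuine speakers rejected). The sketch below shows one simple way to estimate it, assuming the verification trials have already been scored. The function name `equal_error_rate` and the sample scores are hypothetical illustrations, not the authors' evaluation code or data.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER and the threshold at which it occurs.

    Assumption: a higher score means the trial is more likely a target
    (same-speaker) trial. Both inputs are plain sequences of floats.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, impostor]))

    best_gap, eer, best_thr = np.inf, 1.0, thresholds[0]
    for thr in thresholds:
        far = np.mean(impostor >= thr)  # impostors wrongly accepted
        frr = np.mean(target < thr)     # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return eer, best_thr

# Hypothetical scores, for illustration only.
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {thr}")
```

With a finite set of trials the two error rates rarely meet exactly, so this sketch reports their average at the closest threshold; evaluation toolkits typically interpolate along the DET curve instead.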

