A review of speaker diarization: Recent advances with deep learning

Computer Speech & Language - Volume 72 - Article 101317 - 2022

Tae Jin Park¹, Naoyuki Kanda², Dimitrios Dimitriadis², Kyu J. Han³, Shinji Watanabe⁴, Shrikanth Narayanan¹

¹University of Southern California, Los Angeles, USA

²Microsoft, Redmond, USA

³ASAPP, Mountain View, USA

⁴Johns Hopkins University, Baltimore, USA

Tài liệu tham khảo

Addlesee, A., Yu, Y., Eshghi, A., 2020. A comprehensive evaluation of incremental speech recognition and diarization for conversational AI. In: Proceedings of the International Conference on Computational Linguistics. pp. 3492–3503. Ajmera, J., Lathoud, G., McCowan, L., 2004. Clustering and segmenting speakers and their locations in meetings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 605–608. Ajmera, 2004, Robust speaker change detection, IEEE Signal Process. Lett., 11, 649, 10.1109/LSP.2004.831666 Ajmera, J., Wooters, C., 2003. A robust speaker clustering algorithm. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 411–416. AMI, 2009 Anguera, 2012, 356 Anguera, X., Wooters, C., Hernando, J., 2006. Purity algorithms for speaker diarization of meetings data. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. I. pp. 1025–1028. Anguera, 2007, Acoustic beamforming for speaker diarization of meetings, IEEE Trans. Audio Speech Lang. Process., 15, 2011, 10.1109/TASL.2007.902460 Araki, S., Ono, N., Kinoshita, K., Delcroix, M., 2018. Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5694–5698. Arora, 2020 Barker, J., Watanabe, S., Vincent, E., Trmal, J., 2018. The fifth ’CHiME’ speech separation and recognition challenge: dataset, task and baselines. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1561–1565. Blei, 2011, Distance dependent Chinese restaurant processes, J. Mach. Learn. Res., 12 Boakye, 2008, Overlapped speech detection for improved speaker diarization in multiparty meetings, 4353 Boeddecker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., Haeb-Umbach, R., 2018. 
Front-end processing for the CHiME-5 dinner party scenario. In: Proceedings of CHiME 2018 Workshop on Speech Processing in Everyday Environments. pp. 35–40. Bonastre, 2000, A speaker tracking system based on speaker turn detection for NIST evaluation, 1177 Bone, 2017, Signal processing and machine learning for mental health research and clinical applications, IEEE Signal Process. Mag., 34, 189, 10.1109/MSP.2017.2718581 Bozonnet, 2010, System output combination for improved speaker diarization, 2642 Brummer, N., Burget, L., Cernocky, J., Glembek, O., Grezl, F., Karafiat, M., A. van Leeuwen, D., Matejka, P., Schwarz, P., Strasheim, A., 2007. Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006. Vol. 15. No. 7. pp. 2072–2084. Buchner, H., Aichner, R., Kellermann, W., 2005. A Generalization of Blind Source Separation Algorithms for Convolutive Mixtures Based on Second-Order Statistics. Vol. 13. No. 1. Pp. 120–134. Bullock, 2020, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, 7114 Canseco-Rodriguez, L., Lamel, L., Gauvain, J.-L., 2004. Speaker diarization from speech transcripts. In: Proceedings of the International Conference on Spoken Language Processing. Vol. 4. pp. 3–7. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., andD. Reidsma, W.P., Wellner, P., 2006. The AMI meeting corpus: a pre-announcement. In: Proceedings of Int. Worksh. Machine Learning for Multimodal Interaction. pp. 28–39. Carletta, 2005, The AMI meeting corpus: A pre-announcement, 28 Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., Vair, C., 2008. Stream-based speaker segmentation using speaker factors and eigenvoices. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4133–4136. 
Cetin, 2006, Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap, 357 Chakravarthula, S.N., Nasir, M., Tseng, S.-Y., Li, H., Park, T.J., Baucom, B., Bryan, C.J., Narayanan, S., Georgiou, P., 2020. Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6539–6543. Chang, X., Qian, Y., Yu, K., Watanabe, S., 2019. End-to-end monaural multi-speaker ASR system without pretraining. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6256–6260. Chen, 1998, 127 Chen, 1998, Speaker, environment and channel change detection and clustering via the bayesian information criterion, 127 Chen, 2020, Continuous speech separation: Dataset and analysis, 7284 Chengalvarayan, R., 1999. Robust energy normalization using speech/nonspeech discriminator for German connected digit recognition. In: Sixth European Conference on Speech Communication and Technology. Chiu, 2017 Chung, J.S., Huh, J., Nagrani, A., Afouras, T., Zisserman, A., 2020. Spot the conversation: speaker diarisation in the wild. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 299–303. Chung, 2019 Comaniciu, 2002, Mean shift: A robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell., 24, 603, 10.1109/34.1000236 Dehak, 2011 Delacourt, 2000, DISTBIC: A speaker-based segmentation for audio data indexing, Speech Commun., 32, 111, 10.1016/S0167-6393(00)00027-3 Delcroix, M., Watanabe, S., Ochiai, T., Kinoshita, K., Karita, S., Ogawa, A., Nakatani, T., 2019. End-to-end speakerbeam for single channel target speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 451–455. 
Delcroix, 2018, Single channel target speaker extraction and recognition with speaker beam, 5554 Diez, 2019, 355 Diez, 2020, Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge, 6519 Diez, 2018, Speaker diarization based on Bayesian HMM with eigenvoice priors, 147 Diez, M., Burget, L., Wang, S., Rohdin, J., Cernockỳ, J., 2019. Bayesian HMM based x-vector clustering for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 346–350. Diez, M., Landini, F., Burget, L., Rohdin, J., Silnova, A., Zmolíková, K., Novotnỳ, O., Veselỳ, K., Glembek, O., Plchot, O., et al., 2018b. BUT system for DIHARD speech diarization challenge 2018. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2798–2802. Dimitriadis, 2019 Dimitriadis, D., Fousek, P., 2017. Developing on-line speaker diarization system. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2739–2743. Dimitriadis, D., Fousek, P., 2017. Developing on-line speaker diarization system. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2739–2743. Drude, L., Haeb-Umbach, R., 2017. Tight integration of spatial and spectral features for BSS with deep clustering embeddings. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2650–2654. Drude, 2018, NARA-WPE: A python package for weighted prediction error dereverberation in numpy and tensorflow for online and offline processing, 1 Drugman, 2015, Voice activity detection: Merging source and filter-based information, IEEE Signal Process. Lett., 23, 252, 10.1109/LSP.2015.2495219 Du, J., Tu, Y., Sun, L., Ma, F., Wang, H., Pan, J., Liu, C., Chen, J., Lee, C., 2016. The USTC-iFlytek system for CHiME-4 challenge. In: Proceedings of CHiME-4 Workshop. pp. 36–38. 
Erdogan, 2015, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, 708 Erdogan, H., Hershey, J.R., Watanabe, S., Mandel, M.I., Le Roux, J., 2016. Improved MVDR beamforming using single-channel mask prediction networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1981–1985. Finley, G.P., Edwards, E., Robinson, A., Sadoughi, N., Fone, J., Miller, M., Suendermann-Oeft, D., Brenndoerfer, M., Axtmann, N., 2018. An automated assistant for medical scribes. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 3212–3213. Fiscus, 1997, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), 347 Fiscus, 2007, 373 Fiscus, 2006, The rich transcription 2006 spring meeting recognition evaluation, 309 Flemotomos, N., Dimitriadis, D., 2020. A memory augmented architecture for continuous speaker identification in meetings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6524–6528. Flemotomos, 2020, Linguistically aided speaker diarization using speaker role information, 117 Fu, 2021 Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., Watanabe, S., 2019. End-to-end neural speaker diarization with permutation-free objectives. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 4300–4304. Fujita, 2019, End-to-end neural speaker diarization with self-attention, 296 Fujita, 2020 Galliano, S., Gravier, G., Chaubard, L., 2009. The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In: Tenth Annual Conference of the International Speech Communication Association. Gangadharaiah, R., Narayanaswamy, B., Balakrishnan, N., 2004. A novel method for two-speaker segmentation. In: Eighth International Conference on Spoken Language Processing. 
Gao, 2018, Densely connected progressive learning for lstm-based speech enhancement, 5054 Garcia-Romero, D., Espy-Wilson, C.Y., 2011. Analysis of i-vector Length Normalization in Speaker Recognition Systems. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 249–252. Garofolo, 2004 Gauvain, J.-L., Lamel, L., Adda, G., 1998. Partitioning and transcription of broadcast news data. In: Proceedings of the International Conference on Spoken Language Processing. pp. 1335–1338. Gelly, 2017, Optimization of RNN-based speech activity detection, IEEE/ACM Trans. Audio Speech Lang Process., 26, 646, 10.1109/TASLP.2017.2769220 Georgiou, P.G., Black, M.P., Narayanan, S.S., 2011. Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments. In: Proceedings of the Joint ACM Workshop on Human Gesture and Behavior Understanding. pp. 7–12. Gish, H., Siu, M., Rohlicek, R., 1991. Segregation of speakers for speech recognition and speaker identification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 873–876. Gish, 1991, Segregation of speakers for speech recognition and speaker identification, 873 Gravier, G., Adda, G., Paulson, N., Carré, M., Giraudel, A., Galibert, O., 2012. The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In: LREC-Eighth International Conference on Language Resources and Evaluation. pp. na. Gravier, 2004, The ESTER evaluation campaign for the rich transcription of french broadcast news Guo, A., Faria, A., Riedhammer, J., 2016. Remeeting – Deep insights to conversations. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1964–1965. Guo, X., Gao, L., Liu, X., Yin, J., 2017. Improved deep embedded clustering with local structure preservation. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 
1753–1759. Haeb-Umbach, 2019, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, 36, 111 Han, E., Lee, C., Stolcke, A., 2021. BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7193–7197. Han, 2020 Han, K.J., Narayanan, S.S., 2007. A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system. In: Proceedings of the Annual Conference of the International Speech Communication Association. Haws, D., Dimitriadis, D., Saon, G., Thomas, S., Picheny, M., 2016. On the importance of event detection for ASR. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969. He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778. Heigold, G., Moreno, I., Bengio, S., Shazeer, N., 2016. End-to-end text-dependent speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5115–5119. Hershey, 2016, Deep clustering: Discriminative embeddings for segmentation and separation, 31 Heymann, 2016, Neural network based spectral mask estimation for acoustic beamforming, 196 Horiguchi, S., Fujita, Y., Watanabe, S., Xue, Y., Nagamatsu, K., 2020. End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 269–273. 
Horiguchi, 2020 Huang, 2007, The IBM RT07 evaluation systems for speaker diarization on lecture meetings, 497 Huang, 2020, Speaker diarization with region proposal network, 6514 Huijbregts, 2009, The majority wins: a method for combining speaker diarization systems, 924 Ito, N., Araki, S., Yoshioka, T., Nakatani, T., 2014. Relaxed disjointness based clustering for joint blind source separation and dereverberation. In: Proceedings of International Workshop on Acoustic Echo and Noise Control. pp. 268–272. Itu, 1996 Jain, U., Siegler, M.A., Doh, S.-J., Gouvea, E., Huerta, J., Moreno, P.J., Raj, B., Stern, R.M., 1996. Recognition of continuous broadcast news with multiple unknown speakers and environments. In: Proceedings of ARPA Spoken Language Technology Workshop. pp. 61–66. Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., Wooters, C., 2003. The ICSI meeting corpus. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. I–364–I–367. Jiang, 2014, PLDA In the i-supervector space for text-independent speaker verification, EURASIP J. Audio Speech Music Process., 2014, 1, 10.1186/s13636-014-0029-2 Jin, Q., Laskowski, K., Schultz, T., Waibel, A., 2004. Speaker segmentation and clustering in meetings. In: Proceedings of the International Conference on Spoken Language Processing. pp. 597–600. Kanagasundaram, 2014, i-vector Based speaker recognition using advanced channel compensation techniques, Comput. Speech Lang., 28, 121, 10.1016/j.csl.2013.04.002 Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Sridharan, S., Mason, M., 2012. Weighted LDA techniques for i-vector based speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4781–4784. 
Kanagasundaram, 2011, i-vector Based speaker recognition on short utterances, 2341 Kanda, N., Boeddeker, C., Heitkaemper, J., Fujita, Y., Horiguchi, S., Nagamatsu, K., Haeb-Umbach, R., 2019. Guided source separation meets a strong ASR backend: hitachi/paderborn university joint investigation for dinner Party ASR. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1248–1252. Kanda, N., Chang, X., Gaur, Y., Wang, X., Meng, Z., Chen, Z., Yoshioka, T., 2021. Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings. In: Proceedings of IEEE Spoken Language Technology Workshop. Kanda, N., Fujita, Y., Horiguchi, S., Ikeshita, R., Nagamatsu, K., Watanabe, S., 2019. Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6630–6634. Kanda, N., Gaur, Y., Wang, X., Meng, Z., Chen, Z., Zhou, T., Yoshioka, T., 2020. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 36–40. Kanda, N., Gaur, Y., Wang, X., Meng, Z., Yoshioka, T., 2020. Serialized output training for end-to-end overlapped speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2797–2801. Kanda, N., Horiguchi, S., Fujita, Y., Xue, Y., Nagamatsu, K., Watanabe, S., 2019. Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 31–38. Kanda, N., Horiguchi, S., Takashima, R., Fujita, Y., Nagamatsu, K., Watanabe, S., 2019. Auxiliary interference speaker loss for target-speaker speech recognition. 
In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 236–240. Kanda, N., Meng, Z., Lu, L., Gaur, Y., Wang, X., Chen, Z., Yoshioka, T., 2021. Minimum Bayes risk training for end-to-end speaker-attributed ASR. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6503–6507. Kanda, 2021 Kemp, T., Schmidt, M., Westphal, M., Waibel, A., 2000. Strategies for automatic segmentation of audio data. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 3. pp. 1423–1426. Kenny, 2008 Kenny, 2010, BayesIan speaker verification with heavy-tailed priors. Kenny, 2005, Eigenvoice modeling with sparse training data, IEEE Trans. Speech Audio Process., 13, 345, 10.1109/TSA.2004.840940 Kenny, 2007, 1448 Kenny, 2008, 980 Kenny, 2010, 1059 Kenny, 2010, Diarization of telephone conversations using factor analysis, IEEE J. Sel. Top. Sign. Proces., 4, 1059, 10.1109/JSTSP.2010.2081790 Kinoshita, 2020, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, 381 Kinoshita, K., Delcroix, M., Tawara, N., 2021. Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7198–7202. Kolbæk, 2017, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, 25, 1901 Kounades-Bastian, 2017, An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures, 16 Kounades-Bastian, 2017, Exploiting the intermittency of speech for joint separation and diarization, 41 Kuhn, 1955, The hungarian method for the assignment problem, Nav. Res. Logist. 
Q., 2, 83, 10.1002/nav.3800020109 Kumar, 2020, Speaker diarization for naturalistic child-adult conversational interactions using contextual information., J. Acoust. Soc. Am., 147, EL196, 10.1121/10.0000736 Landini, 2020 Leeuwen, D.A.V., Konecny, M., 2007. Progress in the AMIDA speaker diarization system for meeting data. In: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007. pp. 475–483. Li, B., Sainath, T.N., Narayanan, A., Caroselli, J., Bacchiani, M., Misra, A., Shafran, I., Sak, H., Punduk, G., Chin, K., Sim, K.C., Weiss, R.J., Wilson, K.W., Variani, E., Kim, C., Siohan, O., Weintrauba, M., McDermott, E., Rose, R., Shannon, M., 2017. Acoustic modeling for Google Home. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 399–403. Lin, Q., Hou, Y., Li, M., 2020. Self-attentive similarity measurement strategies in speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 284–288. Lin, Q., Yin, R., Li, M., Bredin, H., Barras, C., 2019. LSTM based similarity measurement with spectral clustering for speaker diarization. In: Proc. Interspeech 2019. pp. 366–370. Liu, D., Kubala, F., 1999. Fast speaker change detection for broadcast news transcription and indexing. In: Proceedings of the International Conference on Spoken Language Processing. pp. 1031–1034. Liu, D., Kubala, F., 2003. A cross-channel modeling approach for automatic segmentation of conversational telephone speech. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 333–338. Loizou, 2013 Luo, 2019, Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation, 27, 1256 Luque, 2012, On the use of agglomerative and spectral clustering in speaker diarization of meetings, 130 Maciejewski, 2018 MacQueen, 1967, Some methods for classification and analysis of multivariate observations, 281 Maekawa, K., 2003. 
Corpus of spontaneous Japanese: Its design and evaluation. In: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. pp. 7–12. Malegaonkar, 2006, Unsupervised speaker change detection using probabilistic pattern matching, IEEE Signal Process. Lett., 13, 509, 10.1109/LSP.2006.873656 Mao, H.H., Li, S., McAuley, J., Cottrell, G., 2020. Speech recognition and multi-speaker diarization of long conversations. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 691–695. Matějka, 2011, Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification, 4828 Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., Timofeeva, T., Mitrofanov, A., Andrusenko, A., Podluzhny, I., Laptev, A., Romanenko, A., 2020. Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 274–278. Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., Timofeeva, T., Mitrofanov, A., Andrusenko, A., Podluzhny, I., et al., 2020. The STC system for the CHiME-6 challenge. In: CHiME 2020 Workshop on Speech Processing in Everyday Environments. Meignier, 2006, Step-by-step and integrated approaches in broadcast news speaker diarization, Comput. Speech Lang., 20, 303, 10.1016/j.csl.2005.08.002 Mirheidari, 2017, Toward the automation of diagnostic conversation analysis in patients with memory complaints, J. Alzheimer’s Dis., 58, 373, 10.3233/JAD-160507 Mori, K., Nakagawa, S., 2001. Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 1. pp. 413–416. 
Nagrani, 2020 Nakatani, 2010, Speech dereverberation based on variance-normalized delayed linear prediction, 18, 1717 Narayanan, 2013, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proc. IEEE, 101, 1203, 10.1109/JPROC.2012.2236291 Nemer, 2001, Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech Audio Process., 9, 217, 10.1109/89.905996 Nesta, F., Svaizer, P., Omologo, M., 2011. Convolutive BSS of short mixtures by ICA recursively regularized across frequencies. Vol. 19. No. 3. pp. 624–639. Ng, 2001, On spectral clustering: Analysis and an algorithm, Adv. Neural Inf. Process. Syst., 14, 849 Ng, T., Zhang, B., Nguyen, L., Matsoukas, S., Zhou, X., Mesgarani, N., Veselỳ, K., Matějka, P., 2012. Developing a speech activity detection system for the DARPA RATS program. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1969–1972. Ning, H., Liu, M., Tang, H., Huang, T.S., 2006. A spectral clustering approach to speaker diarization. In: Proceedings of the International Conference on Spoken Language Processing. pp. 2178–2181. NIST, 2009 Novoselov, S., Gusev, A., Ivanov, A., Pekhovsky, T., Shulipa, A., Avdeeva, A., Gorlanov, A., Kozlov, A., 2019. Speaker diarization with deep speaker embeddings for DIHARD challenge II. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1003–1007. Otterson, 2007, Efficient use of overlap information in speaker diarization, 683 Padmanabhan, M., Bahl, L.R., Nahamoo, D., Picheny, M.A., 1996. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 701–704. Panayotov, 2015, LibriSpeech: an ASR corpus based on public domain audio books, 5206 Park, T.J., Georgiou, P., 2018. 
Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1373–1377. Park, T.J., Han, K.J., Huang, J., He, X., Zhou, B., Georgiou, P., Narayanan, S., 2019. Speaker diarization with lexical information. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 391–395. Park, 2019, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Process. Lett., 27, 381, 10.1109/LSP.2019.2961071 Park, T.J., Kumar, M., Narayanan, S., 2021. Multi-scale speaker diarization with neural affinity score fusion. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7173–7177. Pfau, 2001, Multispeaker speech activity detection for the ICSI meeting recorder, 107 Raj, D., Denisov, P., Chen, Z., Erdogan, H., Huang, Z., He, M., Watanabe, S., Du, J., Yoshioka, T., Luo, Y., Kanda, N., Li, J., Wisdom, S., R. Hershey, J., 2021. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. In: Proceedings of IEEE Spoken Language Technology Workshop. Recht, 2010, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Rev., 52, 471, 10.1137/070697835 Ren, 2016, 1137 Reynolds, 2000, Speaker verification using adapted Gaussian mixture models, Digit. Signal Process., 10, 19, 10.1006/dspr.1999.0361 Reynolds, D.A., Torres-Carrasquillo, P., 2005. Approaches and applications of audio diarization. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 953–956. Rohlicek, J.R., Ayuso, D., Bates, M., Bobrow, R., Boulanger, A., Gish, H., Jeanrenaud, P., Meteer, M., Siu, M., 1992. Gisting conversational speech. 
In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 113–116. Rosenberg, A.E., Gorin, A., Liu, Z., Parthasarathy, P., 2002. Unsupervised speaker segmentation of telephone conversations. In: Proceedings of the International Conference on Spoken Language Processing. pp. 565–568. Rougui, 2006, Fast incremental clustering of gaussian mixture speaker models for scaling up retrieval in on-line broadcast Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., Liberman, M., 2018. The first DIHARD speech diarization challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association. Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., Liberman, M., 2019. The second DIHARD diarization challenge: dataset, task, and baselines. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 978–982. Ryant, 2020 Ryant, N., Liberman, M., Yuan, J., 2013. Speech activity detection on youtube using deep neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 728–731. Salmun, 2017, PLDA-based mean shift speakers’ short segments clustering, Comput. Speech Lang., 45, 411, 10.1016/j.csl.2017.04.006 Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T., 2016. Meta-learning with memory-augmented neural networks. In: Proceedings of International Conference on Machine Learning. pp. 1842—1850. Santoro, 2018, Grelational recurrent neural networks, Proc. Adv. Neural Inf. Process. Syst, 7299 Saon, 2017 Sarikaya, 1998, Robust detection of speech activity in the presence of noise, 1455 Sawada, 2007, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, Int. Symp. Circ. Syst., 3247 Sawada, H., Araki, S., Makino, S., 2011. 
Underdetermined Convolutive Blind Source Separation Via Frequency Bin-Wise Clustering and Permutation Alignment. Vol. 19. No. 3. pp. 516–527. Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J.R., 2018. A purely end-to-end system for multi-speaker speech recognition. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Vol. 1. pp. 2620–2630. Sell, 2014, Speaker diarization with PLDA i-vector scoring and unsupervised calibration, 413 Sell, 2015, Diarization resegmentation in the factor analysis subspace, 4794 Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V., Dehak, N., Povey, D., Watanabe, S., Khudanpur, S., 2018. Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2808–2812. Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V., Dehak, N., Povey, D., Watanabe, S., et al., 2018. Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2808–2812. Senoussaoui, 2010, An i-vector extractor suitable for speaker recognition with both microphone and telephone speech, 6 Senoussaoui, 2013, Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, 7712 Senoussaoui, 2013, A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Trans. Audio Speech Lang. Process., 22, 217, 10.1109/TASLP.2013.2285474 Shafey, 2019, Joint speech recognition and speaker diarization via sequence transduction, 396 Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., Glass, J., 2011. Exploiting intra-conversation variability for speaker diarization. 
In: Proceedings of the Annual Conference of the International Speech Communication Association. Shum, 2013, 2015 Shum, S., Dehak, N., Glass, J., 2012. On the use of spectral and iterative methods for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 482–485. Siegler, M.A., Jain, U., Raj, B., Stern, R.M., 1997. Automatic segmentation, classification and clustering of broadcast news audio. In: Proc. DARPA Speech Recognition Workshop. Silovsky, 2012, Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams, 118 Siu, M.-H., George, Y., Gish, H., 1992. An unsupervised, sequential learning algorithm for segmentation for speech waveforms with multiple speakers. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 189–192. Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. Deep neural network embeddings for text-independent speaker verification. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 999–1003. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S., 2018. X-vectors: Robust DNN embeddings for speaker recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5329–5333. Sohn, 1999, A statistical model-based voice activity detection, IEEE Signal Process. Lett., 6, 1, 10.1109/97.736233 Stafylakis, 2010, Speaker clustering via the mean shift algorithm, Recall, 2, 7 Stolcke, A., 2011. Making the most from multiple microphones in meeting recordings. In: Proceedings of IEEE International Conference on Acoustics, Speech an Signal Processing. pp. 4992–4995. Stolcke, 2019, DOVER: A method for combining diarization outputs, 757 Sukhbaatar, 2015, End-to-end memory networks, Proc. Adv. Neural Inf. Process. Syst., 2440 Sun, Y., Wang, X., Tang, X., 2014. 
Deep learning face representation from predicting 10,000 classes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1891–1898. Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1701–1708. Thomas, S., Ganapathy, S., Saon, G., Soltau, H., 2014. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 2519–2523. Tranter, 2004, Speaker diarisation for broadcast news, 337 Tranter, S.E., Reynolds, D.A., 2006. An Overview of Automatic Speaker Diarization Systems. Vol. 14. No. 5. pp. 1557–1565. Tranter, S.E., Yu, K., Evermann, G., Woodland, P.C., 2004. Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 753–756. Tranter, 2003 Tritschler, A., Gopinath, R.A., 1999. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In: Sixth European Conference on Speech Communication and Technology. Ustinova, 2016, Learning deep embeddings with histogram loss, Proc. Adv. Neural Inf. Process. Syst., 29, 4170 Valente, 2005 Valente, F., Motlicek, P., Vijayasenan, D., 2010. Variational Bayesian speaker diarization of meeting recordings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4954–4957. Variani, E., Lei, X., McDermott, E., Moreno, I.L., G-Dominguez, J., 2014. Deep neural networks for small footprint text-dependent speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4052–4056. 
Vijayasenan, 2009, An information theoretic approach to speaker diarization of meeting data, IEEE Trans. Audio Speech Lang. Process., 17, 1382, 10.1109/TASL.2009.2015698 Villalba, J., Chen, N., Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Borgstrom, J., Richardson, F., Shon, S., Grondin, F., et al., 2019. State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1488–1492. Vincent, 2018 Von Luxburg, 2007, A tutorial on spectral clustering, Stat. Comput., 17, 395, 10.1007/s11222-007-9033-z von Neumann, 2019, All-neural online source separation, counting, and diarization for meeting analysis, 91 Wang, 2018, 1702 Wang, P., Chen, Z., Xiao, X., Meng, Z., Yoshioka, T., Zhou, T., Lu, L., Li, J., 2019. Speech separation using speaker inventory. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 230–236. Wang, Q., Downey, C., Wan, L., Mansfield, P.A., Moreno, I.L., 2018. Speaker diarization with LSTM. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5239–5243. Wang, X., Kanda, N., Gaur, Y., Chen, Z., Meng, Z., Yoshioka, T., 2021. Exploring end-to-end multi-channel ASR with bias information for meeting transcription. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. Wang, 2020, Speaker diarization with session-level speaker embedding refinement using graph neural networks, 7109 Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., et al., 2020. CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings. In: 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020). 
Woo, 2000, Robust voice activity detection algorithm for estimating noise spectrum, Electron. Lett., 36, 180, 10.1049/el:20000192 Xiao, 2016, A technology prototype system for rating therapist empathy from audio recordings in addiction counseling, PeerJ Comput. Sci., 2, 10.7717/peerj-cs.59 Xiao, 2020 Xie, J., Girshick, R., Farhadi, A., 2016. Unsupervised deep embedding for clustering analysis. In: Proceedings of International Conference on Machine Learning. pp. 478–487. Xiong, 2016 Xue, 2020 Yoshioka, T., Abramovski, I., Aksoylar, C., Chen, Z., David, M., Dimitriadis, D., Gong, Y., Gurvich, I., Huang, X., Huang, Y., Hurvitz, A., Jiang, L., Koubi, S., Krupka, E., Leichter, I., Liu, C., Parthasarathy, P., Vinnikov, A., Wu, L., Xiao, X., Xiong, W., Wang, H., Wang, Z., Zhang, J., Zhao, Y., Zhou, T., 2019. Advances in online audio-visual meeting transcription. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 276–283. Yoshioka, T., Dimitriadis, D., Stolcke, A., Hinthorn, W., Chen, Z., Zeng, M., Xuedong, H., 2019. Meeting transcription using asynchronous distant microphones. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2968–2972. Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., Alleva, F., 2018. Recognizing overlapped speech in meetings: a multichannel separation approach using neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 3038–3042. Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Fujimoto, M., Yu, C., Fabian, W.J., Espi, M., Higuchi, T., Araki, S., Nakatani, T., 2015. The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 436–443. 
Yoshioka, 2012, Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening, 20, 2707 Yu, D., Chang, X., Qian, Y., 2017. Recognizing multi-talker speech with permutation invariant training. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2456–2460. Zajíc, Z., Kunešová, M., Radová, V., 2016. Investigation of segmentation in i-vector based speaker diarization of telephone speech. In: International Conference on Speech and Computer. pp. 411–418. Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C., 2019. Fully supervised speaker diarization. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6301–6305. Zhu, X., Barras, C., Meignier, S., Gauvain, J.-L., 2005. Combining speaker identification and BIC for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2441–2444. Zhu, 2016, Online speaker diarization using adapted i-vector transforms, 5045 Zmolikova, K., Delcroix, M., Kinoshita, K., Higuchi, T., Ogawa, A., Nakatani, T., 2017. Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2655–2659.
