A review of speaker diarization: Recent advances with deep learning
References
Addlesee, A., Yu, Y., Eshghi, A., 2020. A comprehensive evaluation of incremental speech recognition and diarization for conversational AI. In: Proceedings of the International Conference on Computational Linguistics. pp. 3492–3503.
Ajmera, J., Lathoud, G., McCowan, I., 2004. Clustering and segmenting speakers and their locations in meetings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 605–608.
Ajmera, 2004, Robust speaker change detection, IEEE Signal Process. Lett., 11, 649, 10.1109/LSP.2004.831666
Ajmera, J., Wooters, C., 2003. A robust speaker clustering algorithm. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 411–416.
AMI, 2009
Anguera, 2012, 356
Anguera, X., Wooters, C., Hernando, J., 2006. Purity algorithms for speaker diarization of meetings data. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. I. pp. 1025–1028.
Anguera, 2007, Acoustic beamforming for speaker diarization of meetings, IEEE Trans. Audio Speech Lang. Process., 15, 2011, 10.1109/TASL.2007.902460
Araki, S., Ono, N., Kinoshita, K., Delcroix, M., 2018. Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5694–5698.
Arora, 2020
Barker, J., Watanabe, S., Vincent, E., Trmal, J., 2018. The fifth 'CHiME' speech separation and recognition challenge: dataset, task and baselines. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1561–1565.
Blei, 2011, Distance dependent Chinese restaurant processes, J. Mach. Learn. Res., 12
Boakye, 2008, Overlapped speech detection for improved speaker diarization in multiparty meetings, 4353
Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., Haeb-Umbach, R., 2018. Front-end processing for the CHiME-5 dinner party scenario. In: Proceedings of CHiME 2018 Workshop on Speech Processing in Everyday Environments. pp. 35–40.
Bonastre, 2000, A speaker tracking system based on speaker turn detection for NIST evaluation, 1177
Bone, 2017, Signal processing and machine learning for mental health research and clinical applications, IEEE Signal Process. Mag., 34, 189, 10.1109/MSP.2017.2718581
Bozonnet, 2010, System output combination for improved speaker diarization, 2642
Brummer, N., Burget, L., Cernocky, J., Glembek, O., Grezl, F., Karafiat, M., A. van Leeuwen, D., Matejka, P., Schwarz, P., Strasheim, A., 2007. Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006. Vol. 15. No. 7. pp. 2072–2084.
Buchner, H., Aichner, R., Kellermann, W., 2005. A Generalization of Blind Source Separation Algorithms for Convolutive Mixtures Based on Second-Order Statistics. Vol. 13. No. 1. pp. 120–134.
Bullock, 2020, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, 7114
Canseco-Rodriguez, L., Lamel, L., Gauvain, J.-L., 2004. Speaker diarization from speech transcripts. In: Proceedings of the International Conference on Spoken Language Processing. Vol. 4. pp. 3–7.
Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., Wellner, P., 2006. The AMI meeting corpus: a pre-announcement. In: Proceedings of Int. Worksh. Machine Learning for Multimodal Interaction. pp. 28–39.
Carletta, 2005, The AMI meeting corpus: A pre-announcement, 28
Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., Vair, C., 2008. Stream-based speaker segmentation using speaker factors and eigenvoices. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4133–4136.
Cetin, 2006, Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap, 357
Chakravarthula, S.N., Nasir, M., Tseng, S.-Y., Li, H., Park, T.J., Baucom, B., Bryan, C.J., Narayanan, S., Georgiou, P., 2020. Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6539–6543.
Chang, X., Qian, Y., Yu, K., Watanabe, S., 2019. End-to-end monaural multi-speaker ASR system without pretraining. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6256–6260.
Chen, 1998, 127
Chen, 1998, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, 127
Chen, 2020, Continuous speech separation: Dataset and analysis, 7284
Chengalvarayan, R., 1999. Robust energy normalization using speech/nonspeech discriminator for German connected digit recognition. In: Sixth European Conference on Speech Communication and Technology.
Chiu, 2017
Chung, J.S., Huh, J., Nagrani, A., Afouras, T., Zisserman, A., 2020. Spot the conversation: speaker diarisation in the wild. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 299–303.
Chung, 2019
Comaniciu, 2002, Mean shift: A robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell., 24, 603, 10.1109/34.1000236
Dehak, 2011
Delacourt, 2000, DISTBIC: A speaker-based segmentation for audio data indexing, Speech Commun., 32, 111, 10.1016/S0167-6393(00)00027-3
Delcroix, M., Watanabe, S., Ochiai, T., Kinoshita, K., Karita, S., Ogawa, A., Nakatani, T., 2019. End-to-end SpeakerBeam for single channel target speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 451–455.
Delcroix, 2018, Single channel target speaker extraction and recognition with speaker beam, 5554
Diez, 2019, 355
Diez, 2020, Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge, 6519
Diez, 2018, Speaker diarization based on Bayesian HMM with eigenvoice priors, 147
Diez, M., Burget, L., Wang, S., Rohdin, J., Cernocký, J., 2019. Bayesian HMM based x-vector clustering for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 346–350.
Diez, M., Landini, F., Burget, L., Rohdin, J., Silnova, A., Zmolíková, K., Novotný, O., Veselý, K., Glembek, O., Plchot, O., et al., 2018b. BUT system for DIHARD speech diarization challenge 2018. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2798–2802.
Dimitriadis, 2019
Dimitriadis, D., Fousek, P., 2017. Developing on-line speaker diarization system. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2739–2743.
Drude, L., Haeb-Umbach, R., 2017. Tight integration of spatial and spectral features for BSS with deep clustering embeddings. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2650–2654.
Drude, 2018, NARA-WPE: A Python package for weighted prediction error dereverberation in NumPy and TensorFlow for online and offline processing, 1
Drugman, 2015, Voice activity detection: Merging source and filter-based information, IEEE Signal Process. Lett., 23, 252, 10.1109/LSP.2015.2495219
Du, J., Tu, Y., Sun, L., Ma, F., Wang, H., Pan, J., Liu, C., Chen, J., Lee, C., 2016. The USTC-iFlytek system for CHiME-4 challenge. In: Proceedings of CHiME-4 Workshop. pp. 36–38.
Erdogan, 2015, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, 708
Erdogan, H., Hershey, J.R., Watanabe, S., Mandel, M.I., Le Roux, J., 2016. Improved MVDR beamforming using single-channel mask prediction networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1981–1985.
Finley, G.P., Edwards, E., Robinson, A., Sadoughi, N., Fone, J., Miller, M., Suendermann-Oeft, D., Brenndoerfer, M., Axtmann, N., 2018. An automated assistant for medical scribes. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 3212–3213.
Fiscus, 1997, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), 347
Fiscus, 2007, 373
Fiscus, 2006, The rich transcription 2006 spring meeting recognition evaluation, 309
Flemotomos, N., Dimitriadis, D., 2020. A memory augmented architecture for continuous speaker identification in meetings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6524–6528.
Flemotomos, 2020, Linguistically aided speaker diarization using speaker role information, 117
Fu, 2021
Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., Watanabe, S., 2019. End-to-end neural speaker diarization with permutation-free objectives. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 4300–4304.
Fujita, 2019, End-to-end neural speaker diarization with self-attention, 296
Fujita, 2020
Galliano, S., Gravier, G., Chaubard, L., 2009. The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In: Tenth Annual Conference of the International Speech Communication Association.
Gangadharaiah, R., Narayanaswamy, B., Balakrishnan, N., 2004. A novel method for two-speaker segmentation. In: Eighth International Conference on Spoken Language Processing.
Gao, 2018, Densely connected progressive learning for LSTM-based speech enhancement, 5054
Garcia-Romero, D., Espy-Wilson, C.Y., 2011. Analysis of i-vector Length Normalization in Speaker Recognition Systems. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 249–252.
Garofolo, 2004
Gauvain, J.-L., Lamel, L., Adda, G., 1998. Partitioning and transcription of broadcast news data. In: Proceedings of the International Conference on Spoken Language Processing. pp. 1335–1338.
Gelly, 2017, Optimization of RNN-based speech activity detection, IEEE/ACM Trans. Audio Speech Lang Process., 26, 646, 10.1109/TASLP.2017.2769220
Georgiou, P.G., Black, M.P., Narayanan, S.S., 2011. Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments. In: Proceedings of the Joint ACM Workshop on Human Gesture and Behavior Understanding. pp. 7–12.
Gish, H., Siu, M., Rohlicek, R., 1991. Segregation of speakers for speech recognition and speaker identification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 873–876.
Gish, 1991, Segregation of speakers for speech recognition and speaker identification, 873
Gravier, G., Adda, G., Paulson, N., Carré, M., Giraudel, A., Galibert, O., 2012. The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In: LREC-Eighth International Conference on Language Resources and Evaluation.
Gravier, 2004, The ESTER evaluation campaign for the rich transcription of French broadcast news
Guo, A., Faria, A., Riedhammer, J., 2016. Remeeting – Deep insights to conversations. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1964–1965.
Guo, X., Gao, L., Liu, X., Yin, J., 2017. Improved deep embedded clustering with local structure preservation. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 1753–1759.
Haeb-Umbach, 2019, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, 36, 111
Han, E., Lee, C., Stolcke, A., 2021. BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7193–7197.
Han, 2020
Han, K.J., Narayanan, S.S., 2007. A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system. In: Proceedings of the Annual Conference of the International Speech Communication Association.
Haws, D., Dimitriadis, D., Saon, G., Thomas, S., Picheny, M., 2016. On the importance of event detection for ASR. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
Heigold, G., Moreno, I., Bengio, S., Shazeer, N., 2016. End-to-end text-dependent speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5115–5119.
Hershey, 2016, Deep clustering: Discriminative embeddings for segmentation and separation, 31
Heymann, 2016, Neural network based spectral mask estimation for acoustic beamforming, 196
Horiguchi, S., Fujita, Y., Watanabe, S., Xue, Y., Nagamatsu, K., 2020. End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 269–273.
Horiguchi, 2020
Huang, 2007, The IBM RT07 evaluation systems for speaker diarization on lecture meetings, 497
Huang, 2020, Speaker diarization with region proposal network, 6514
Huijbregts, 2009, The majority wins: a method for combining speaker diarization systems, 924
Ito, N., Araki, S., Yoshioka, T., Nakatani, T., 2014. Relaxed disjointness based clustering for joint blind source separation and dereverberation. In: Proceedings of International Workshop on Acoustic Echo and Noise Control. pp. 268–272.
Itu, 1996
Jain, U., Siegler, M.A., Doh, S.-J., Gouvea, E., Huerta, J., Moreno, P.J., Raj, B., Stern, R.M., 1996. Recognition of continuous broadcast news with multiple unknown speakers and environments. In: Proceedings of ARPA Spoken Language Technology Workshop. pp. 61–66.
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., Wooters, C., 2003. The ICSI meeting corpus. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. I–364–I–367.
Jiang, 2014, PLDA in the i-supervector space for text-independent speaker verification, EURASIP J. Audio Speech Music Process., 2014, 1, 10.1186/s13636-014-0029-2
Jin, Q., Laskowski, K., Schultz, T., Waibel, A., 2004. Speaker segmentation and clustering in meetings. In: Proceedings of the International Conference on Spoken Language Processing. pp. 597–600.
Kanagasundaram, 2014, i-vector based speaker recognition using advanced channel compensation techniques, Comput. Speech Lang., 28, 121, 10.1016/j.csl.2013.04.002
Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Sridharan, S., Mason, M., 2012. Weighted LDA techniques for i-vector based speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4781–4784.
Kanagasundaram, 2011, i-vector based speaker recognition on short utterances, 2341
Kanda, N., Boeddeker, C., Heitkaemper, J., Fujita, Y., Horiguchi, S., Nagamatsu, K., Haeb-Umbach, R., 2019. Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1248–1252.
Kanda, N., Chang, X., Gaur, Y., Wang, X., Meng, Z., Chen, Z., Yoshioka, T., 2021. Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings. In: Proceedings of IEEE Spoken Language Technology Workshop.
Kanda, N., Fujita, Y., Horiguchi, S., Ikeshita, R., Nagamatsu, K., Watanabe, S., 2019. Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6630–6634.
Kanda, N., Gaur, Y., Wang, X., Meng, Z., Chen, Z., Zhou, T., Yoshioka, T., 2020. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 36–40.
Kanda, N., Gaur, Y., Wang, X., Meng, Z., Yoshioka, T., 2020. Serialized output training for end-to-end overlapped speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2797–2801.
Kanda, N., Horiguchi, S., Fujita, Y., Xue, Y., Nagamatsu, K., Watanabe, S., 2019. Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 31–38.
Kanda, N., Horiguchi, S., Takashima, R., Fujita, Y., Nagamatsu, K., Watanabe, S., 2019. Auxiliary interference speaker loss for target-speaker speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 236–240.
Kanda, N., Meng, Z., Lu, L., Gaur, Y., Wang, X., Chen, Z., Yoshioka, T., 2021. Minimum Bayes risk training for end-to-end speaker-attributed ASR. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6503–6507.
Kanda, 2021
Kemp, T., Schmidt, M., Westphal, M., Waibel, A., 2000. Strategies for automatic segmentation of audio data. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 3. pp. 1423–1426.
Kenny, 2008
Kenny, 2010, Bayesian speaker verification with heavy-tailed priors
Kenny, 2005, Eigenvoice modeling with sparse training data, IEEE Trans. Speech Audio Process., 13, 345, 10.1109/TSA.2004.840940
Kenny, 2007, 1448
Kenny, 2008, 980
Kenny, 2010, 1059
Kenny, 2010, Diarization of telephone conversations using factor analysis, IEEE J. Sel. Top. Sign. Proces., 4, 1059, 10.1109/JSTSP.2010.2081790
Kinoshita, 2020, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, 381
Kinoshita, K., Delcroix, M., Tawara, N., 2021. Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7198–7202.
Kolbæk, 2017, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, 25, 1901
Kounades-Bastian, 2017, An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures, 16
Kounades-Bastian, 2017, Exploiting the intermittency of speech for joint separation and diarization, 41
Kuhn, 1955, The Hungarian method for the assignment problem, Nav. Res. Logist. Q., 2, 83, 10.1002/nav.3800020109
Kumar, 2020, Speaker diarization for naturalistic child-adult conversational interactions using contextual information, J. Acoust. Soc. Am., 147, EL196, 10.1121/10.0000736
Landini, 2020
Leeuwen, D.A.V., Konecny, M., 2007. Progress in the AMIDA speaker diarization system for meeting data. In: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007. pp. 475–483.
Li, B., Sainath, T.N., Narayanan, A., Caroselli, J., Bacchiani, M., Misra, A., Shafran, I., Sak, H., Pundak, G., Chin, K., Sim, K.C., Weiss, R.J., Wilson, K.W., Variani, E., Kim, C., Siohan, O., Weintraub, M., McDermott, E., Rose, R., Shannon, M., 2017. Acoustic modeling for Google Home. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 399–403.
Lin, Q., Hou, Y., Li, M., 2020. Self-attentive similarity measurement strategies in speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 284–288.
Lin, Q., Yin, R., Li, M., Bredin, H., Barras, C., 2019. LSTM based similarity measurement with spectral clustering for speaker diarization. In: Proc. Interspeech 2019. pp. 366–370.
Liu, D., Kubala, F., 1999. Fast speaker change detection for broadcast news transcription and indexing. In: Proceedings of the International Conference on Spoken Language Processing. pp. 1031–1034.
Liu, D., Kubala, F., 2003. A cross-channel modeling approach for automatic segmentation of conversational telephone speech. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 333–338.
Loizou, 2013
Luo, 2019, Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation, 27, 1256
Luque, 2012, On the use of agglomerative and spectral clustering in speaker diarization of meetings, 130
Maciejewski, 2018
MacQueen, 1967, Some methods for classification and analysis of multivariate observations, 281
Maekawa, K., 2003. Corpus of spontaneous Japanese: Its design and evaluation. In: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. pp. 7–12.
Malegaonkar, 2006, Unsupervised speaker change detection using probabilistic pattern matching, IEEE Signal Process. Lett., 13, 509, 10.1109/LSP.2006.873656
Mao, H.H., Li, S., McAuley, J., Cottrell, G., 2020. Speech recognition and multi-speaker diarization of long conversations. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 691–695.
Matějka, 2011, Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification, 4828
Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., Timofeeva, T., Mitrofanov, A., Andrusenko, A., Podluzhny, I., Laptev, A., Romanenko, A., 2020. Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 274–278.
Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., Timofeeva, T., Mitrofanov, A., Andrusenko, A., Podluzhny, I., et al., 2020. The STC system for the CHiME-6 challenge. In: CHiME 2020 Workshop on Speech Processing in Everyday Environments.
Meignier, 2006, Step-by-step and integrated approaches in broadcast news speaker diarization, Comput. Speech Lang., 20, 303, 10.1016/j.csl.2005.08.002
Mirheidari, 2017, Toward the automation of diagnostic conversation analysis in patients with memory complaints, J. Alzheimer’s Dis., 58, 373, 10.3233/JAD-160507
Mori, K., Nakagawa, S., 2001. Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 1. pp. 413–416.
Nagrani, 2020
Nakatani, 2010, Speech dereverberation based on variance-normalized delayed linear prediction, 18, 1717
Narayanan, 2013, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proc. IEEE, 101, 1203, 10.1109/JPROC.2012.2236291
Nemer, 2001, Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech Audio Process., 9, 217, 10.1109/89.905996
Nesta, F., Svaizer, P., Omologo, M., 2011. Convolutive BSS of short mixtures by ICA recursively regularized across frequencies. Vol. 19. No. 3. pp. 624–639.
Ng, 2001, On spectral clustering: Analysis and an algorithm, Adv. Neural Inf. Process. Syst., 14, 849
Ng, T., Zhang, B., Nguyen, L., Matsoukas, S., Zhou, X., Mesgarani, N., Veselý, K., Matějka, P., 2012. Developing a speech activity detection system for the DARPA RATS program. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1969–1972.
Ning, H., Liu, M., Tang, H., Huang, T.S., 2006. A spectral clustering approach to speaker diarization. In: Proceedings of the International Conference on Spoken Language Processing. pp. 2178–2181.
NIST, 2009
Novoselov, S., Gusev, A., Ivanov, A., Pekhovsky, T., Shulipa, A., Avdeeva, A., Gorlanov, A., Kozlov, A., 2019. Speaker diarization with deep speaker embeddings for DIHARD challenge II. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1003–1007.
Otterson, 2007, Efficient use of overlap information in speaker diarization, 683
Padmanabhan, M., Bahl, L.R., Nahamoo, D., Picheny, M.A., 1996. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 701–704.
Panayotov, 2015, LibriSpeech: an ASR corpus based on public domain audio books, 5206
Park, T.J., Georgiou, P., 2018. Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1373–1377.
Park, T.J., Han, K.J., Huang, J., He, X., Zhou, B., Georgiou, P., Narayanan, S., 2019. Speaker diarization with lexical information. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 391–395.
Park, 2019, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Process. Lett., 27, 381, 10.1109/LSP.2019.2961071
Park, T.J., Kumar, M., Narayanan, S., 2021. Multi-scale speaker diarization with neural affinity score fusion. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7173–7177.
Pfau, 2001, Multispeaker speech activity detection for the ICSI meeting recorder, 107
Raj, D., Denisov, P., Chen, Z., Erdogan, H., Huang, Z., He, M., Watanabe, S., Du, J., Yoshioka, T., Luo, Y., Kanda, N., Li, J., Wisdom, S., Hershey, J.R., 2021. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. In: Proceedings of IEEE Spoken Language Technology Workshop.
Recht, 2010, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Rev., 52, 471, 10.1137/070697835
Ren, 2016, 1137
Reynolds, 2000, Speaker verification using adapted Gaussian mixture models, Digit. Signal Process., 10, 19, 10.1006/dspr.1999.0361
Reynolds, D.A., Torres-Carrasquillo, P., 2005. Approaches and applications of audio diarization. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 953–956.
Rohlicek, J.R., Ayuso, D., Bates, M., Bobrow, R., Boulanger, A., Gish, H., Jeanrenaud, P., Meteer, M., Siu, M., 1992. Gisting conversational speech. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 113–116.
Rosenberg, A.E., Gorin, A., Liu, Z., Parthasarathy, P., 2002. Unsupervised speaker segmentation of telephone conversations. In: Proceedings of the International Conference on Spoken Language Processing. pp. 565–568.
Rougui, 2006, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast
Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., Liberman, M., 2018. The first DIHARD speech diarization challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association.
Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., Liberman, M., 2019. The second DIHARD diarization challenge: dataset, task, and baselines. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 978–982.
Ryant, 2020
Ryant, N., Liberman, M., Yuan, J., 2013. Speech activity detection on YouTube using deep neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 728–731.
Salmun, 2017, PLDA-based mean shift speakers’ short segments clustering, Comput. Speech Lang., 45, 411, 10.1016/j.csl.2017.04.006
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T., 2016. Meta-learning with memory-augmented neural networks. In: Proceedings of International Conference on Machine Learning. pp. 1842–1850.
Santoro, 2018, Relational recurrent neural networks, Proc. Adv. Neural Inf. Process. Syst, 7299
Saon, 2017
Sarikaya, 1998, Robust detection of speech activity in the presence of noise, 1455
Sawada, 2007, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, Int. Symp. Circ. Syst., 3247
Sawada, H., Araki, S., Makino, S., 2011. Underdetermined Convolutive Blind Source Separation Via Frequency Bin-Wise Clustering and Permutation Alignment. Vol. 19. No. 3. pp. 516–527.
Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J.R., 2018. A purely end-to-end system for multi-speaker speech recognition. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Vol. 1. pp. 2620–2630.
Sell, 2014, Speaker diarization with PLDA i-vector scoring and unsupervised calibration, 413
Sell, 2015, Diarization resegmentation in the factor analysis subspace, 4794
Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V., Dehak, N., Povey, D., Watanabe, S., Khudanpur, S., 2018. Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2808–2812.
Senoussaoui, 2010, An i-vector extractor suitable for speaker recognition with both microphone and telephone speech, 6
Senoussaoui, 2013, Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, 7712
Senoussaoui, 2013, A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Trans. Audio Speech Lang. Process., 22, 217, 10.1109/TASLP.2013.2285474
Shafey, 2019, Joint speech recognition and speaker diarization via sequence transduction, 396
Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., Glass, J., 2011. Exploiting intra-conversation variability for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association.
Shum, 2013, 2015
Shum, S., Dehak, N., Glass, J., 2012. On the use of spectral and iterative methods for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 482–485.
Siegler, M.A., Jain, U., Raj, B., Stern, R.M., 1997. Automatic segmentation, classification and clustering of broadcast news audio. In: Proc. DARPA Speech Recognition Workshop.
Silovsky, 2012, Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams, 118
Siu, M.-H., George, Y., Gish, H., 1992. An unsupervised, sequential learning algorithm for segmentation for speech waveforms with multiple speakers. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 189–192.
Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. Deep neural network embeddings for text-independent speaker verification. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 999–1003.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S., 2018. X-vectors: Robust DNN embeddings for speaker recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5329–5333.
Sohn, 1999, A statistical model-based voice activity detection, IEEE Signal Process. Lett., 6, 1, 10.1109/97.736233
Stafylakis, 2010, Speaker clustering via the mean shift algorithm, Recall, 2, 7
Stolcke, A., 2011. Making the most from multiple microphones in meeting recordings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4992–4995.
Stolcke, 2019, DOVER: A method for combining diarization outputs, 757
Sukhbaatar, 2015, End-to-end memory networks, Proc. Adv. Neural Inf. Process. Syst., 2440
Sun, Y., Wang, X., Tang, X., 2014. Deep learning face representation from predicting 10,000 classes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1891–1898.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1701–1708.
Thomas, S., Ganapathy, S., Saon, G., Soltau, H., 2014. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 2519–2523.
Tranter, 2004, Speaker diarisation for broadcast news, 337
Tranter, S.E., Reynolds, D.A., 2006. An overview of automatic speaker diarization systems. IEEE Trans. Audio Speech Lang. Process. 14 (5), 1557–1565.
Tranter, S.E., Yu, K., Evermann, G., Woodland, P.C., 2004. Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 753–756.
Tranter, 2003
Tritschler, A., Gopinath, R.A., 1999. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In: Sixth European Conference on Speech Communication and Technology.
Ustinova, 2016, Learning deep embeddings with histogram loss, Proc. Adv. Neural Inf. Process. Syst., 29, 4170
Valente, 2005
Valente, F., Motlicek, P., Vijayasenan, D., 2010. Variational Bayesian speaker diarization of meeting recordings. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4954–4957.
Variani, E., Lei, X., McDermott, E., Moreno, I.L., G-Dominguez, J., 2014. Deep neural networks for small footprint text-dependent speaker verification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4052–4056.
Vijayasenan, 2009, An information theoretic approach to speaker diarization of meeting data, IEEE Trans. Audio Speech Lang. Process., 17, 1382, 10.1109/TASL.2009.2015698
Villalba, J., Chen, N., Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Borgstrom, J., Richardson, F., Shon, S., Grondin, F., et al., 2019. State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 1488–1492.
Vincent, 2018
Von Luxburg, 2007, A tutorial on spectral clustering, Stat. Comput., 17, 395, 10.1007/s11222-007-9033-z
von Neumann, 2019, All-neural online source separation, counting, and diarization for meeting analysis, 91
Wang, 2018, 1702
Wang, P., Chen, Z., Xiao, X., Meng, Z., Yoshioka, T., Zhou, T., Lu, L., Li, J., 2019. Speech separation using speaker inventory. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 230–236.
Wang, Q., Downey, C., Wan, L., Mansfield, P.A., Moreno, I.L., 2018. Speaker diarization with LSTM. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5239–5243.
Wang, X., Kanda, N., Gaur, Y., Chen, Z., Meng, Z., Yoshioka, T., 2021. Exploring end-to-end multi-channel ASR with bias information for meeting transcription. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding.
Wang, 2020, Speaker diarization with session-level speaker embedding refinement using graph neural networks, 7109
Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., et al., 2020. CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings. In: 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).
Woo, 2000, Robust voice activity detection algorithm for estimating noise spectrum, Electron. Lett., 36, 180, 10.1049/el:20000192
Xiao, 2016, A technology prototype system for rating therapist empathy from audio recordings in addiction counseling, PeerJ Comput. Sci., 2, 10.7717/peerj-cs.59
Xiao, 2020
Xie, J., Girshick, R., Farhadi, A., 2016. Unsupervised deep embedding for clustering analysis. In: Proceedings of International Conference on Machine Learning. pp. 478–487.
Xiong, 2016
Xue, 2020
Yoshioka, T., Abramovski, I., Aksoylar, C., Chen, Z., David, M., Dimitriadis, D., Gong, Y., Gurvich, I., Huang, X., Huang, Y., Hurvitz, A., Jiang, L., Koubi, S., Krupka, E., Leichter, I., Liu, C., Parthasarathy, P., Vinnikov, A., Wu, L., Xiao, X., Xiong, W., Wang, H., Wang, Z., Zhang, J., Zhao, Y., Zhou, T., 2019. Advances in online audio-visual meeting transcription. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 276–283.
Yoshioka, T., Dimitriadis, D., Stolcke, A., Hinthorn, W., Chen, Z., Zeng, M., Xuedong, H., 2019. Meeting transcription using asynchronous distant microphones. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2968–2972.
Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., Alleva, F., 2018. Recognizing overlapped speech in meetings: a multichannel separation approach using neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 3038–3042.
Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Fujimoto, M., Yu, C., Fabian, W.J., Espi, M., Higuchi, T., Araki, S., Nakatani, T., 2015. The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 436–443.
Yoshioka, 2012, Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening, 20, 2707
Yu, D., Chang, X., Qian, Y., 2017. Recognizing multi-talker speech with permutation invariant training. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2456–2460.
Zajíc, Z., Kunešová, M., Radová, V., 2016. Investigation of segmentation in i-vector based speaker diarization of telephone speech. In: International Conference on Speech and Computer. pp. 411–418.
Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C., 2019. Fully supervised speaker diarization. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6301–6305.
Zhu, X., Barras, C., Meignier, S., Gauvain, J.-L., 2005. Combining speaker identification and BIC for speaker diarization. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2441–2444.
Zhu, 2016, Online speaker diarization using adapted i-vector transforms, 5045
Zmolikova, K., Delcroix, M., Kinoshita, K., Higuchi, T., Ogawa, A., Nakatani, T., 2017. Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. In: Proceedings of the Annual Conference of the International Speech Communication Association. pp. 2655–2659.