Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge

Javier Tejedor1, Doroteo T. Toledano2
1Institute of Technology, Universidad San Pablo-CEU, CEU universities, Boadilla del Monte, Spain
2AUDIAS, Electronics and Communication Technology Department, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain

Abstract

The vast amount of information stored in audio repositories makes it necessary to develop efficient, automatic methods for searching audio content. To that end, search on speech (SoS) has received much attention in recent decades. To encourage the development of automatic systems, the ALBAYZIN evaluations have included a search on speech challenge since 2012. This challenge releases several databases covering different acoustic domains (e.g., spontaneous speech from TV shows, conference talks, and parliament sessions), with the goal of building automatic systems that retrieve a set of terms from those databases. This paper presents a baseline system based on the Whisper automatic speech recognizer for the spoken term detection task of the search on speech challenge held in 2022 within the ALBAYZIN evaluations. The baseline system will be released with this publication and provided to participants in the upcoming SoS ALBAYZIN evaluation in 2024. In addition, several analyses based on term properties (in-language versus foreign terms, and single-word versus multi-word terms) are carried out to show how well Whisper retrieves terms with specific properties. Although the results for some databases are far from perfect (e.g., for the broadcast news domain), this Whisper-based approach has obtained the best results on the challenge databases so far. It therefore provides a strong baseline for the upcoming challenge and encourages participants to improve on it.
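As a rough illustration of the spoken term detection task described above, the following Python sketch searches a timestamped word sequence for single-word and multi-word terms. It is not the released baseline: the names `Word`, `Detection`, `normalize`, and `detect_terms` are hypothetical, the input is assumed to come from a recognizer front-end that already supplies word-level timestamps and confidences, and taking the minimum word confidence as the score of a multi-word detection is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word (hypothetical container for ASR output)."""
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # recognizer confidence, assumed to lie in [0, 1]


@dataclass
class Detection:
    """One occurrence of a search term in the audio."""
    term: str
    start: float
    end: float
    score: float


def normalize(token: str) -> str:
    """Lowercase and strip surrounding punctuation, so 'Madrid,' matches 'madrid'."""
    return token.lower().strip(".,;:!?¿¡\"'()")


def detect_terms(words: List[Word], terms: List[str]) -> List[Detection]:
    """Return every exact occurrence of each term in the word sequence."""
    tokens = [normalize(w.text) for w in words]
    detections: List[Detection] = []
    for term in terms:
        query = [normalize(t) for t in term.split()]
        n = len(query)
        if n == 0:
            continue  # skip empty terms
        # Slide a window of n words over the transcript.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == query:
                span = words[i:i + n]
                detections.append(Detection(
                    term=term,
                    start=span[0].start,
                    end=span[-1].end,
                    # Assumption: a multi-word hit is as reliable as its weakest word.
                    score=min(w.score for w in span),
                ))
    return detections
```

A single-word term is simply the case where the window length is one. Exact string matching of this kind cannot recover terms that the recognizer transcribes differently, which is one plausible reason why foreign terms are analyzed separately from in-language ones.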
