Systematic Literature Review of Speaker Diarization Techniques: Toward Bridging Gaps in Low-resourced Languages Using Machine Learning

Authors

  • Mohd Zulhafiz Rahim, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak
  • Sarah Samson Juan, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia, http://orcid.org/0000-0002-9590-1666
  • Syahrul Nizam Junaini, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, https://orcid.org/0000-0001-7193-8862

Keywords:

Deep neural network, Low-resourced, Machine learning, Speaker diarization, x-vectors

Abstract

Speaker diarization, the process of segmenting audio into speaker-specific regions, plays a critical role in various speech technologies by determining "who spoke when" in a conversation. The technique is particularly valuable for enhancing automatic speech recognition (ASR) and conversational artificial intelligence systems. However, its application to low-resourced languages remains underexplored, limiting not only the performance of speaker diarization for these languages but also the advancement of their ASR, since speaker diarization enables the speaker adaptation that is crucial for maximizing ASR performance. The lack of digital resources for speaker diarization in low-resourced languages, together with the scarcity of implementations, widens the gap between low-resourced and widely spoken languages in the advancement of speech technologies. This paper focuses on Sarawak Malay, a low-resourced language, and presents conversational data collected through a crowd-sourced approach that still lacks speaker-turn annotations and transcripts. These missing annotations create challenges for building accurate acoustic models. To address this, we conducted a systematic review of recent speaker diarization research and related machine learning techniques. Following the PRISMA methodology, we reviewed 42 articles published between 2018 and 2023. Our findings identify key machine learning models, such as i-vectors and x-vectors, and open-source tools such as Pyannote, which offer promising advances in diarization performance. These tools have also shown potential for developing speaker diarization models for low-resourced languages. By highlighting the gaps in current research on low-resourced languages, we provide a pathway for improving speaker diarization models in these underrepresented languages through machine learning techniques.
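As context for the embedding-based approaches surveyed (i-vectors, x-vectors), the speaker-clustering step such pipelines rely on can be sketched in a simplified form. The snippet below is illustrative only, not any reviewed system's implementation: the 4-dimensional toy vectors are hypothetical stand-ins for real segment embeddings, and the greedy threshold clustering is a simplification of the agglomerative clustering typically used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each segment embedding to the most similar existing speaker
    centroid if similarity reaches the threshold, else start a new speaker.
    Returns one speaker label per segment."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
        else:
            # Simplified running-mean centroid update
            centroids[best] = [(a + b) / 2 for a, b in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Toy segment embeddings: three resemble one speaker, one resembles another
segments = [
    (0.90, 0.10, 0.00, 0.00),
    (0.88, 0.12, 0.00, 0.00),
    (0.00, 0.00, 0.95, 0.05),
    (0.91, 0.09, 0.00, 0.00),
]
print(greedy_cluster(segments))  # -> [0, 0, 1, 0]
```

In real diarization systems the embeddings come from a trained extractor (e.g. an x-vector network) over detected speech segments, and clustering is usually agglomerative with a tuned stopping criterion rather than a single fixed threshold.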



Published

02-01-2025

How to Cite

Rahim, M. Z., Juan, S. S., & Junaini, S. N. (2025). Systematic Literature Review of Speaker Diarization Techniques: Toward Bridging Gaps in Low-resourced Languages Using Machine Learning. Applications of Modelling and Simulation, 9, 22–36. Retrieved from https://www.ojs.arqiipubl.com/index.php/AMS_Journal/article/view/801

Section

Articles
