E. Adegbegha¹, A. Rivera¹, F. Iacobelli², R. Balyan¹
Automatic Speech Recognition (ASR) has proven effective for processing monolingual speech, but it still struggles to handle code-switching (CS) accurately. CS is a linguistic phenomenon in which a speaker alternates between two or more languages or dialects within a conversation; the switch may occur between sentences, between sentence fragments, or between individual words. ASR’s inability to handle CS creates a barrier for learners who are likely to code-switch while using voice-enabled learning technologies. It reinforces language barriers, particularly for learners from diverse linguistic backgrounds, and it limits the feedback they receive from systems such as Intelligent Tutoring Systems (ITSs). ITSs simulate the role of a human tutor in a specific topic or subject: they employ constructivist teaching methods, monitor learner progress, pose complex questions, prompt thorough explanations, and deliver feedback. ASR plays a crucial role in ITSs. However, most ASR systems are trained on monolingual data, which leaves them poorly suited to learners from diverse linguistic backgrounds. Our research aims to develop a culturally aware tutoring system designed to teach breast cancer survivorship skills to low-literacy Hispanic breast cancer survivors through voice interactions. Our goal is to use ITSs as a scalable alternative form of instruction for underserved populations, helping to bridge inequity gaps. Our target population consists of bilingual speakers of English and Spanish who are likely to code-switch, so we aim to use a multilingual ASR system that handles CS accurately in order to improve their learning experience. Our previous study found that OpenAI’s Whisper ASR models performed well on monolingual audio, including English and Spanish. This study focuses on collecting code-switched data and evaluating Whisper’s performance on it.
Using Whisper, we conduct experiments on CS data consisting of casual and interview-like conversations. According to the available automatic evaluation metrics for ASR, Whisper is not a very reliable transcription tool for CS. However, human evaluation shows that these metrics, including those used in our experiments, do not always accurately represent the performance of multilingual ASR on code-switched speech. We therefore propose to develop a new automatic metric, tailored to evaluating multilingual ASR accuracy on code-switching, that uses features ignored by existing evaluation metrics.
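The abstract does not name the automatic metrics used. As an illustration only, the sketch below assumes word error rate (WER), one widely used ASR metric, and shows how a purely lexical comparison can penalize a transcript that renders a code-switched word in the other language. The function name and the example sentences are hypothetical and are not taken from the study's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER stands in for the unnamed metrics in the abstract.
    Text is lowercased and split on whitespace; no other normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance: the speaker says "tienda",
# and the transcript renders it as the English word "store".
reference = "I went to the tienda yesterday"
hypothesis = "I went to the store yesterday"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.167
```

In this invented example the transcript preserves the meaning, yet the score counts a full substitution error. A human evaluator might judge the output acceptable, which is the kind of mismatch between automatic and human evaluation described above.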
Keywords: Automatic Speech Recognition, code-switching, evaluation metric.