A CODE-SWITCHED CORPUS FOR INTELLIGENT TUTORING SYSTEMS IN BREAST CANCER EDUCATION
A. Rivera¹, F. Iacobelli², R. Balyan¹
Intelligent Tutoring Systems (ITSs) simulate human tutors using constructivist strategies that monitor learner progress, pose challenging questions, elicit deep explanations, and deliver targeted feedback. These systems are scalable and effective educational tools, particularly valuable in healthcare contexts for low-literacy populations. However, most ITSs are trained and modeled on monolingual data and overlook bilingual learners who frequently code-switch. Code-switching is the fluid practice of alternating between two languages within a conversation, often to express ideas more naturally or to compensate for limited vocabulary in one language. Code-switching is common among speakers with limited English fluency or with English as a second language, and when resources do not accommodate it, the result is often disengagement from health-related resources and education.
Considering the health impacts that disengagement may have, this study focuses specifically on breast cancer survivors within the Hispanic community. This choice is driven by two key reasons: first, breast cancer remains one of the leading causes of death among the Hispanic population, and second, code-switching is highly prevalent among Spanish-speaking individuals. Effectively training an ITS to support this demographic requires specialized code-switching corpora focused on breast cancer. Although multilingual datasets exist, they often lack linguistic accuracy and do not contain health-specific content, which makes it challenging to develop responsive systems for bilingual Spanish-speaking communities.
To address this gap, the study develops a bilingual breast cancer health corpus by generating code-switched text from aligned English and Spanish transcriptions. We identify switch points by tagging parts of speech (PoS) with spaCy and apply both rule-based and probabilistic code-switching strategies. The Google Translate API is used wherever translation is needed, and post-processing is carried out with ChatGPT-4.0 to ensure the fluency of the generated code-switched output. The resulting corpus is evaluated using the LINCE code-switching benchmark with customized metrics for multilingual data quality. This resource aims to support the creation of models for culturally adaptive ITSs for underserved Spanish-speaking populations, particularly for breast cancer education and survivorship.
Keywords: Code-switching, bilingual corpus, breast cancer.
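
The switch-point step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-to-one word alignment between the English and Spanish sentences (real alignments are rarely this clean), hand-assigned PoS tags in place of spaCy output, and a hypothetical rule that only nouns are eligible switch points. The names `SWITCHABLE_POS` and `code_switch` are invented for illustration.

```python
import random

# Hypothetical rule: only tokens with these PoS tags may be switched.
SWITCHABLE_POS = {"NOUN"}


def code_switch(en_tagged, es_words, mode="rule", p=0.5, rng=None):
    """Generate a code-switched sentence from an aligned sentence pair.

    en_tagged: list of (English word, PoS tag) pairs.
    es_words:  list of Spanish words, assumed aligned one-to-one with en_tagged.
    mode:      "rule" switches every eligible token;
               "prob" switches each eligible token with probability p.
    """
    if len(en_tagged) != len(es_words):
        raise ValueError("this sketch assumes a one-to-one word alignment")
    if mode not in ("rule", "prob"):
        raise ValueError("mode must be 'rule' or 'prob'")
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible

    output = []
    for (en_word, pos), es_word in zip(en_tagged, es_words):
        eligible = pos in SWITCHABLE_POS
        if mode == "rule":
            switch = eligible
        else:
            switch = eligible and rng.random() < p
        output.append(es_word if switch else en_word)
    return output


# Toy aligned pair with hand-assigned tags (in practice these would come from a tagger).
en_tagged = [("The", "DET"), ("doctor", "NOUN"), ("recommended", "VERB"),
             ("a", "DET"), ("mammogram", "NOUN")]
es_words = ["La", "doctora", "recomendó", "una", "mamografía"]

rule_based = code_switch(en_tagged, es_words, mode="rule")
print(" ".join(rule_based))
print(" ".join(code_switch(en_tagged, es_words, mode="prob", p=0.5)))
```

In this sketch the rule-based mode is deterministic, while the probabilistic mode varies the density of switches through `p`. Whether the authors' strategies work this way is not stated in the abstract.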