EXPLORING THE USE OF NATURAL LANGUAGE PROCESSING IN THE TRANSLATION OF DIALECTS USING LIMITED DATA: A CASE STUDY OF THE HOKKIEN DIALECT
Z.J. Wong1, W.X. Koh1, K.Y.T. Lim2
Speech-to-speech translation (S2ST) technologies have improved significantly in recent years, offering promising solutions for overcoming linguistic barriers. However, their application to low-resource languages remains largely unexplored in practice. One such language is Hokkien, a Southern Min dialect spoken by millions in Taiwan and Southeast Asia. Due to its rich tonal system, phonetic diversity, and lack of standardized linguistic resources, developing S2ST for Hokkien presents unique challenges across all three stages of S2ST: automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS).
This study proposes a novel S2ST system for low-resource languages, with Hokkien as its focus. The system integrates four essential components: (1) ASR to transcribe spoken Hokkien into Tâi-lô orthography, (2) MT to translate Tâi-lô text into an intermediate language, Chinese, (3) MT to translate Chinese into English, and (4) TTS to generate natural-sounding English speech. All models are based on the Transformer architecture, an attention-based encoder-decoder neural network. Given the scarcity of publicly available training resources, the system leverages pre-trained models for ASR, Chinese-to-English MT, and English TTS from Hugging Face, while addressing a critical gap by developing a custom MT model for transliterating Tâi-lô into Chinese text.
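The cascaded design above can be sketched as a composition of the four stages. The snippet below is a minimal illustration only: the stage functions are passed in as callables, and the function names and wiring are assumptions for exposition, not the authors' actual implementation.

```python
from typing import Callable

def s2st_pipeline(
    asr: Callable[[bytes], str],        # (1) Hokkien audio -> Tâi-lô text
    tailo_to_zh: Callable[[str], str],  # (2) custom MT: Tâi-lô -> Chinese
    zh_to_en: Callable[[str], str],     # (3) pre-trained MT: Chinese -> English
    tts: Callable[[str], bytes],        # (4) pre-trained TTS: English -> audio
) -> Callable[[bytes], bytes]:
    """Compose the four S2ST stages into one audio-to-audio function."""
    def translate(audio: bytes) -> bytes:
        tailo = asr(audio)        # transcribe spoken Hokkien into Tâi-lô
        zh = tailo_to_zh(tailo)   # bridge through the intermediate language
        en = zh_to_en(zh)         # translate the intermediate text to English
        return tts(en)            # synthesize English speech
    return translate
```

In practice each callable would wrap a Transformer model (e.g. a Hugging Face pipeline); composing them this way keeps each stage independently replaceable, which matters when only one stage (Tâi-lô-to-Chinese MT) must be trained from scratch.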
To overcome data scarcity, we employ data augmentation techniques to expand the training dataset and mimic real-life inputs. The custom MT model is trained on the augmented corpus to improve transliteration accuracy, leveraging linguistic patterns and contextual embeddings. Our evaluation includes quantitative assessments, such as BLEU and CER scores for translation and transcription accuracy, as well as qualitative human evaluation of the naturalness of the synthesized speech.
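Of the metrics mentioned, character error rate (CER) is the simplest to state precisely: the Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. A minimal reference implementation (a sketch, not the authors' evaluation code) might look like:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = reference, hypothesis
    # prev[j] holds the edit distance between r[:i-1] and h[:j],
    # rolled row by row to keep memory at O(len(h)).
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i] + [0] * len(h)
        for j, hc in enumerate(h, 1):
            curr[j] = min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (rc != hc),  # substitution (free if chars match)
            )
        prev = curr
    return prev[len(h)] / max(len(r), 1)
```

Because CER operates on characters rather than words, it suits Tâi-lô transcription and Chinese intermediate text, where word segmentation is itself nontrivial.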
This study also proposes the use of Chinese as an intermediate language when translating Tâi-lô orthography into English, because parallel datasets exist for the Tâi-lô–Chinese pair but not for Tâi-lô–English. Using an intermediate language culturally close to Hokkien also reduces the loss of meaning during translation, as Hokkien and Chinese share similar cultural and linguistic contexts.
Preliminary results show acceptable transcription accuracy and contextually coherent translations. These findings demonstrate the feasibility of developing S2ST systems for low-resource languages and contribute to the broader field of multilingual technologies. Beyond its technical contributions, this work plays a role in preserving Hokkien's linguistic heritage, improving access to resources in Hokkien, and supporting future digital inclusion efforts for low-resource languages. Future research will explore further optimization of the pipeline through reinforcement learning and extension of the approach to other dialects within the Southern Min language family.
Keywords: Speech-to-Speech Translation, Low-Resource Languages, Hokkien, Machine Translation, Automatic Speech Recognition, Text-to-Speech, Transformer Architecture.