AUTOMATED GENERATION AND EVALUATION OF VOCABULARY TESTS: A LARGE LANGUAGE MODEL APPROACH
A. Nakanishi
Osaka Institute of Technology (JAPAN)
This study explores the development and validation of a multiple-choice vocabulary question system, the Contextual Adaptive Vocabulary Enhancement System (CAVES), which integrates large language models (LLMs) such as GPT and BERT to automatically generate and evaluate vocabulary questions tailored to learners' individual needs. Question generation is driven by parameters such as the target vocabulary and a difficulty level defined in the Common European Framework of Reference for Languages (CEFR); these parameters are assembled into prompts sent to the ChatGPT API. The study also explores an automatic scoring process in which BERT quantifies how appropriate each vocabulary choice is within a given context. This probabilistic approach allows for nuanced feedback that goes beyond binary (right/wrong) assessments.
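The abstract describes two LLM steps: prompt-based generation through the ChatGPT API and probabilistic scoring with BERT. As an illustration of the first step, the sketch below assembles a prompt from a target word and a CEFR level and requests an item from the API; the model name, prompt wording, and function names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the generation step; prompt wording and model
# choice are illustrative assumptions, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_item(target_word: str, cefr_level: str) -> str:
    prompt = (
        f"Write one multiple-choice vocabulary question at CEFR level "
        f"{cefr_level}. The correct answer must be '{target_word}'. "
        "Give a stem sentence with a blank and four options labeled A-D."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_item("postpone", "B1"))
```

The scoring step can be read as a masked-language-model query: mask the blank in the stem and take BERT's probability for each candidate word as its contextual fitness, which yields the graded rather than right/wrong feedback described above. This is again a sketch under stated assumptions; the base model and the single-wordpiece restriction on candidates are illustrative.

```python
# Hedged sketch of BERT-based contextual scoring; assumes each
# candidate word is a single wordpiece in the model's vocabulary.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def contextual_fit(stem: str, candidate: str) -> float:
    """Probability BERT assigns to `candidate` at the blank ('___')."""
    text = stem.replace("___", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(candidate)].item()

stem = "The committee decided to ___ the meeting until next month."
for option in ["postpone", "prolong", "detain", "suspend"]:
    print(option, round(contextual_fit(stem, option), 4))
```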

The validity and reliability of CAVES were evaluated using Cronbach’s alpha and a correlation analysis between native English speakers’ responses and the system’s automated scores. The results indicated high internal consistency of the generated questions, with a Cronbach’s alpha of 0.88, and a strong correlation (r = 0.82) between native speakers’ scores and the automated scores, reinforcing the validity of CAVES. A survey of 42 native English speakers residing in the U.S. also assessed the naturalness and consistency of the automatically generated questions; most respondents rated the questions as natural and consistent, reflecting the effectiveness of LLMs in maintaining an appropriate language level.
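For readers who want to reproduce the reliability figure on their own data, Cronbach's alpha follows directly from the item and total-score variances of a respondents-by-items score matrix. The sketch below uses simulated binary responses; the helper name and data are illustrative, not the study's.

```python
# Hedged sketch: Cronbach's alpha from a (respondents x items) matrix.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: shape (n_respondents, n_items), numeric item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(42, 20)).astype(float)  # 42 respondents
print(round(cronbach_alpha(demo), 3))
```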

The findings indicate that CAVES can autonomously generate reliable and valid multiple-choice vocabulary questions and assess proposed answers by quantifying how well they fit the surrounding context. This opens new avenues for personalized and effective vocabulary learning support that accommodates learners' diverse needs and proficiency levels. By shifting the focus from traditional assessment to learning-centric evaluation, the system holds promise for boosting vocabulary acquisition. Despite these promising results, further verification is needed to confirm the efficacy of CAVES as an evaluation tool. Future research directions include applying item response theory and item analysis to improve the discriminative power of the generated items.

Keywords: Large language models, vocabulary tests, computer-based testing.

Event: INTED2025
Session: Emerging Technologies in Education
Session time: Tuesday, 4th of March from 08:30 to 13:45
Session type: POSTER