S. Pöysäri, N. Siltala, J. Latokartano
The adoption of Artificial Intelligence (AI) in educational settings is rapidly increasing, with particular interest in its potential to automate time-intensive tasks such as creating multiple-choice exam questions. While generative AI chatbots, such as Microsoft Copilot and ChatGPT, have shown promise in generating content, their suitability for domain-specific assessments remains underexplored. This study evaluates the effectiveness of large language models (LLMs) in generating multiple-choice exam questions for industrial robotics education, focusing on their practicality and limitations. The study also provides guidelines on how to use AI effectively and what kinds of prompts to give it when generating multiple-choice exam questions.
Using a qualitative research methodology, three university instructors specializing in robotics used Copilot, ChatGPT, and ChatGPT's “Moodle GIFT short answer generator” to generate multiple-choice exam questions for the Moodle platform in GIFT format. The correct answers had to come from selected chapters of a course textbook, which were given as input to the LLM, whereas the wrong answers could come from any source as long as they did not conflict with the information in the provided textbook. The study evaluated the AI-generated questions on key criteria such as content relevance, alignment with the provided background material, ease of generation without extensive teacher modifications, and the variety and quality of incorrect answers. The questions were also assessed for how well they matched the desired difficulty level.
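For reference, a minimal multiple-choice item in Moodle's GIFT syntax has the following shape; the question content below is a hypothetical illustration, not an item generated or used in the study:

    // Hypothetical example item: "=" marks the correct answer, "~" marks distractors.
    ::Robot axes::How many axes does a typical industrial articulated robot arm have? {
      =Six
      ~Two
      ~Three
      ~Ten
    }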
The results revealed several limitations in the AI's ability to generate high-quality multiple-choice exam questions. The AI often repeated similar question structures over time, reducing the overall diversity and breadth of the questions. Additionally, it struggled to create questions that evenly covered all aspects of the provided material, occasionally generating questions that strayed beyond the scope of the given content. The AI also had difficulty adhering to the desired question syntax, which led to inconsistencies and required additional effort to correct. Moreover, the incorrect answers were often too easy to distinguish from the correct ones, which made the questions overly simple and failed to adequately measure the test-taker's understanding. These limitations collectively necessitated significant teacher intervention to review, revise, and adapt the AI-generated questions to make them sufficiently challenging, contextually appropriate, and aligned with the learning objectives. Despite these challenges, the tools enabled considerable time savings, and the overall process required less effort than crafting questions entirely from scratch, provided the prompts to the AI were given in an appropriate form.
The study underscores the potential of AI tools like ChatGPT in assisting with multiple-choice exam question generation while highlighting critical areas for improvement. AI-generated questions required substantial human oversight to ensure quality. For robotics education, where technical understanding is essential, AI-generated questions served more as a preliminary draft than a final product. Future improvements in AI question generation should address issues of repetition and answer variability. While AI tools can reduce workload, they currently function best as collaborative tools that complement, rather than replace, human expertise.
Keywords: Artificial intelligence, education, assessment, robotics.