ENHANCING AI-GENERATED SINGLE BEST ANSWER QUESTIONS IN MEDICAL EDUCATION: A STUDY ON THE EFFECTIVENESS OF PROMPT CHAINING
O.S.H. Ho1, A. Markiv1, I.S.H. Ng1, S.J. Han1, C. Gasa1, A. Muthukumar2, V.W.T. Lau2, J. Sun2, M.G. Sagoo1
Single-best answer (SBA) questions are a common form of assessment in medical education, pivotal for evaluating students' knowledge and facilitating learning. Artificial intelligence (AI) is increasingly recognised as a more efficient means of generating SBA questions than manual authoring. Despite this growing recognition, there remains a critical need to explore methodologies for effectively prompting AI models to produce consistently high-quality SBA questions.
This study investigates the effectiveness of prompt chaining, compared to single-prompt methods, in enhancing the quality of AI-generated SBA questions for preclinical medical education. A prompt chaining cascade of six interconnected prompts was developed, in which the output of each prompt serves as the input to the next prompt in the chain. Each prompt was designed to construct and refine a specific aspect of the question, such as content generation, refinement, or difficulty adjustment, so that the question assesses students' lateral thinking in addition to factual recall, whilst ensuring accuracy, specificity and relevance to the source material.
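For illustration, the structure of such a cascade can be sketched as follows. This is a minimal sketch of the prompt chaining idea only, not the authors' actual prompts: the six stage descriptions and the `call_llm` helper below are hypothetical placeholders for whichever model API is used.

```python
# Minimal sketch of a six-stage prompt chain (illustrative, not the study's prompts).

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the chosen large language model."""
    raise NotImplementedError("wire this to the model provider's API")

# Illustrative stage prompts; each stage receives the previous stage's output.
CHAIN_STAGES = [
    "Extract the key learning points from the source material below:\n{input}",
    "Draft a single-best-answer question testing these points:\n{input}",
    "Refine the stem and answer options for accuracy and specificity:\n{input}",
    "Adjust the difficulty so the question requires lateral thinking, not recall:\n{input}",
    "Write a short explanation for why each option is correct or incorrect:\n{input}",
    "Check the final question against the source material and output it in full:\n{input}",
]

def generate_sba(source_material: str) -> str:
    """Run the chain: each stage's output becomes the next stage's input."""
    text = source_material
    for stage in CHAIN_STAGES:
        text = call_llm(stage.format(input=text))
    return text
```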
To evaluate the contribution of each prompt within the chain, six variants of questions were generated from the same source material, with each variant omitting one prompt from the chain. Multiple AI models, including GPT-4o, Llama 3.1, and Claude 3.5 Sonnet, were used to generate question sets for each variant of the prompt chain. A question generated using a single prompt was included as a control. A group of 30 preclinical medical students from King's College London was recruited using stratified sampling to collect diverse feedback. These students completed the AI-generated questions and subsequently evaluated the quality of each question set via a questionnaire. The evaluation focused on overall question quality, difficulty, and relevance to the source material, as well as the clarity and relevance of the explanations generated for each option. The feedback for each question set was then analysed to identify the effectiveness of each prompt in improving the question sets for assessing knowledge and lateral thinking.
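The leave-one-prompt-out design can likewise be sketched in code. Again this is an assumption-laden illustration that reuses the hypothetical `CHAIN_STAGES` and `call_llm` names from the sketch above; the variant labels are not taken from the study.

```python
# Hypothetical ablation sketch: six variants, each omitting one chain stage,
# plus a single-prompt control question.

def generate_variants(source_material: str) -> dict[str, str]:
    variants: dict[str, str] = {}
    for omitted in range(len(CHAIN_STAGES)):
        # Run the chain with one stage removed.
        partial_chain = [s for i, s in enumerate(CHAIN_STAGES) if i != omitted]
        text = source_material
        for stage in partial_chain:
            text = call_llm(stage.format(input=text))
        variants[f"omit_stage_{omitted + 1}"] = text
    # Control: a single prompt asking for the finished question directly.
    variants["single_prompt_control"] = call_llm(
        "Write a single-best-answer question with explanations for each option, "
        f"based on this material:\n{source_material}"
    )
    return variants
```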
Preliminary results show an improvement in the overall quality of questions generated by prompt chaining compared with single-prompt methods. Participants reported that questions generated by the prompt chaining method were more effective in testing lateral thinking than those generated from a single prompt.
By analysing the impact of each prompt within the chain, this study provides valuable insights into the effectiveness of prompt chaining in improving AI-generated SBA questions. Our findings open the possibility of developing adaptive systems that tailor the difficulty and content of questions to individual student performance and needs, as well as applications across other fields and educational levels.
Keywords: Single-best Answer (SBA) Question, Question Generation, Artificial Intelligence (AI), Medical Education, Prompt Chaining.