A. Dannecker, L. Meyer
University of Applied Sciences and Arts Northwestern Switzerland FHNW (SWITZERLAND)
Building on previous research on generative AI in education, this paper extends the exploration of automated question bank creation for the IPMA Level D Project Management certification. Our earlier work demonstrated the feasibility of using ChatGPT to generate high-quality, LMS-compatible questions efficiently. This study investigates a novel approach that combines multiple Large Language Models (LLMs) for question generation and integrates a secondary AI system for automated quality control and correction.
We systematically compare three leading LLMs as question generators, each tasked with producing open-ended and single-choice questions, complete with detailed answers and source references, from the first chapter of a 383-page, 29-chapter study book. The outputs of each generator are then evaluated and, where necessary, corrected by an independent AI model designed to detect and resolve content inaccuracies, ambiguous phrasing, structural inconsistencies, and formatting errors, ensuring direct compatibility with Moodle’s XML import format.
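To make the format constraint concrete, the minimal sketch below (Python, standard library only) checks the kind of single-choice question skeleton the reviewer must deliver. The element names follow Moodle’s documented XML import format; the question content and the validation step are illustrative and not taken from the study.

```python
import xml.etree.ElementTree as ET

# Minimal Moodle XML skeleton for one single-choice question.
# Element names follow Moodle's documented import format; the
# question content itself is purely illustrative.
MOODLE_XML = """\
<quiz>
  <question type="multichoice">
    <name><text>IPMA D - Chapter 1 (sample)</text></name>
    <questiontext format="html">
      <text><![CDATA[<p>Which document formally authorizes a project?</p>]]></text>
    </questiontext>
    <answer fraction="100"><text>The project charter</text></answer>
    <answer fraction="0"><text>The stakeholder register</text></answer>
    <answer fraction="0"><text>The risk log</text></answer>
    <single>true</single>
  </question>
</quiz>
"""

# Well-formedness is the baseline the AI reviewer must guarantee
# before a generated file can be imported into Moodle.
ET.fromstring(MOODLE_XML)  # raises ET.ParseError if the XML is malformed
```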
A distinctive aspect of this research is the full-factorial experimental design, in which all possible combinations of the three generators and the AI-based reviewers are tested (see the sketch after the metrics list below). This allows us to assess not only the individual performance of each LLM in producing certification-relevant questions but also the effectiveness of different AI reviewers in improving output quality and reducing post-processing effort. Metrics include:
- The proportion of questions requiring no further manual intervention after AI review.
- The extent and nature of adjustments still needed by human experts.
- The time savings achieved compared to fully manual question creation.
- Consistency and quality differences across generator–reviewer pairings.
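As a rough illustration of this design (not the study’s actual harness), the short Python sketch below enumerates every generator–reviewer condition. The model identifiers are hypothetical, and the assumption that three reviewer models are crossed with the three generators is ours for illustration, since the abstract does not fix the number of reviewers.

```python
from itertools import product

# Hypothetical placeholders: the study does not name the three
# generator LLMs or the reviewer models.
GENERATORS = ["gen_model_1", "gen_model_2", "gen_model_3"]
REVIEWERS = ["rev_model_1", "rev_model_2", "rev_model_3"]

# Full-factorial design: every generator is crossed with every reviewer.
# Each condition would then be scored on the four metrics above
# (no-intervention rate, nature of remaining edits, time savings,
# cross-pairing consistency); here we only enumerate the conditions.
for generator, reviewer in product(GENERATORS, REVIEWERS):
    print(f"condition: {generator} -> reviewed by {reviewer}")
```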
Preliminary results indicate notable variations in both content quality and structural adherence depending on the chosen generator–reviewer combination. Some pairings deliver questions nearly ready for deployment with minimal human intervention, while others exhibit systematic weaknesses, for example in reference accuracy or structural compliance.
The study also examines the scalability and reliability of the AI-based review process itself, providing insights into how a multi-LLM and multi-reviewer setup can optimize both quality and efficiency in educational content generation. Future work will include user-centered evaluation of the resulting question banks in real-world certification preparation contexts, focusing on learner satisfaction, perceived usefulness, and actual exam performance.
These findings underscore the potential of orchestrating multiple generative and corrective AI models to enhance the creation of certification materials, balancing reduced workload with high standards of content quality and technical conformity.
Keywords: Generative AI, large language models, certification preparation, question generation, AI-based quality control.