AUTOMATED ITEM GENERATION APPROACHES IN EDUCATIONAL TESTING: A SYSTEMATIC REVIEW PROTOCOL AND PRELIMINARY FINDINGS
F. La Russa, R. Marzoli, A. Mastrogiovanni, A. Mattei
INVALSI (ITALY)
Similar to numerous other fields, educational assessment has been profoundly impacted by technological advancements, particularly artificial intelligence (AI). One of the key applications of AI in this domain is Automated Item Generation (AIG), which involves using computer technology to create assessment items based on predefined models. Over the past decade, AIG has been the focus of numerous studies across various disciplines. While it offers clear advantages, such as efficiently generating large volumes of test items and reducing costs associated with manual item development, several practical and ethical challenges remain.

This study presents a protocol for a systematic review aimed at analyzing and synthesizing the existing literature on AIG. Specifically, it seeks to identify the types of items generated for student performance assessment, as well as the approaches used to generate them.

The systematic review follows a structured search strategy to locate and select relevant studies, employing predefined inclusion and exclusion criteria. The screening process is documented using the PRISMA Flow Diagram. A comprehensive literature search was conducted across four databases (APA PsycINFO®, Education Source Ultimate, ERIC, and Scopus), using the Population, Concept, and Context (PCC) framework to develop the search string. The general search query included terms such as "automated item generation," "evaluation" or "assessment," and "students," adapted for each database. After deduplication, 291 records were screened in Rayyan, applying the following inclusion criteria:
1. Focus on automated item generation
2. Application in testing contexts
3. Application in educational settings
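
As an illustration only, the PCC-derived terms mentioned above could be combined into a Boolean query of roughly the following form. This is a hedged sketch, not the actual strings used: the review adapted the query to each database's syntax, and the truncation and field choices shown here are assumptions.

```
("automated item generation")
AND (evaluation OR assessment)
AND (students)
```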

No restrictions were imposed regarding study design, enabling the inclusion of theoretical, methodological, empirical, qualitative, and quantitative research. Following the screening process and collaborative resolution of conflicts, 39 studies were selected for full-text analysis and subsequently subjected to coding and synthesis. The GRADE-CERQual approach was employed to assess the methodological limitations, coherence, adequacy of data, and relevance of the included studies.

Preliminary findings highlight a growing yet heterogeneous body of research on AIG in education. Template-based and rule-based approaches remain dominant, with a notable shift toward NLP-enhanced and LLM-based techniques. Applications are concentrated in higher education, particularly in STEM and language learning, with multiple-choice items being the most frequently generated format. While several studies demonstrate high coherence and support the efficiency and scalability of AIG, GRADE-CERQual assessments reveal moderate concerns related to methodological limitations, data adequacy, and contextual relevance, especially in studies with a narrow disciplinary focus or limited cultural diversity. Empirical evidence shows that AI-generated items can rival human-authored ones in quality, though student perceptions vary. Persistent issues include ethical risks, opacity, and minimal stakeholder involvement. Future research should prioritize validation, transparency, and equity-driven design to enhance the applicability of AIG systems in educational assessment.

Keywords: Systematic review, Automated item generation, Assessment.

Event: EDULEARN25
Session: Emerging Technologies in Education
Session time: Tuesday, 1st of July from 08:30 to 13:45
Session type: POSTER