OPTIMISATION OF ROBUSTNESS IN ONLINE ASSESSMENT AGAINST AI CHATBOTS IN STEM STUDY PROGRAMMES
J. Knaut, M. Wiehl, M. Altieri
In this contribution, we explore different digital question designs for online assessments in the domains of computer science and mathematics, focusing on how difficult they are for AI chatbots to solve. We evaluate the questions and results of a competency test designed for a master's programme at the University of Applied Sciences OTH Amberg-Weiden.
The test, implemented as a quiz in Moodle, contains digital questions developed with advanced STEM assessment tools such as STACK, JSXGraph and CodeRunner. The questions combine different content formats (text, images and interactive diagrams) and offer various input types for mathematical objects and program code, including interactive graphical input such as dragging a point in a coordinate system.
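As an illustration of such a graphical input, the following sketch shows a JSXGraph board with a draggable point and a dependent value that is recomputed on every drag. It is illustrative only, not the actual question code: the board id 'jxgbox', the element names and the displayed distance are assumptions for this example.

```typescript
// Sketch of an interactive graphical input in the style of a JSXGraph question
// (illustrative only). Assumes the JSXGraph library is loaded globally and a
// <div id="jxgbox"> exists on the page.
declare const JXG: any;

const board = JXG.JSXGraph.initBoard('jxgbox', {
  boundingbox: [-5, 5, 5, -5], // visible coordinate range [xmin, ymax, xmax, ymin]
  axis: true,
});

// A point the student can drag; its final coordinates would be read out as the
// answer by the surrounding question type (e.g. STACK with an embedded JSXGraph block).
const origin = board.create('point', [0, 0], { fixed: true, visible: false });
const p = board.create('point', [1, 1], { name: 'P', size: 3 });

// A dependent, dynamically updated value (here: distance of P from the origin),
// re-evaluated on every drag -- the kind of continuous feedback a chatbot cannot observe.
board.create('text', [-4.5, 4.5, () => `d = ${p.Dist(origin).toFixed(2)}`]);
```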
We analyse the test results of 600 participants, using their scores and answers to 26 different questions, which are grouped according to different design criteria. Descriptive statistics and question-quality indicators, such as the facility index and discriminative efficiency, provide an overview of the results. In addition, we present all questions as prompts to different AI chatbots, including ChatGPT, Gemini, and DeepSeek, and compare their performance across the question groups.
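For reference, the facility index used here is the standard one from Moodle's quiz statistics: the mean mark achieved on a question, expressed as a percentage of its maximum mark; discriminative efficiency, roughly speaking, relates a question's observed discrimination to the maximum it could have achieved. A minimal formulation of the facility index:

```latex
% Facility index of question i: mean mark as a percentage of the maximum mark.
\[
  F_i = \frac{\bar{x}_i}{x_{i,\max}} \times 100\,\%
\]
% where \bar{x}_i is the mean score achieved on question i and x_{i,\max} its
% maximum achievable score; higher values indicate easier questions.
```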
Our findings highlight the limitations of current AI models in handling complex question types that involve interaction with graphics, particularly when feedback from continuous interaction is required to solve the task. For example, when a line has to be adjusted until a dependent, dynamically updated value is minimised, the AI has incomplete information and can only suggest possible next steps. The use of graphics can generally lead to inaccurate analysis by the AI, for example when positions in a coordinate system are only approximately recognised. In some cases, arranging graphical elements in an equivalent but less conventional way that is harder to analyse also leads to incorrect solutions. Furthermore, AI models tend to add extra testing code when answering programming tasks; this additional code produces output that is not required by the task and can be detected during automatic assessment.
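The last point can be illustrated with a small sketch in the spirit of exact-output grading as used by tools such as CodeRunner (not the authors' actual test template; the expected value, the sample outputs and the helper names are assumptions): a submission that appends its own demonstration output fails the comparison even if the required result is correct.

```typescript
// Sketch of an output-based check: the task requires exactly one line of output,
// so any extra "self-testing" lines appended by a chatbot make the comparison fail.

function normalise(output: string): string {
  // Ignore surrounding blank lines and trailing whitespace per line when comparing.
  return output.trim().split('\n').map((line) => line.trimEnd()).join('\n');
}

function outputsMatch(expected: string, actual: string): boolean {
  return normalise(expected) === normalise(actual);
}

// Required output of the task: a single line with the computed result.
const expected = '42';

// Typical chatbot answer: correct function, but with added test code that prints
// extra lines such as "Test case 1: passed" before the required value.
const chatbotOutput = 'Test case 1: passed\nTest case 2: passed\n42';

console.log(outputsMatch(expected, expected));      // true  -> accepted
console.log(outputsMatch(expected, chatbotOutput)); // false -> extra output is flagged
```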
Based on the results, we propose new strategies for designing questions that are more challenging for AI, with the aim of improving the robustness and reliability of online assessments.
Keywords: Online assessment, chatbots, STEM.