AUTOMATIC QUESTION GENERATION FROM LECTURE MATERIAL: COMPARING CHATGPT TO AN INTEGRATED OPEN-SOURCE SOLUTION
A. Ekiz, S. Schlögl, S. Kruschel
MCI - The Entrepreneurial School (AUSTRIA)
Chatbots have become increasingly important in recent years. They simulate human conversations, retrieve information, and take over tasks, thereby relieving humans in many areas of life. In academic education, chatbots are of particular use in the form of so-called “study buddies”, which aim to help students prepare for exams and reflect on learned content by asking relevant questions. While several such chatbot tools exist that universities could use for these purposes, their implementation is often time-consuming and inefficient, particularly if education providers and/or teaching professionals want to control the learning material that serves as the basis for study sessions. The biggest challenge here is to fill these chatbot systems with an appropriate question pool and keep it up to date so as to ensure the validity of the training.

To this end, we report on an investigation that evaluated existing open-source natural language processing (NLP) tools as a means to generate template questions from teaching content. Based on a number of open-source libraries, we developed a tool that automatically generates questions from provided lecture notes and slide decks (a simplified sketch of such a pipeline is given below). To evaluate our tool, we checked the generated questions for both quality and validity. Two criteria were decisive:
(1) to what extent the questions were content-related, and
(2) to what extent they could actually be answered using the provided teaching material. The goal was to generate questions that corresponded to the learning content rather than questions invented by a generic language model. Finally, a comparison was made with a set of questions created by ChatGPT using the GPT-4 transformer model.
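The abstract does not name the specific libraries our tool builds on. The following minimal sketch, written in Python, merely illustrates how template-question generation from slide text could look with publicly available open-source components; the Hugging Face transformers pipeline and the "valhalla/t5-small-e2e-qg" checkpoint (including its prompt and separator conventions) are assumptions for illustration, not the actual implementation.

# Minimal sketch: generate candidate open-ended questions for one slide's text.
# The model checkpoint and its input/output format are illustrative assumptions.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-small-e2e-qg")

def generate_questions(passage: str) -> list[str]:
    """Return candidate questions generated from a single passage of slide text."""
    out = qg("generate questions: " + passage, max_length=128)
    # End-to-end QG checkpoints of this kind typically emit several questions
    # joined by a <sep> token; split and strip them here.
    return [q.strip() for q in out[0]["generated_text"].split("<sep>") if q.strip()]

print(generate_questions(
    "A software development life cycle describes the phases a software "
    "product passes through, from requirements analysis to maintenance."
))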

We used approximately 350 slides from an undergraduate course on principles of software engineering as a testbed. Taking those slides as input, our tool initially produced 113 questions, reduced to 94 unique open-ended questions after some input cleaning. We then manually evaluated these questions with respect to wording and relevance, rating each on a scale from 1 (worst) to 5 (best). While some of the produced questions were unclear due to lengthy phrasing, possibly caused by the structure of the input text (e.g., lists, keywords, incomplete sentences), the majority were judged valid and usable for student training purposes.
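The abstract does not describe how the slide text was extracted or how duplicate questions were removed. The sketch below shows one plausible pre-processing step under the assumption that the decks are .pptx files read with python-pptx and that duplicates are detected by simple whitespace and case normalisation; both choices are hypothetical.

# Hypothetical pre-processing: pull raw text from a slide deck and de-duplicate
# generated questions. python-pptx and the normalisation rule are assumptions.
from pptx import Presentation

def extract_slide_texts(path: str) -> list[str]:
    """Return one text blob per slide, skipping slides without any text."""
    texts = []
    for slide in Presentation(path).slides:
        parts = [s.text_frame.text for s in slide.shapes if s.has_text_frame]
        blob = " ".join(p.strip() for p in parts if p.strip())
        if blob:
            texts.append(blob)
    return texts

def deduplicate(questions: list[str]) -> list[str]:
    """Drop exact duplicates after lower-casing and whitespace normalisation."""
    seen, unique = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique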

In comparison, when fed the same 350 slides, ChatGPT initially produced only 12 open-ended questions. While at first glance these 12 questions appeared well-crafted and thoughtful, a deeper analysis revealed several issues. For instance, half of them could not be answered based on the content. Although the individual terms used in the questions were present in the material, the questions were framed in such a way that answering them from the material was impossible.

Automatic question generation is a dynamic and active research area in the field of NLP, particularly when it comes to the use of large language models such as ChatGPT. Our results show, however, that smaller NLP-based solutions may produce similar results and, with some pre-processing, potentially even better and more controllable ones.

Keywords: Learning Technology, Large Language Models, Natural Language Processing, Tool Comparison.