M.J. Diepeveen, H. Westbroek, J. van Muijlwijk-Koezen, D. Scholten
At Vrije Universiteit Amsterdam, a prominent AI research department and several research groups are exploring a range of machine learning (ML) and large language model (LLM) applications. We recently initiated several studies on the effectiveness of AI in our educational programs. We will present various examples and explore one case in detail.
The case study focused on what we can learn from students using LLM tools such as ChatGPT and Copilot to receive feedback on their written assignments. The primary objective was to evaluate whether using these tools benefits students. To achieve this, we needed to establish a comprehensive evaluation framework that goes beyond standard benchmarks and considers the specific needs and intents of the users.
Traditional benchmarks for evaluating LLM performance typically focus on model ability, measuring qualities such as accuracy, coherence, and fluency.
While these benchmarks are helpful, they do not fully capture the effectiveness of LLM tools in educational settings.
To determine the effectiveness of LLM tools in education, it is essential to understand the intent behind their use. By identifying students' specific goals and needs, we can better evaluate whether the tools are meeting these objectives.
In one chemistry course, students were required to complete a written assignment. To prepare, they attended a workshop on using LLM tools like ChatGPT and Copilot. The students then worked individually on their papers, utilising the LLM tools to receive feedback and improve their assignments.
To evaluate the effectiveness of LLM tools in this context, we employed a mixed-methods approach:
1. Interviews with Teachers: We interviewed two teachers who advocated for and permitted the use of LLM tools in their courses. The interviews aimed to gather insights into the teachers' perspectives on the benefits and challenges of using these tools in the classroom.
2. Student Surveys: We surveyed the students to understand their intended use of and experience with the LLM tools. The survey included questions about their goals, how they used the tools, and their perceptions of the tools' effectiveness.
3. Chat History Analysis: We analysed the chat history between students and the LLM tools. This data provided valuable insights into how students interacted with the tools, the types of feedback they received, and how they used this feedback to improve their assignments.
Data Coding and Analysis:
The collected data was coded and analysed to identify common themes and patterns. The coding process involved categorising the data based on the following criteria:
- Needs: What specific needs did the students have when using the LLM tools?
- Intent: What were the students' goals and intentions behind using the tools?
- Effectiveness: Did using LLM tools help students achieve their goals?
By analysing the coded data, we could draw conclusions about the overall effectiveness of LLM tools in this educational context.
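To illustrate how coded data along these three criteria can be tallied to surface common themes, the sketch below uses a hypothetical coding scheme in Python. The `CodedEntry` structure, the category labels, and the example data are assumptions made for illustration only, not the actual coding scheme or results of the study.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical structure for one coded chat-history entry; the coding
# scheme used in the actual study may differ.
@dataclass
class CodedEntry:
    student_id: str
    need: str          # e.g. "grammar feedback", "structure advice"
    intent: str        # e.g. "improve readability", "strengthen argumentation"
    effective: bool    # did the interaction help achieve the stated goal?

def summarise(entries: list[CodedEntry]) -> None:
    """Tally coded entries to surface common themes and patterns."""
    needs = Counter(e.need for e in entries)
    intents = Counter(e.intent for e in entries)
    hit_rate = sum(e.effective for e in entries) / len(entries)

    print("Most common needs:", needs.most_common(3))
    print("Most common intents:", intents.most_common(3))
    print(f"Share of interactions rated effective: {hit_rate:.0%}")

# Example with made-up data:
entries = [
    CodedEntry("s01", "grammar feedback", "improve readability", True),
    CodedEntry("s02", "structure advice", "strengthen argumentation", False),
    CodedEntry("s03", "grammar feedback", "improve readability", True),
]
summarise(entries)
```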
The findings from this case study suggest that LLM tools like ChatGPT and Copilot can be valuable educational aids when used appropriately. They provide students with instant feedback, which can enhance their learning experience and improve their writing skills. However, the effectiveness of these tools depends on several factors, including the students' intent, the quality of the feedback provided, and the teacher's guidance.
Keywords: Higher education, generative AI, LLM.