S. Kluibenschädl, S. Schlögl, S. Kruschel
The use of Large Language Models (LLMs) has shown great promise in recent years, enabling a wide range of use cases, including their use as a backbone for educational chatbots. However, it has been observed that these models sometimes hallucinate, i.e., produce plausible-sounding but incorrect output, especially in domains that were sparsely represented in a model’s training data. Since students are especially susceptible to misinformation, addressing this issue seems particularly important. Retrieval Augmented Generation (RAG), which provides the LLM with additional, relevant data at inference time, is considered a potential solution. In contrast to other methods, RAG does not require additional pretraining or fine-tuning of the LLM, making it an inexpensive and flexible approach to knowledge extension.
Mixtral-8x7b, an open-source LLM, was used as an experimental technology environment to test the suitability of educational learning material as a source for such RAG-based knowledge augmentation. To do so, a local installation of the LLM was given access to a concrete knowledge domain via recorded webinars of an undergraduate IT course on principles of software engineering. A custom data pipeline was built to convert these videos into vector embeddings and store them in a vector database, which was then used for the RAG augmentation.
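To illustrate such a pipeline, the following sketch outlines one possible implementation in Python. The specific tools (openai-whisper for transcription, sentence-transformers for embedding, chromadb as the vector store), as well as all file names and parameters, are illustrative assumptions rather than the exact components of our implementation.

```python
# Illustrative sketch of a webinar-to-vector-database pipeline.
# Tool choices (whisper, sentence-transformers, chromadb), file names and
# parameters are assumptions for demonstration purposes only.
import whisper
import chromadb
from sentence_transformers import SentenceTransformer

# 1. Transcribe a recorded webinar into plain text.
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("webinar_software_engineering_01.mp4")["text"]

# 2. Split the transcript into overlapping chunks suitable for retrieval.
chunk_size, overlap = 1000, 200
chunks = [transcript[i:i + chunk_size]
          for i in range(0, len(transcript), chunk_size - overlap)]

# 3. Embed each chunk and store it in a local vector database.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./webinar_db")
collection = client.get_or_create_collection("software_engineering")
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# 4. At inference time, retrieve the most relevant chunks for a question
#    and prepend them to the prompt sent to Mixtral-8x7b.
question = "What is the purpose of unit testing?"
context = collection.query(
    query_embeddings=embedder.encode([question]).tolist(), n_results=3,
)["documents"][0]
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```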
Next, a typical college-style test was used to evaluate the quality of the solution and compare it to a baseline installation of Mixtral-8x7b without this type of knowledge embedding. We asked both LLMs (i.e., the baseline version and the RAG-augmented version) approximately 30 multiple-choice (MC) and open-ended questions related to the given knowledge domain. The answers to the open-ended questions were then evaluated using the RAGAs framework, which was created to assess RAG-based systems through dedicated metrics. One key metric is faithfulness, or factual accuracy, which improved by approximately 39% with the knowledge-embedded model. Answer relevancy, which assesses how pertinent an answer is to the question, however, exhibited only a slight increase. This indicates that the baseline model was already proficient at providing plausible-sounding answers, independent of whether they were correct or incorrect, which once more highlights the challenge of model hallucinations.
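As an illustration of this evaluation step, the sketch below shows how faithfulness and answer relevancy could be computed with the RAGAs library. The sample data is fictitious, RAGAs relies on a judge LLM under the hood (by default requiring an API key), and the exact column names and imports may vary between ragas versions.

```python
# Illustrative RAGAs evaluation of an open-ended answer.
# Sample question, answer and context are fictitious; imports and column
# names follow a recent ragas release and may differ in other versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["What does the single responsibility principle state?"],
    "answer": ["A class should have only one reason to change."],
    "contexts": [[
        "The single responsibility principle states that every module or "
        "class should be responsible for a single part of the functionality."
    ]],
})

# Faithfulness checks whether the answer is grounded in the retrieved
# contexts; answer relevancy checks how pertinent it is to the question.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```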
Interestingly, on the MC questions, our knowledge-embedded model performed worse than the baseline model. Our hypothesis is that if a model already holds knowledge about a topic and is then provided with additional information via embedded knowledge vectors, its judgement on particular types of questions (in our case, MC questions) may become clouded. This assumption was further substantiated by a small-scale test run with a subset of the questions, in which the RAG-augmented model answered certain questions incorrectly that the baseline model had answered correctly.
On the one hand, the findings of this experimental study demonstrate that it is both feasible and practical to work with open-source LLMs in educational settings, and that RAG can successfully tackle at least some types of LLM hallucinations. On the other hand, creating an LLM-driven “study buddy” that is largely free from error still requires additional tuning work and a better understanding of the LLM reasoning process.
Keywords: Learning Technology, Large Language Models, Retrieval Augmented Generation, Model Comparison.