S. Bieletzke1, F. Naumann2
Correctly detecting emotions in a conversation, such as a chat, can be crucial to how the dialogue unfolds. Automated sentiment analysis is therefore an important part of AI-driven dialogue systems, especially in higher education, where it helps support learning motivation. In the SMARTA project (Student Motivation and Reflective Training AI-Assistants), chatbots were developed that interact with students, for example to sustain their motivation. To allow the chatbots to respond empathetically to motivation problems, procrastination, or social challenges, they had to detect the correct sentiment during the dialogue. This paper presents an empirical quality assessment of automated AI-based sentiment detection.
In a comparative study, sentiment classifications by linguistic experts, students, and different GPT models were analyzed. The results show that even older language models such as GPT-3.5 can recognize sentiment reliably. As expected, optimized prompts and newer models achieve a higher level of agreement with human assessments. The analysis is based on specific quality metrics such as accuracy, precision, and recall, as well as confusion matrices. Deviations such as false negatives and false positives are rare and mainly affect neutral cases, on which human assessments also vary. It is therefore suggested that AI performance should not be measured against a zero-error rate but compared to human error rates, a comparison the authors call the Human-AI-Gap-Benchmark.
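The quality metrics named above can be illustrated with a minimal sketch. It assumes three sentiment labels and treats the expert ratings as the reference. The toy ratings, the function names, and the reading of the Human-AI-Gap as the difference between human and model accuracy are illustrative assumptions, not data or definitions taken from the study.

```python
from collections import Counter

# Assumed label set; the study's actual categories may differ.
LABELS = ("negative", "neutral", "positive")


def confusion_matrix(reference, predicted, labels=LABELS):
    """Rows are reference labels, columns are predicted labels."""
    counts = Counter(zip(reference, predicted))
    return [[counts[(r, p)] for p in labels] for r in labels]


def accuracy(reference, predicted):
    """Share of items where the prediction matches the reference."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def precision_recall(reference, predicted, label):
    """One-vs-rest precision and recall for a single label."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def human_ai_gap(reference, human, model):
    """Hypothetical gap measure: human accuracy minus model accuracy."""
    return accuracy(reference, human) - accuracy(reference, model)


# Invented toy ratings, for illustration only.
expert = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
student = ["negative", "negative", "positive", "neutral", "positive", "positive"]
model = ["negative", "negative", "neutral", "negative", "positive", "positive"]

matrix = confusion_matrix(expert, model)
neg_precision, neg_recall = precision_recall(expert, model, "negative")
gap = human_ai_gap(expert, student, model)
```

In this toy data both the student and the model misjudge one neutral case, so the gap is zero even though neither rater is error-free, which is the kind of comparison the benchmark is meant to make visible.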
Despite its high reliability, sentiment analysis remains a black-box process that is not fully explainable. This paper briefly examines regulatory implications under the EU AI Act, in particular whether the method falls under medical AI or prohibited emotion recognition. To comply with the AI Act, a technical method is proposed in which sentiment analysis results are stored anonymously during the chat, while the chat history itself is deleted immediately after the conversation ends. Because of its high precision and accuracy, especially in the critical task of detecting negative sentiment, automated sentiment analysis is used as a trusted module within the SMARTA project.
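The storage approach described above could be sketched roughly as follows. The `ChatSession` class, its method names, and the use of a random session identifier are hypothetical and do not reflect the SMARTA implementation; whether such a design satisfies the AI Act is a legal question this sketch does not answer.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChatSession:
    """Hypothetical session: keeps labels, discards text when closed."""

    # Random identifier, assumed not to be linked to any user account.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    messages: list = field(default_factory=list)
    sentiments: list = field(default_factory=list)

    def add_turn(self, text, sentiment):
        """Record a message and the sentiment label detected for it."""
        self.messages.append(text)
        self.sentiments.append(sentiment)

    def close(self):
        """Return the anonymous sentiment record and delete the chat text."""
        record = {"session": self.session_id, "sentiments": list(self.sentiments)}
        self.messages.clear()
        return record


session = ChatSession()
session.add_turn("I keep postponing my thesis.", "negative")
session.add_turn("Thanks, that plan sounds doable.", "positive")
record = session.close()
```

After `close()` only the sequence of sentiment labels survives; the message text is gone from the session object.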
Keywords: sentiment analysis, AI-driven dialogue systems, GPT, SMARTA, chatbot, empirical assessment, EU AI Act, motivation, higher education.