Abstract No. 1357

WHEN AI TEACHES GEOMETRY WRONG: SYSTEMATIC ERRORS IN LLM-GENERATED EXPLANATIONS AND THEIR EDUCATIONAL RISKS
J. Přibyl, M. Krátká, M. Tichá
Jan Evangelista Purkyně University, Faculty of Science (CZECH REPUBLIC)
As large language models (LLMs) become increasingly integrated into educational settings, their role in assisting mathematics learning—particularly in geometry—requires critical scrutiny. While LLMs produce fluent and confident responses, these qualities often obscure significant conceptual errors. Geometry, with its reliance on formal definitions, categorical distinctions, and spatial reasoning, offers a robust context in which to evaluate the depth and consistency of AI-generated explanations.

This study analyzes the responses of six prominent LLMs (ChatGPT, Claude, Gemini, Mistral Large, Copilot Quick-Nuance, and Copilot Deep-Thinker) to 49 conceptual geometry questions in Czech and 48 in English; the difference in prompt count arises from an overlap introduced during translation. In total, 582 responses were examined for correctness, clarity, and internal consistency. To ensure fair comparison within each language, each model was queried in a language-matched environment; in particular, English prompts were submitted under simulated English-speaking conditions, including adjustments to system locale and IP address.

The results reveal substantial limitations in the models' geometric reasoning. Conceptual errors were inductively categorized into types, including misinterpretation of definitions (e.g., confusing inscribed and circumscribed circles), incorrect geometric properties (e.g., attributing axial symmetry to general parallelograms), and flawed logical inferences (e.g., assuming that all shapes with equal angles must have equal sides). Several models also demonstrated a tendency to "hallucinate" mathematical terminology (e.g., inventing the term "divergent triangle") or to offer oversimplified or misleading generalizations. Notably, all six models confused the terms circle and disk, which led to systematic misclassification of intersection scenarios.

Quantitative analysis focused on error frequency per response, the distribution of error types across models and languages, and variation in model performance. While some models performed marginally better than others, the differences were inconsistent and topic-dependent. The findings emphasize that, despite impressive linguistic fluency, current LLMs lack a stable and conceptually grounded understanding of elementary geometry. This has serious implications for their use in education, especially in languages other than English and in tasks requiring formal precision.

This paper argues that educational use of LLMs in mathematics should be approached with caution and paired with pedagogical strategies that foster critical evaluation of AI-generated content. The study also underscores the need for benchmarks in mathematical literacy that go beyond symbolic computation and address conceptual depth, especially in multilingual settings.

Keywords: AI in mathematics education, large language models, conceptual errors in geometry, language and reasoning in AI, LLM-based instruction.

Event: ICERI2025
Session: Risks and Challenges of AI
Session time: Tuesday, 11th of November from 10:30 to 12:00
Session type: ORAL