Abstract No. 1439

AN EMPIRICAL STUDY OF GRADING WRITING TESTS USING LLMS
C. Sun, J. Adachi
California State University, Los Angeles (UNITED STATES)
Grading is one of the most crucial components of education, as it provides important feedback and serves as a key motivator for students. At the same time, grading is one of the most time-consuming tasks for teachers, especially with large class sizes and limited instructional support. In this study, we focus on the use of large language models (LLMs) for grading writing tests. This work falls within the body of research on automated essay scoring (AES), a topic studied since the 1960s. Traditional AES systems have relied on supervised machine learning techniques. More recently, LLMs have demonstrated remarkable capabilities in natural language understanding and generation, leading to their exploration in a variety of educational applications, including essay grading.

A number of studies have already evaluated LLMs for AES, though most of them used the models "as is" without any optimization, and the few that applied optimization stopped at prompt engineering. Our study differs from previous work in three important ways. First, we explore multiple optimization techniques, including prompt engineering (by both humans and AI), n-shot prompting, and fine-tuning. Second, we compare reasoning and non-reasoning models to understand whether reasoning architectures offer an advantage in essay grading. Third, we investigate cross-dataset generalization by optimizing models on one dataset and evaluating them on another, simulating a common scenario in real-world AES usage.
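To make the n-shot setup concrete, the sketch below shows one way such grading could be wired up against the OpenAI chat completions API. The rubric wording, the JSON output format, and the helper names are illustrative assumptions, not the exact prompts or code used in the study.

    # Minimal sketch of n-shot rubric grading via the OpenAI chat completions API.
    # Prompts, score scale, and output format are illustrative assumptions only.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    RUBRIC = ["Audience", "Purpose", "Clarity", "Conciseness", "Accessibility"]

    SYSTEM_PROMPT = (
        "You are grading a short student essay. Score each criterion from 1 to 5: "
        + ", ".join(RUBRIC)
        + ". Respond with a JSON object mapping each criterion to an integer score."
    )

    def build_messages(essay: str, graded_examples: list[tuple[str, dict]]) -> list[dict]:
        """Build an n-shot prompt from (essay, human_scores) pairs plus the target essay."""
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        for example_essay, human_scores in graded_examples:
            messages.append({"role": "user", "content": example_essay})
            messages.append({"role": "assistant", "content": json.dumps(human_scores)})
        messages.append({"role": "user", "content": essay})
        return messages

    def grade_essay(essay: str, graded_examples: list[tuple[str, dict]], model: str = "gpt-4.1") -> dict:
        """Return the model's per-criterion scores as a dict."""
        response = client.chat.completions.create(
            model=model,
            messages=build_messages(essay, graded_examples),
        )
        return json.loads(response.choices[0].message.content)

With zero graded examples this reduces to the unoptimized "as is" baseline; adding human-graded (essay, scores) pairs to the message history turns it into n-shot prompting.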

The two datasets used in this study come from two different writing tests given in two different classes. Both writing tests share the same template and grading rubric, but the writing prompts differ. All student essays were graded by one of the authors, who was in charge of the writing tests. We used three recently released LLMs (GPT 4.1, o4-mini, and Claude 3.7) to grade the student essays and compared their scores with those of the human grader across five rubric criteria: Audience, Purpose, Clarity, Conciseness, and Accessibility. Our findings are summarized below.

First, we used the non-reasoning model GPT 4.1 without any optimization to establish a baseline. The baseline shows reasonable accuracy for the Purpose and Accessibility criteria, with a mean absolute difference (MAD) of 0.4 from human grading, but the model struggles with Clarity and Conciseness, whose MADs are close to 1. Second, we find that without optimization, the reasoning models o4-mini and Claude 3.7 perform comparably to or worse than the non-reasoning GPT 4.1 across criteria, suggesting that, unlike coding or math problems, short essay scoring does not benefit from the sophisticated chain-of-thought capabilities of reasoning models. Third, we experimented with a number of optimizations, including model fine-tuning, n-shot prompting, and additional contextualizing prompts generated both by the human grader and by the models themselves from graded examples. Some of these optimizations yield significant improvements in grading accuracy, especially for the non-reasoning GPT 4.1, while others are less effective. Lastly, we investigated cross-dataset generalization and found that optimization based on one dataset is much less effective on a different dataset, highlighting a difficulty of applying LLM-based AES in real-world settings: the model may need to be re-optimized for each new assignment.
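For reference, the MAD reported above is simply the average absolute gap between the model's score and the human grader's score on a given criterion; a small helper along these lines (hypothetical names and example numbers) computes it:

    def mean_absolute_difference(llm_scores: list[float], human_scores: list[float]) -> float:
        """MAD = (1/N) * sum(|llm_i - human_i|) over all N essays for one rubric criterion."""
        assert len(llm_scores) == len(human_scores)
        return sum(abs(l - h) for l, h in zip(llm_scores, human_scores)) / len(llm_scores)

    # Example: one criterion scored across four essays (illustrative numbers only).
    print(mean_absolute_difference([4, 3, 5, 2], [4, 4, 5, 3]))  # -> 0.5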

Keywords: Automated Essay Scoring, Large Language Models.

Event: ICERI2025
Track: Digital Transformation of Education
Session: Data Science & AI in Education
Session type: VIRTUAL