S. Ratner1, M. Lynch2, G. Cooney3
As education systems face growing demand for scalable, consistent, and cost-effective feedback on student writing, Large Language Models (LLMs) may offer a promising yet under-explored solution. This study investigates whether LLMs can deliver human-equivalent essay scoring and feedback without prompt-specific pretraining. Most prior implementations of automated assessment have relied on training models for specific tasks or prompts, an approach that limits scalability and adaptability across diverse educational contexts. This research addresses the critical question of whether prompt engineering alone, without fine-tuning, can enable LLMs to produce reliable, rubric-aligned evaluation comparable to human grading.
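To make the rubric-aligned prompting approach concrete, the sketch below shows what a rubric-embedded scoring prompt might look like, assuming an OpenAI-compatible chat API. The rubric text, trait names, score scale, and model name are illustrative placeholders, not the study's actual prompts or model versions.

```python
# Illustrative sketch of rubric-embedded prompting for essay scoring.
# Rubric wording, traits, scale, and model name are placeholders only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score each trait from 1 (lowest) to 5 (highest):
- Organization: logical structure, paragraphing, transitions
- Grammar: sentence-level correctness and mechanics
- Clarity: precision and readability of expression"""

def score_essay(essay_text: str) -> str:
    """Ask the model for rubric-aligned trait scores with brief justifications."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an experienced writing assessor. "
                        "Apply the rubric exactly and return one score per trait "
                        "with a one-sentence justification for each."},
            {"role": "user", "content": f"Rubric:\n{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
        temperature=0,  # low-variance decoding for scoring consistency
    )
    return response.choices[0].message.content
```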
We adopted a three-phase, design-based research (DBR) methodology to test both standard and advanced LLM configurations across thousands of student essays from Grades 3 to 11, encompassing multiple genres and rubric types. In Phase 1, we established proof of concept using early structured prompts embedded with analytic rubrics. Phase 2 focused on building infrastructure, refining prompt strategies, and integrating scalable model evaluation tools. In Phase 3, we conducted longitudinal testing across three quarterly model releases in 2024, using psychometric measures including Quadratic Weighted Kappa (QWK), Intraclass Correlation Coefficients (ICC), and Root Mean Squared Error (RMSE) to evaluate alignment with human ratings.
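For readers unfamiliar with the agreement metrics named above, the following minimal sketch shows how QWK, ICC, and RMSE can be computed between human and model scores using NumPy and scikit-learn. The scores shown are hypothetical; this is not the study's evaluation pipeline.

```python
# Minimal sketch of the human-model agreement metrics (QWK, ICC(2,1), RMSE).
# The rating vectors are hypothetical examples, not data from the study.
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_squared_error

human = np.array([3, 4, 2, 5, 4, 3])   # human rubric scores (illustrative)
model = np.array([3, 4, 3, 5, 3, 3])   # LLM rubric scores (illustrative)

qwk = cohen_kappa_score(human, model, weights="quadratic")
rmse = np.sqrt(mean_squared_error(human, model))

# ICC(2,1): two-way random effects, absolute agreement, single rater,
# from the standard ANOVA mean squares.
ratings = np.column_stack([human, model])          # subjects x raters
n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)
col_means = ratings.mean(axis=0)
ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
ms_err = (np.sum((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2)
          / ((n - 1) * (k - 1)))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

print(f"QWK={qwk:.2f}  ICC(2,1)={icc:.2f}  RMSE={rmse:.2f}")
```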
Results show that the standard model advanced from moderate to substantial agreement with human graders, with median QWK improving from 0.49 to 0.70 and a peak of 0.81. The advanced model performed consistently better, with median QWK rising from 0.60 to 0.77 and a maximum of 0.91, exceeding the threshold for expert-level agreement. Percentile analyses and visual diagnostics confirmed improved scoring consistency and reduced variance over time, particularly for structural and surface-level traits such as grammar, organization, and clarity. Limitations remained, however, in scoring highly contextual or creative dimensions and in producing pedagogically nuanced feedback. Nonetheless, the findings affirm the viability of rubric-based prompting as a scalable alternative to traditional fine-tuning approaches.
This study offers a replicable framework for implementing AI-driven scoring in formative and summative assessment, particularly in settings where speed, transparency, and cross-grade applicability are essential. It contributes to the emerging consensus that, under well-calibrated prompting conditions, LLMs can function as reliable evaluators across a broad range of educational tasks, thereby supporting equity, efficiency, and pedagogical utility in digitally mediated learning environments.
Keywords: Marking, essay feedback, AI, LLMs, education.