M. Strong
This is the final report of a study comparing AI with human judges in the evaluation of teaching performance. In previous work we conducted several experiments to assess how well human judges can distinguish teachers who succeeded in raising student achievement from those who did not, and the extent to which the judges agree or disagree with one another on the choices they make. We found that humans showed significant limitations in identifying the successful teachers, yet their choices were not random; they appeared to be influenced by other factors or biases. We replicated these findings under varying conditions (stimuli, judge expertise, type of rubric) and obtained similar results in each case. With the growing interest in the use of AI in education, we considered its potential to transform and redefine teacher evaluation. Specifically, we asked whether AI's analytical capabilities could overcome some of the biases and lack of comprehensiveness evident among human judges using traditional evaluation frameworks.
We investigated two types of evaluative judgments: intuitive (unstructured) and rubric-based (structured). We investigated unstructured judgments by asking subjects to “use their intuition or existing knowledge” to classify classroom instruction of known quality as representative of teachers of either high or low effectiveness. We define "effectiveness" as improvement in student learning as measured by standardized tests. We investigated structured judgments by asking subjects to count the occurrences of six concrete teaching behaviors using a low-inference rubric. With these data we compared the performance of the human judges and the AI. The tasks performed by the human subjects replicate experiments we previously conducted and whose results are published. This design allows us to compare the performance of AI and humans on the same tasks at the same time, as well as against the human judges in our previous studies. We recruited 100 human subjects to serve as a point of comparison for the AI; they completed the tasks on an online platform. We compared accuracy and reliability across groups and tasks, which provided a basis for judging the relative success of AI and human judges.
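To make the comparison concrete, the following minimal Python sketch shows one way to compute accuracy against a ground-truth label and chance-corrected inter-judge agreement (Cohen's kappa) for a binary high/low classification task. The data, judge names, and function names are hypothetical illustrations under our stated assumptions, not the study's actual materials or analysis code.

    # Illustrative sketch (not the study's actual analysis pipeline):
    # accuracy against ground truth and pairwise inter-judge agreement.
    # Labels are hypothetical: 1 = "high effectiveness", 0 = "low effectiveness".
    from itertools import combinations

    def accuracy(judgments, truth):
        """Proportion of clips a judge classified correctly."""
        return sum(j == t for j, t in zip(judgments, truth)) / len(truth)

    def cohens_kappa(a, b):
        """Chance-corrected agreement between two judges on binary labels."""
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        p_a1, p_b1 = sum(a) / n, sum(b) / n
        expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
        return (observed - expected) / (1 - expected)

    # Hypothetical data: ground truth for 8 clips and three judges' classifications.
    truth = [1, 0, 1, 1, 0, 0, 1, 0]
    judges = {
        "human_1": [1, 0, 0, 1, 0, 1, 1, 0],
        "human_2": [1, 1, 0, 1, 0, 0, 1, 0],
        "ai":      [1, 0, 1, 1, 0, 0, 0, 0],
    }

    for name, labels in judges.items():
        print(f"{name}: accuracy = {accuracy(labels, truth):.2f}")

    for (n1, a), (n2, b) in combinations(judges.items(), 2):
        print(f"kappa({n1}, {n2}) = {cohens_kappa(a, b):.2f}")

Accuracy captures how often a judge matches the ground truth, while kappa captures how consistently two judges agree with each other beyond what chance alone would produce; the two measures can diverge, which is why we report both.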
In previous studies, experts and non-experts did no better than chance when they relied solely on their intuitive judgment, and experts fared no better when using high-inference rubrics. However, both experts and non-experts were more accurate than chance when they used low-inference rubrics, and they were just as accurate when judging transcripts of instruction as when judging video. In the present study, we hypothesized that AI would judge teaching effectiveness better than humans, given that machines perform low-inference tasks well and AI in particular is adept at “understanding” written text such as transcripts. We report the findings from this comparison. By some measures AI outperformed humans; by others it did not. The results suggest that, after substantial training, AI has potential for use in teacher evaluation. This raises the question of whether human judges should be replaced by machines. Our data may help answer this question, and we hope to engage our audience in a discussion of the moral dilemmas it poses.
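To illustrate what "no better than chance" means operationally for a two-category task, the sketch below runs an exact one-sided binomial test of a judge's accuracy against the 50% chance level. The counts and function name are hypothetical examples, not figures from our studies, and the test shown is one common choice rather than necessarily the procedure used in the earlier work.

    # Illustrative sketch: exact binomial test of whether classification
    # accuracy exceeds chance (p = 0.5) on a two-category task.
    from math import comb

    def binomial_p_value(correct, trials, chance=0.5):
        """One-sided P(X >= correct) under a Binomial(trials, chance) null."""
        return sum(
            comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
            for k in range(correct, trials + 1)
        )

    # Hypothetical example: a judge classifies 60 clips and gets 38 correct.
    p = binomial_p_value(38, 60)
    print(f"accuracy = {38/60:.2f}, one-sided p = {p:.3f}")  # p falls below 0.05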
Keywords: Teacher observation, evaluation, AI.