STUDENTS AT RISK OF SCHOOL FAILURE IN ITALIAN CONTEXT: A MACHINE LEARNING APPROACH
F. Noccioli, M. Marsili, P. Falzetti
Recent studies have explored the potential of machine learning techniques applied to data from the Italian school system. In this context, supervised machine learning models, i.e. models trained on labeled data, have been compared with the more traditional models generally applied in the field of education, in order to assess the potential of a class of models that has seen a growing number of applications in recent years.
This study is focused on comparing machine learning models for a classification problem and identifying the most effective classifier within this context.
In particular, the aim is to estimate the risk of school failure based on the regularity in the course of studies and the scores obtained in the INVALSI (National Institute for the Evaluation of the Education and Training System) standardized tests in Italy.
An integrated dataset drawing on INVALSI and Ministry of Education and Merit (MIM) sources was constructed, covering two cohorts of students: grade 5 students in the 2017-18 school year and grade 5 students in the 2018-19 school year. The target variable identifies grade 5 students at risk of school failure. It is a binary variable that takes the value 1 if the student, at grade 8, does not achieve the basic skills in both Italian and mathematics in the INVALSI tests, or if she or he does not reach grade 8 regularly within three years. Students for whom this information could not be obtained were excluded. Missing values in the contextual variables were treated using multiple imputation by chained equations (MICE). The data were then transformed and standardized before the training, hyper-parameter tuning and evaluation stages.
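The target definition above can be illustrated with a minimal sketch. The record layout and field names are hypothetical, since no data schema is given here, and the sketch assumes one particular reading of the rule: a student who progresses regularly is flagged only when the basic level is missed in both subjects. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudentRecord:
    # Hypothetical fields; None means the information could not be obtained.
    reached_grade8_on_time: Optional[bool]  # regular progression from grade 5 to grade 8
    italian_basic_achieved: Optional[bool]  # grade 8 INVALSI Italian, basic level reached
    maths_basic_achieved: Optional[bool]    # grade 8 INVALSI mathematics, basic level reached


def at_risk(student: StudentRecord) -> Optional[int]:
    """Return 1 (at risk), 0 (not at risk) or None (excluded for lack of information)."""
    if student.reached_grade8_on_time is None:
        return None
    if not student.reached_grade8_on_time:
        return 1  # irregular progression alone is enough to flag the student
    if student.italian_basic_achieved is None or student.maths_basic_achieved is None:
        return None
    # Assumed reading: flagged only when basic skills are missed in BOTH subjects.
    below_both = (not student.italian_basic_achieved) and (not student.maths_basic_achieved)
    return int(below_both)
```

Returning None rather than 0 for incomplete records keeps excluded students separate from students who are not at risk, mirroring the exclusion rule described above.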
Different machine learning models that are suitable for classification problems and scale well to large datasets were trained and tested on the 2017-18 cohort of grade 5 students, and their performance was measured using accuracy, precision, recall, F1 score and ROC-AUC. The chosen model was subsequently applied to the cohort of grade 5 students in the 2018-19 school year in order to predict the outcome.
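For reference, the evaluation metrics named above can be computed from true labels, predicted labels and predicted scores as in the following sketch. It uses only the standard library, the example labels and scores are invented for illustration, and the pairwise ROC-AUC computation is quadratic in the number of students, so it is meant to show the definition rather than to process a full cohort.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for a binary target (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: six students, predicted risk scores thresholded at 0.5.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Precision and recall are reported separately because the cost of missing an at-risk student differs from the cost of flagging one unnecessarily, while ROC-AUC summarizes ranking quality independently of the chosen threshold.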
This study highlights the potential of predictive analytics in identifying students at risk of school failure. In the comparison carried out, the Gradient Boosted Machine models show the best performance, with good accuracy and a fair balance between precision and recall. However, some limitations also emerged, which suggest that further improvement techniques should be adopted.
The analysis of the most relevant variables showed the significant role of teachers' grades and performance in the INVALSI tests. The possibility of analyzing the decision mechanisms is important because it allows us to better understand the functioning of the model and the contribution of the individual variables within it.
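The technique used to rank the variables is not specified here. As one model-agnostic illustration, permutation importance measures how much accuracy drops when the values of a single variable are shuffled across students. The sketch below is an assumption-laden example, not the authors' procedure: the toy model, the feature layout (teacher grade, unrelated noise) and the data are all invented.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], X: Sequence[Row], y: Sequence[int]) -> float:
    return sum(1 for row, t in zip(X, y) if predict(row) == t) / len(y)


def permutation_importance(
    predict: Callable[[Row], int],
    X: Sequence[Row],
    y: Sequence[int],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row: Row) -> int:
    # Hypothetical stand-in for a trained classifier: flags a teacher grade below 6.
    return int(row[0] < 6.0)


# Invented data: column 0 is a teacher grade, column 1 is unrelated noise.
X = [[4.0, 0.3], [5.0, 0.9], [5.5, 0.1], [4.5, 0.7],
     [7.0, 0.2], [8.0, 0.8], [9.0, 0.5], [6.5, 0.4]]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

A variable the model ignores yields an importance of exactly zero, while a variable the model relies on yields a positive drop, which is the kind of contrast that supports reading a model's decision mechanisms.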
Furthermore, the results on the cohort data were stable, confirming the robustness of the selected model over time. It can be concluded that a suitably trained machine learning model can be a powerful prediction tool on large amounts of data for use in educational policy interventions.
Keywords: Machine learning, prediction models, classification, education.