Thanks @Simon for sharing this. Despite the very deep insights, I feel the conversation only scratched the surface, and I'd like to add some reflections.
The purpose and explainability of predictive models is something we should all reflect on very carefully. As Vitomir noted, there is a tendency to turn observations emerging from the descriptive value of models into causal inferences that carry prescriptive value, or that inform decision-making.
If we take a step back from the data, especially in universities or schools, organisations must have structures in place that enable open discussion and inclusive support systems to deal with individual needs.
The fact that a model identifies a student as potentially 'at risk' is fine, as long as the assumptions are explained and shared with the student as the 'target' of the model (the explainability mentioned above).
In most cases we have the students' best interests at heart and want to help them succeed (good intentions, a laudable purpose), but the nature and assumptions of the model may not have all the desired effects, and David's point about the ethical principles driving why we should use data in the first place is a very important one. Students should be stakeholders in the process; this does not mean that they know better, but they should be able to make their own decisions. Sometimes students will feel that the prediction is incorrect, or they will point out themselves that there are other variables affecting the validity of the prediction which were not included in the model. LA models are not equivalent to the fuel gauge light in a car, which tells you that if you don't refill you will not reach your destination.
Andrew very succinctly pointed out many of the biases/problems involved, which I very much agree with.
"Better training data results in better models" in particular is one I'd like to pick up on: what does 'better' actually mean?
More data? More comprehensive data (i.e. including more features)? More representative data? (By definition, the last makes it more biased towards the sample in question.)
There is a lot of work in educational data mining focusing on feature selection (tapping into the issue of the viability of computational or statistical models).
This is for me a double-edged sword: on one side it is a rational, systematic and conscious process to identify the most relevant variables to explain what we are trying to measure (or to generate a prediction), but at the same time it removes the natural richness of variation in the sample, which obfuscates the clarity of predictions. The tradition of statistical inference and 'scientificity' has permeated the field of psychology for the last century and fundamentally boils down to a philosophical stance about the measurability of human characteristics and behaviours (and I'm sure that this conversation could easily go off-track...).
In the end, I do not believe that there is a fundamental issue with the models being 'colour-blind': there are fundamental issues with how these models are explained and used, in the same way we don't try to tighten a screw with a hammer. This means that the 'metadata' about the model, the assumptions, the detailed characterisation of the decisions made by the modeller which are specific to the data used, and the actual reasons why the model was created in the first place should all be packaged and presented together. Yes, I know this is not what managers want to hear, as it will not give them a straight answer about what to do next, but I'm sure that a better understanding will save them a lot of the hassle that comes from the simplification of the outcomes and the potential perception of bias that accompanies it.
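A minimal sketch of what 'packaging it all together' could look like (the class and field names are my own illustration, not a standard): the prediction never travels without the metadata describing why the model exists, what it assumes, and what the modeller decided.

```python
# Illustrative sketch (hypothetical names): a prediction bundled with the
# model 'metadata' so assumptions and purpose travel with the output.
from dataclasses import dataclass

@dataclass
class ModelCard:
    purpose: str              # why the model was created in the first place
    assumptions: list[str]    # what the modeller took for granted
    data_decisions: list[str] # feature selection, exclusions, etc.
    limitations: list[str]    # known gaps and failure modes

@dataclass
class ExplainedPrediction:
    label: str                # e.g. "at risk"
    card: ModelCard           # the metadata is inseparable from the output

card = ModelCard(
    purpose="early alert to offer individualised support, not to rank students",
    assumptions=["LMS activity is a proxy for engagement"],
    data_decisions=["kept only features correlated with past outcomes"],
    limitations=["variables outside the model may invalidate a prediction"],
)
pred = ExplainedPrediction(label="at risk", card=card)
print(pred.label, "-", pred.card.purpose)
```

The design point is simply that the label alone is never the deliverable; anyone acting on `pred.label` also sees the purpose and assumptions behind it.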
And of course, the self-fulfilling prophecy or vicious circle of certain factors affecting future outcomes is critical, but pretending that these factors are not there will invalidate the model because crucial pieces of the puzzle are missing!
What is more important is the action taken after the prediction is generated, and I strongly support the idea that individualisation is what we should focus on.
Dr Lorenzo Vigentini (Cpsychol, AFMBPsS, FHEA, IEEE)
Academic Lead, Educational Intelligence & Analytics, PVC(Education) Portfolio
Adjunct Senior Lecturer to the School of Education and the School of Computer Science & Engineering.
Research Associate of the Higher Education Academy
Level 10 Library Stage II, UNSW, Kensington 2052 Australia
THE UNIVERSITY OF NEW SOUTH WALES
SYDNEY NSW 2052 AUSTRALIA
t: +61 (2) 9385 6226