Should predictive models of student outcome be “colour-blind”?


s.buckin...@gmail.com

Sep 16, 2020, 9:31:31 PM
to Learning Analytics
Hi all

I'd welcome your thoughts, so sharing here...

Should predictive models of student outcome be “colour-blind”?

"This post was sparked by the international condemnation of George Floyd’s death, and the many others who came before him. Many communities and institutions are now reflecting on how structural racism manifests in their work. 

This is a tentative step into issues of race, about which I should declare I have no academic grounding. Nonetheless, it is important to ask what the implications are for a specific form of Learning Analytics, namely the predictive modelling of student outcomes. Should demographic attributes such as ethnicity be explicitly modelled, or should the models be “colour-blind”? While all categories have politics, this struck me as an interesting question, given that such techniques are demonstrating their value specifically in levelling the university playing field for all students."

Vitomir Kovanovic

Sep 16, 2020, 9:59:10 PM
to s.buckin...@gmail.com, Learning Analytics
Hi Simon,

As a general rule, I disagree with the idea that a predictive model should avoid using a particular feature if that feature helps it achieve its goal. After all, predictive models are pretty straightforward - they have a training set and a clear goal: to get the accuracy as high as possible (by any means possible). It is up to the LA professionals to provide good datasets. If you start with a racist dataset, you will end up with a racist predictive model, as simple as that. Also, whatever the prediction was, the model needs to learn from it. It is one thing to make a bad prediction; it is much worse not to use that data to build better models.

It should be noted that predictive models find correlations: if a student is in the 35-50 year bracket and doesn't watch X number of lectures, we predict they are at risk. However, I see that many people want to use them in a causal manner: because the student is in the 35-50 year bracket and didn't watch X number of videos, they will fail. And if you come to the student with that explanation (even when the prediction was accurate), it is obvious that some will be highly offended (and rightfully so). So really, the problem is not in the model, but in the way we interpret models and why we use them. It is one thing to use them to predict something; it is entirely another thing to use them to devise actions for improvement (which implies causality).
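To make that concrete, here is a minimal sketch (synthetic data, hypothetical column names) of the kind of purely correlational at-risk model I mean: it associates an age bracket and lecture-watching with risk, but says nothing about why a student is at risk.

# A minimal sketch of a correlational "at risk" predictor.
# Data and column names are hypothetical, for illustration only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age_35_50":        [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = in the 35-50 bracket
    "lectures_watched": [2, 9, 1, 8, 3, 10, 0, 7],
    "at_risk":          [1, 0, 1, 0, 1, 0, 1, 0],
})

X = df[["age_35_50", "lectures_watched"]].values
model = LogisticRegression().fit(X, df["at_risk"])

# The model encodes an association, not a cause: a high predicted probability
# does not mean the age bracket or the missed lectures *made* the student fail.
print(model.predict_proba([[1, 2]]))   # a 35-50 year old who watched 2 lectures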

Best,
Vita

---------------------------------------------------------------------------------

Dr Vitomir Kovanović BSc, MSc, PhD | Senior Lecturer

UniSA Education Futures

Room BH3-21, City West Campus | SA 5000

tel +61 8 8302 7377

Vitomir....@unisa.edu.au | www.unisa.edu.au

 


Rene Kizilcec

Sep 16, 2020, 10:18:09 PM
to s.buckin...@gmail.com, Learning Analytics, Rene Kizilcec
I enjoyed reading your post, Simon. A piece that I think is missing in the argument is the following:

There is strong evidence now that omitting protected attributes from predictive models can harm members of historically disadvantaged groups. See e.g. Kleinberg et al. (2018): https://www.cs.cornell.edu/home/kleinber/aer18-fairness.pdf We discuss this idea of 'fairness by unawareness' further, in the context of algorithmic fairness, in a forthcoming chapter: https://arxiv.org/abs/2007.05443

The issue is that much of the problem lies in the training data itself, which is generated in an environment with many biases and instances of discrimination (e.g. systematic differences in SAT/ACT/GRE scores). The research shows that adding protected attributes to the predictive model can actually help mitigate those biases in the data by applying different parameters to different groups (e.g. making a lower GRE score less decisive for applicants from groups that score lower on average).
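As a rough illustration of what "different parameters for different groups" can mean (a sketch only, with synthetic data and made-up variable names): a group-blind model is forced to apply one GRE weight to everyone, while a group-aware model can fit a separate intercept and GRE weight per group.

# Sketch: group-blind vs group-aware predictor (synthetic data, hypothetical names).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                          # 1 = historically disadvantaged group
ability = rng.normal(0, 1, n)                          # unobserved "true" preparedness
gre = ability - 0.8 * group + rng.normal(0, 0.5, n)    # test score systematically lower for group 1
success = (ability + rng.normal(0, 0.5, n) > 0).astype(int)

# Group-blind: a single GRE weight applied to everyone.
blind = LogisticRegression().fit(gre.reshape(-1, 1), success)

# Group-aware: group indicator plus a group-by-GRE interaction,
# i.e. a separate intercept and GRE weight for each group.
X_aware = np.column_stack([gre, group, gre * group])
aware = LogisticRegression().fit(X_aware, success)

# The same GRE score is read differently depending on group membership:
print(blind.predict_proba([[-0.8]]))            # group-blind view of a score of -0.8
print(aware.predict_proba([[-0.8, 1, -0.8]]))   # group-aware view of the same score for group 1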

So long as we train on data that reflects real-world inequities, especially ones that disadvantage groups underrepresented in the data, I find it difficult to avoid the inclusion of group identifiers. Not because they improve overall accuracy by a marginal amount, but because they improve the fairness of our predictive models.

Cheers, Rene

nfonsang

Sep 16, 2020, 10:18:39 PM
to s.buckin...@gmail.com, Learning Analytics
This is a great question (should demographic attributes such as ethnicity be explicitly modelled, or should the models be "colour-blind"?). I think it depends on how the model is intended to be used. Unfortunately, statistics seems to have been used in the past (intentionally or unintentionally) to reinforce stereotypes and discrimination based on demographic characteristics, especially when the results of a model are not correctly interpreted. An example is when a correlation or relationship between variables is interpreted as causation without any controlled experimental study. If a model predicts that students from ethnic group A have higher math scores on average and that students from group B have lower math scores on average, there is a relationship between ethnicity and performance, but results of this kind have been interpreted as students from ethnic group A being genetically smarter than students from ethnic group B. This is a problematic interpretation; such results should rather drive researchers to find out why there is a systematic difference in the scores of students from the various groups. In such situations, maybe students from group B are not performing well due to poverty: they may lack internet access at home and not have the additional resources that students in group A have, so resources might be the third variable responsible for the relationship between ethnicity and students' scores.
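A toy simulation (entirely made-up numbers) illustrates this third-variable point: when resources differ between groups and drive scores, group membership appears to "predict" scores, but the apparent effect shrinks once resources are accounted for.

# Toy simulation of a third variable (resources) confounding the ethnicity-score
# relationship. All numbers are made up, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000
group_b = rng.integers(0, 2, n)                      # 1 = student is in group B
resources = rng.normal(0, 1, n) - 1.0 * group_b      # group B has fewer resources on average
score = 70 + 5 * resources + rng.normal(0, 3, n)     # scores depend on resources, not on group

# Regressing score on group alone shows a large apparent "group effect"...
print(LinearRegression().fit(group_b.reshape(-1, 1), score).coef_)

# ...which largely disappears once resources are controlled for.
print(LinearRegression().fit(np.column_stack([group_b, resources]), score).coef_)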
In a nutshell, if such results are used to develop interventions to close the gap in performance, then this is a good use of demographic variables in a model; but if the results are not well interpreted and are used to reinforce the ideology that group B students are naturally dumb or "have no brain", this is problematic.

So, I think the problem may not really be the inclusion or exclusion of demographic variables, but how the results of the model are used. I think demographic variables in models are useful for identifying problems in subpopulations, so that those subpopulations can be targeted with interventions to improve the situation in question.


Andrew Gibson

Sep 17, 2020, 12:12:46 AM
to s.buckin...@gmail.com, Learning Analytics
I think more fundamentally there are some false assumptions that underpin the problem. Here are a few to start with…

Success in putting measurements of the world into the model is a basis for putting the model to use in the world.
Accuracy of the model implies some heightened meaning for the phenomena that the model is meant to reflect.
Better training data results in better models.
More features result in better models.
The problem can be modelled computationally.
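To take just one of these, "more features result in better models": a quick synthetic check (a sketch only, not a general proof, with made-up data) shows how piling on uninformative features can make cross-validated accuracy worse rather than better.

# Rough sketch: adding uninformative features can hurt generalisation.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 200
signal = rng.normal(0, 1, (n, 2))                     # two genuinely informative features
y = (signal.sum(axis=1) + rng.normal(0, 1, n) > 0).astype(int)
noise = rng.normal(0, 1, (n, 200))                    # two hundred irrelevant features

acc_small = cross_val_score(LogisticRegression(max_iter=1000), signal, y, cv=5).mean()
acc_big = cross_val_score(LogisticRegression(max_iter=1000), np.hstack([signal, noise]), y, cv=5).mean()
print(acc_small, acc_big)   # the bloated feature set typically scores lower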


Andrew Gibson PhD
Lecturer | School of Information Systems | Science and Engineering Faculty
Queensland University of Technology (QUT)



David Quigley

Sep 17, 2020, 6:32:27 PM
to Learning Analytics
(I'll keep my post brief)

I think the ethicists in our community have been wrestling with this issue for a while. Simon, you spoke at length about research from the Open University - their work is built on the back of their "Ethical use of Student Data for Learning Analytics" policy documents - and I particularly turn to the 8 Principles when I'm trying to reflect on what should or should not go into models. I think Principle 7 is most relevant here. As described, it is more about choosing modelling techniques, but I think it may be worthwhile to re-examine this tenet in a new light. I worked with folks at Colorado State to develop similar principles; their parallel principle is "Principle 6: Learning Analytics data use arises from respect for the individual", which, at a glance, allows more freedom to engage with data on diverse factors. I know there are others in our community who have dealt with establishing these guiding principles at their institutions, and there are other such documents, but these are the two with which I am most familiar.

An interesting discussion!

David Quigley
Instructor - Department of Computer Science
Research Associate - Institute of Cognitive Science
University of Colorado Boulder

lorenzo vigentini

Sep 19, 2020, 2:20:05 AM
to Learning Analytics
hi everyone,

Thanks @Simon for sharing this. I feel the conversation has only scratched the surface, despite the very deep insights, and I'd like to add some reflections.

The purpose and explainability of predictive models are something we should all reflect on very carefully. As Vitomir noted, there is a tendency to turn observations that emerge from the descriptive value of models into causal inferences that carry prescriptive value, or that inform decision-making.
If we take a step back from the data, especially in universities or schools, there must be structures in place in organisations that enable open discussion, and inclusive support systems to deal with individual needs.

The fact that a model identifies a student as potentially 'at risk' is OK, as long as the assumptions are explained and shared with the student as the 'target' of the model (the explainability mentioned above).
In most cases we have the students' best interests at heart and want to help them succeed (good intentions, laudable purpose), but the nature and assumptions of the model may not have all the desired effects, and David's point about the ethical principles driving why we use data in the first place is a very important one. Students should be stakeholders in the process; this does not mean that they know better, but they should be able to make their own decisions. Sometimes students will feel that the prediction is incorrect, or they will point out themselves that there are other variables affecting the validity of the prediction which were not included in the model. LA models are not equivalent to the fuel gauge light in a car, which tells you that if you don't refill you will not reach your destination.

Andrew very succinctly pointed out many of the biases/problems with the problem itself, which I very much agree with.
"Better training data results in better models" in particular is one I'd like to pick on: what does better actually mean?
More data? More comprehensive data (i.e. including more features)? More representative data? (By definition, this makes it more biased towards the sample in question.)

There is a lot of work done in educational data mining focusing on feature selection (tapping into the issue of viability of computational or statistical models). 
For me this is a double-edged sword: on one side it is a rational, systematic and conscious process to identify the most relevant variables to explain what we try to measure (or to generate a prediction), but at the same time it removes the natural richness of variation in the sample, which obfuscates the clarity of predictions. The tradition of statistical inference and 'scientificity' has permeated the field of psychology for the last century and fundamentally boils down to a philosophical stance about the measurability of human characteristics and behaviours (and I'm sure this conversation could easily go off-track...).
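As a rough sketch of what that double-edged sword looks like in code (synthetic data, hypothetical features), a standard feature-selection step keeps only the variables most associated with the outcome and quietly discards the rest of the variation:

# Sketch of a typical feature-selection step and its trade-off.
# Synthetic data and hypothetical features, for illustration only.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (500, 20))                       # 20 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)

selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))             # indices of the features kept

# Everything not selected is discarded: the model gets simpler and easier to
# explain, but the variation carried by the dropped features is gone.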

In the end, I do not believe there is a fundamental issue with models being 'colour-blind': there are fundamental issues with how these models are explained and used, in the same way that we don't try to tighten a screw with a hammer. This means that the 'metadata' about the model, the assumptions, the detailed characterisation of the decisions made by the modeller (which are specific to the data used), and the actual reasons why the model was created in the first place should all be packaged and presented together. Yes, I know this is not what managers want to hear, as it will not give them a straight answer about what to do next, but I'm sure that a better understanding will save them a lot of the hassle that comes from oversimplifying the outcomes and the potential perception of bias that comes with it.
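A minimal sketch of what "packaging the metadata with the model" could look like in practice (the field names here are only a suggestion, not a standard):

# A minimal sketch of shipping a model together with its 'metadata'.
# The field names are only a suggestion, for illustration.
model_card = {
    "purpose": "flag students who may need extra support early in the term",
    "outcome_modelled": "non-submission of the first assignment",
    "features_used": ["lectures_watched", "forum_posts", "prior_gpa"],
    "features_excluded_and_why": {"ethnicity": "policy decision, documented separately"},
    "training_data": "historical cohorts from a single institution; not representative elsewhere",
    "known_limitations": ["correlational only", "assumptions chosen by the modeller"],
    "intended_use": "prompt a human adviser to reach out, never an automatic action",
}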

And of course, the self-fulfilling prophecy or vicious circle of certain factors affecting future outcomes is critical, but pretending that these factors are not there will invalidate the model because crucial pieces of the puzzle are missing!

What is more important is the action to be taken after the prediction is generated, and I strongly support the idea that individualisation is what we should focus on.


Lorenzo



Dr Lorenzo Vigentini (Cpsychol, AFMBPsS, FHEA, IEEE)

Academic Lead, Educational Intelligence & Analytics, PVC(Education) Portfolio

Adjunct Senior Lecturer to the School of Education and the School of Computer Science & Engineering. 

Research Associate of the Higher Education Academy

 

Level 10 Library Stage II, UNSW, Kensington 2052 Australia

THE UNIVERSITY OF NEW SOUTH WALES

SYDNEY NSW 2052 AUSTRALIA

t: +61 (2) 9385 6226


s.buckin...@gmail.com

Sep 30, 2020, 8:22:54 PM
to Learning Analytics
Hi all — just to acknowledge the great responses to this post, I simply haven't had time to digest and respond yet. But thank you :-)

Simon
