Hello! We have a very exciting session coming up next week, with Alina presenting some of her work on influence functions. Here are the details of the session:
Title: Generating stereotypes from implicitly hateful posts with Influence Functions
Abstract: Substantial progress has been made on detecting explicit forms of hate, while implicitly hateful posts containing, e.g., microaggressions and condescension, still pose a major challenge. In light of high error rates, explanations accompanying model decisions are especially important. Since implicit abuse cannot be put down to the use of an individual slur, but arises out of the wider sentence context, highlighting individual tokens as an explanation is of limited use.
In this paper, we generate full-text verbalisations of stereotypes that underlie implicitly hateful posts. We test the hypothesis that providing more context to the model - such as a small set of related samples - will lower the bar for generating the implied stereotype. For a given post, instance attribution methods, such as Influence Functions, are used to source similar examples from the training data. Then BART is trained to generate the underlying stereotype from an original input and its most similar neighbours.
When: 21st of March 2023 at 3pm CET
Best,
Aditya