Within the context of the CLARIAH-VL research infrastructure project, funded by the Research Foundation Flanders (FWO), the Centre for Computational Linguistics (CCL), part of the ComForT research unit at KU Leuven, seeks to hire a PhD student to carry out research on neural network architectures for linguistic and digital humanities applications.
Project
Recent machine learning methods based on neural transformer architectures have greatly improved the state of the art for a wide range of natural language processing applications, such as machine translation (Vaswani et al., 2017) and general natural language understanding (Devlin et al., 2019). In this project, the goal is to adapt and exploit transformer architectures as a means for the exploration and analysis of linguistic phenomena, and to explore their use for applications within the field of digital humanities.
Current state-of-the-art NLP systems are typically trained using self-supervision in the form of a masked language modeling objective. The transformer architecture receives as input a large corpus in which a certain proportion of tokens is replaced with a special mask symbol; the model's goal is then to predict the original tokens that have been masked out. In order to make the correct prediction, the model needs to pay close attention to the surrounding context; as such, it builds up precise contextualized representations for the tokens in the sentence. Once the model has been pre-trained on a large corpus using self-supervision, it is typically fine-tuned on a specific task using a limited amount of supervised training data. This two-stage setup generally leads to state-of-the-art performance on a wide range of NLP tasks, ranging from explicit linguistic annotation tasks (POS tagging, named entity recognition) to more general natural language understanding tasks (sentiment analysis, natural language inference, etc.), as tested by well-known benchmarks such as GLUE and SuperGLUE.
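By way of illustration, the following minimal sketch shows the masked language modeling objective in practice. It assumes the Hugging Face transformers library; the choice of library and of the pre-trained model bert-base-uncased is illustrative only and not prescribed by the project.

```python
# Minimal sketch of the masked language modeling objective.
# Assumes the Hugging Face `transformers` library; "bert-base-uncased"
# is an illustrative choice of pre-trained model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the token hidden behind the [MASK] symbol,
# relying on the surrounding context to do so.
for prediction in fill_mask("The committee [MASK] the proposal."):
    print(prediction["token_str"], round(prediction["score"], 3))
```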
There is converging evidence that self-supervised transformer architectures implicitly encode a wide range of linguistic knowledge, ranging from part-of-speech information and syntactic structure to coreference information (Peters et al., 2018; Hewitt et al., 2019; Linzen et al., 2019). We will investigate to what extent such implicit linguistic representations can be exploited as a tool for linguistic analysis. More specifically, we will investigate whether the linguistic information present in the models can be distilled for the purpose of similarity computations. Such a process would make it possible to automatically harvest a corpus of linguistically similar structures in order to support linguistic analyses. Moreover, as transformer architectures simultaneously encode syntactic and semantic information in their contextualized representations, this would allow one to automatically harvest syntactically disparate realizations of similar semantic content, providing an adequate means for a linguistic analysis of the syntax-semantics interface.
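As a sketch of the kind of similarity computation envisaged here, one could pool the contextualized token representations of a pre-trained encoder into a sentence vector and compare such vectors with cosine similarity. The library, the model name and the mean-pooling strategy below are assumptions for illustration, not design decisions of the project.

```python
# Sketch: deriving sentence representations from a pre-trained encoder and
# comparing them with cosine similarity. The model name and the mean-pooling
# strategy are illustrative assumptions, not project decisions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the contextualized token states into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)

# Syntactically different realizations of similar semantic content should
# end up close together in the representation space.
a = embed("The committee rejected the proposal.")
b = embed("The proposal was rejected by the committee.")
print(float(torch.nn.functional.cosine_similarity(a, b, dim=0)))
```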
The architecture described above makes use of a single-encoder setup: the input sentence is encoded into a contextualized representation, which can be exploited for classification tasks (either for the sentence as a whole, or for each token individually). If, on the other hand, one wants to predict a target output from a source input, one typically makes use of an encoder-decoder setup. Such a setup is typically used for machine translation (Vaswani et al., 2017), but it may equally well be used for monolingual sequence-to-sequence tasks, such as abstractive summarization or question answering (Lewis et al., 2020). We will also explore encoder-decoder transformer architectures for linguistic applications.
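A minimal sketch of such an encoder-decoder setup applied to a monolingual sequence-to-sequence task (abstractive summarization) is given below, again assuming the Hugging Face transformers library; the model facebook/bart-large-cnn is only an illustrative choice.

```python
# Sketch of an encoder-decoder transformer used for a monolingual
# sequence-to-sequence task (abstractive summarization). The model
# "facebook/bart-large-cnn" is an illustrative choice only.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Transformer encoders map an input sequence to contextualized "
    "representations; a decoder can then generate a target sequence "
    "conditioned on that encoding, as in translation or summarization."
)
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
```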
The project will mainly focus on linguistic applications, but the research will also be extended to applications within the field of digital humanities. Possible applications include semantic search and semantically oriented annotation.
Profile
For more information, please contact Tim Van de Cruys, mail: tim.van...@kuleuven.be.