Hello everyone,
I'm trying to implement the original GloVe paper in Python and I have a question about corpus preparation.
According to the paper: "We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words, and then construct a matrix of co-occurrence counts X."
What does "build a vocabulary of the 400,000 most frequent words" actually mean in practice? When building the co-occurrence matrix, should I delete tokens that are not in the 400K-word vocabulary from the corpus entirely, or should I keep them in place and still take them into account when counting co-occurrences?
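To make the question concrete, here is a sketch of my counting loop (again my own code, with `vocab` being the dict from the sketch above and `cooccurrence_counts` my own helper name; I'm using a symmetric window of 10 with 1/d weighting, which is what I understood the paper to use). The commented-out line is exactly what I'm unsure about: option A filters OOV tokens before windowing, which changes the distances, while option B keeps them in the sentence and just skips them as centers/contexts.

```python
from collections import defaultdict

def cooccurrence_counts(corpus, vocab, window=10):
    """Accumulate weighted co-occurrence counts X[(i, j)] for in-vocab pairs."""
    X = defaultdict(float)
    for sentence in corpus:
        # Option A: drop OOV tokens first, so distances are measured
        # over the filtered sentence.
        # sentence = [t for t in sentence if t in vocab]
        for center_pos, center in enumerate(sentence):
            if center not in vocab:
                continue  # Option B: keep OOV tokens in place, just never count them
            start = max(0, center_pos - window)
            for context_pos in range(start, center_pos):
                context = sentence[context_pos]
                if context not in vocab:
                    continue
                distance = center_pos - context_pos
                # Weight each pair by 1/d and fill both (i, j) and (j, i)
                X[(vocab[center], vocab[context])] += 1.0 / distance
                X[(vocab[context], vocab[center])] += 1.0 / distance
    return X
```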
I'm using CoNLL-2003 as the corpus. Should I remove stopwords? (The paper doesn't mention it, so I'm guessing no?)
I'm quite new to NLP, so please forgive me if the answer is obvious.
Thanks a lot,
Shangqian