Preparing corpus for GloVe implementation


Shangqian WU

Jan 8, 2023, 12:10:40 PM
to GloVe: Global Vectors for Word Representation
Hello everyone,

I'm trying to implement the original GloVe paper in Python and I have a question about the corpus preparation.

According to the paper: "We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words, and then construct a matrix of cooccurrence counts X."

What exactly does "build a vocabulary of the 400,000 most frequent words" mean in practice? When building the co-occurrence matrix, should I delete the words that are not in the 400K vocabulary from my corpus, or should I still take them into account when counting co-occurrences?
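
For concreteness, here is a rough Python sketch of what I'm doing right now. The window size, the 1/distance weighting, and the choice to drop out-of-vocabulary tokens before windowing are all my own guesses (the last one is exactly what I'm unsure about):

from collections import Counter, defaultdict

def build_vocab(tokens, max_size=400_000):
    """Keep only the max_size most frequent word types."""
    counts = Counter(tokens)
    return {w for w, _ in counts.most_common(max_size)}

def cooccurrence_counts(tokens, vocab, window=10):
    """Symmetric window counts, weighting each pair by 1/distance."""
    # My current guess: drop out-of-vocabulary tokens entirely before windowing.
    # The alternative would be to keep them as context positions but exclude
    # them as rows/columns of X -- that changes the effective distances.
    kept = [t for t in tokens if t in vocab]
    X = defaultdict(float)
    for i, word in enumerate(kept):
        for j in range(max(0, i - window), i):
            context = kept[j]
            X[(word, context)] += 1.0 / (i - j)
            X[(context, word)] += 1.0 / (i - j)
    return X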

I'm using CoNLL-2003 as the dataset. Should I remove stopwords from the corpus? (The paper doesn't mention it, so I'm guessing no?)

I'm quite new to NLP, so please forgive me if the answer seems obvious.

Thanks a lot,
Shangqian

sulman sarwar

Feb 4, 2023, 6:28:47 PM
to GloVe: Global Vectors for Word Representation
They built the vocabulary using the vocab_count tool; see https://github.com/stanfordnlp/GloVe/tree/master/src
They chose to cap the vocabulary at 400,000 words, but that cap is optional.
As for stop words, try a small sample with them removed versus kept and see what works best for you. In my experience, whether removing stop words helps depends on your particular problem. If you look at the output of vocab_count, you will see that the most frequent words are in fact mostly stop words, so experiment a little and see what works for you.
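
If it helps, here is a rough Python analogue of what vocab_count does (the real tool is written in C and streams tokens from stdin; the parameter names below are mine, loosely mirroring its min-count and max-vocab options):

from collections import Counter

def vocab_count_py(tokens, min_count=5, max_vocab=400_000):
    """Count word frequencies, drop rare words, and truncate to the
    max_vocab most frequent types, most frequent first."""
    counts = Counter(tokens)
    vocab = [(w, c) for w, c in counts.most_common() if c >= min_count]
    if max_vocab is not None:
        vocab = vocab[:max_vocab]
    return vocab

# Printing the first few entries of vocab_count_py(tokens) makes the point
# about stop words: the top of the list is almost always "the", "of", "and", ...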

Hope this helps.
