Jeff Winchell
Jul 7, 2023, 2:18:13 PM
to GloVe: Global Vectors for Word Representation
I am looking for confirmation of my suspicions below:
This does not appear to be documented, but the very commonly used pre-trained GloVe models require you to preprocess your text the same way the corpus was preprocessed, or you will not get embeddings for very common words like:
I'm
we're
you're
These words do not appear in the pre-trained files. (I checked the 840B-token file and could not find the word "we're" in it, though I did see tokens like 're, 'rE, and we.)
It seems the pre-trained GloVe models were created by first running the Stanford parser on the corpus (there are no docs on what switches were used). That tokenizer does not simply split on apostrophes, since the GloVe 840B-token file contains tokens like O'Brien, o'clock, C'mon, etc. I doubt there is ANY standard tokenizer besides the Stanford one that tokenizes text that way.
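To illustrate the behavior I mean: Penn Treebank-style tokenizers (like Stanford's PTBTokenizer) split trailing clitics such as 're, 'm, and n't off a word, while leaving internal apostrophes (O'Brien, o'clock) alone. Here is a minimal regex sketch of that splitting rule; this is my approximation of the preprocessing, not the actual Stanford tool or its switches:

```python
import re

# PTB-style clitic suffixes: n't is checked first so "can't" -> "ca" + "n't"
# rather than matching the 't branch. Case-insensitive to cover 'RE etc.
_NT = re.compile(r"(?i)^(.+?)(n't)$")
_CLITIC = re.compile(r"(?i)^(.+?)('m|'re|'s|'ll|'ve|'d)$")

def split_clitics(token):
    """Split a trailing contraction off a single token, PTB-style.

    Internal apostrophes (O'Brien, o'clock, C'mon) are left intact
    because the patterns only match an apostrophe clitic at the end.
    """
    for pattern in (_NT, _CLITIC):
        m = pattern.match(token)
        if m:
            return [m.group(1), m.group(2)]
    return [token]
```

Under a rule like this, "we're" would never appear as a single vocabulary entry; you would only ever see we and 're, which matches what I found in the 840B file.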