Jeff Winchell
Jul 7, 2023, 2:18:13 PM
to GloVe: Global Vectors for Word Representation
I am looking for confirmation of my suspicions below:
This does not appear to be documented, but the very commonly used pre-trained GloVe models require you to preprocess your text the same way the corpus was preprocessed, or you will not get embeddings for very common words like:
I'm
we're
you're
These words do not appear in the pre-trained files. (I checked the 840B-token file and could not find the word "we're" in it, though I did see tokens like 're, 'rE, and we.)
It seems the pre-trained GloVe models were created by first running the Stanford parser on the corpus (there are no docs on what switches were used). That tokenizer does not simply split on apostrophes, since the GloVe 840B-token file contains tokens like O'Brien, o'clock, C'mon, etc. I doubt there is ANY standard tokenizer besides the Stanford one that tokenizes text that way.
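To illustrate the behavior I mean: Penn Treebank-style tokenizers (like Stanford's PTBTokenizer) split trailing clitics such as 're, 'm, and n't off a word, while leaving internal apostrophes (O'Brien, o'clock) alone. Here is a minimal regex sketch of that splitting rule; this is my approximation of the preprocessing, not the actual Stanford tool or its switches:

```python
import re

# PTB-style clitic suffixes: n't is checked first so "can't" -> "ca" + "n't"
# rather than matching the 't branch. Case-insensitive to cover 'RE etc.
_NT = re.compile(r"(?i)^(.+?)(n't)$")
_CLITIC = re.compile(r"(?i)^(.+?)('m|'re|'s|'ll|'ve|'d)$")

def split_clitics(token):
    """Split a trailing contraction off a single token, PTB-style.

    Internal apostrophes (O'Brien, o'clock, C'mon) are left intact
    because the patterns only match an apostrophe clitic at the end.
    """
    for pattern in (_NT, _CLITIC):
        m = pattern.match(token)
        if m:
            return [m.group(1), m.group(2)]
    return [token]
```

Under a rule like this, "we're" would never appear as a single vocabulary entry; you would only ever see we and 're, which matches what I found in the 840B file.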