Need tokenizer/preprocessor for popular pretrained embedding models

Jeff Winchell

Jul 7, 2023, 3:35:55 PM
to Gensim
Having discovered the undocumented fact that common words like

I'm
we're
don't

etc. are OOV in the common GloVe pretrained models

(while words like o'clock are in the vocabulary, so you can't just split on apostrophes/single quotes),

seeing no docs beyond some vague references that the Stanford parser, with undocumented switches, MIGHT have been used to generate the common pretrained GloVe models,

and finding ZERO comments from Google about how they preprocessed the text behind Word2Vec's Google News pretrained model,

it seems to me that Gensim would do people a lot of good by shipping tokenizers matching each of its most popular included pretrained models, so that users write NLP programs that speak the same language as their models rather than comparing apples to oranges.
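To make the mismatch concrete, here's a minimal sketch using gensim's downloader API. I'm using glove-wiki-gigaword-100 (one of the bundled GloVe variants; other variants may behave differently, and these vectors are lowercased, so I check lowercase forms):

```python
import gensim.downloader as api

# Load one of the bundled pretrained GloVe models (a sizeable download).
glove = api.load("glove-wiki-gigaword-100")

# Naive surface forms vs. the PTB-split pieces the model may contain instead.
for token in ["i'm", "we're", "don't", "o'clock", "'m", "'re", "n't"]:
    print(f"{token!r:>10} in vocab: {token in glove.key_to_index}")
```

If the observation above holds, the whole contractions print False while o'clock and the split pieces like 'm and n't print True, which is exactly the apples-to-oranges problem: a user who feeds whitespace-split text into these vectors silently loses some of the most frequent words in English.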

Gordon Mohr

Jul 19, 2023, 2:41:50 PM
to Gensim
I made a feature-request item in our issue-tracker for this – https://github.com/RaRe-Technologies/gensim/issues/3485 – as similar requests have come up before. 

I also added some comments there: that it'd be useful, but also never exactly right, and might encourage further over-reliance on such pretrained vectors even when they're not truly the best choice for many projects.
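For anyone who wants a stopgap, here's a rough sketch, and only a sketch, since the exact Stanford preprocessing is undocumented: NLTK's TreebankWordTokenizer plus lowercasing reproduces the PTB-style contraction splits that show up in the GloVe vocabularies.

```python
from nltk.tokenize import TreebankWordTokenizer

_ptb = TreebankWordTokenizer()

def glove_style_tokenize(text):
    # PTB-style splitting: "don't" -> ["do", "n't"], "I'm" -> ["i", "'m"],
    # while "o'clock" survives as a single token. Lowercasing matches the
    # case-folded glove-wiki-gigaword vocabularies. An approximation only:
    # this is not the actual pipeline Stanford used.
    return [tok.lower() for tok in _ptb.tokenize(text)]

print(glove_style_tokenize("I'm sure we don't leave before one o'clock."))
# Roughly: ['i', "'m", 'sure', 'we', 'do', "n't", 'leave', 'before',
#           'one', "o'clock", '.'] (exact output may vary by NLTK version)
```

Even so, that says nothing about how the Google News corpus was preprocessed for Word2Vec, which is why I'd treat any bundled tokenizer as a best-effort match rather than ground truth.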

- Gordon