Need tokenizer/preprocessor for popular pretrained embeddings models
Jeff Winchell
Jul 7, 2023, 3:35:55 PM
to Gensim
Having discovered the undocumented fact that common words like "I'm", "we're", "don't", etc. are OOV in the common GloVe pretrained models (while words like "o'clock" are in-vocabulary, so you can't just split on apostrophes/single quotes), and seeing no docs except some vague references that the Stanford parser, with undocumented switches, MIGHT have been used to generate the common pretrained GloVe models, and finding ZERO comments from Google about how they preprocessed the text used for Word2Vec's Google News pretrained model, it seems to me that Gensim would do people a lot of good by shipping tokenizers matching each of its most popular included pretrained models, so that users write NLP programs that speak the same language as their models rather than comparing apples to oranges.
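For illustration, the pattern described above (contractions split off, lexical apostrophes like "o'clock" left intact) is what PTB-style tokenizers such as Stanford's produce. Here's a minimal regex sketch of that contraction-splitting convention; whether the published GloVe vectors were built with exactly these rules is, as noted, undocumented, so treat this as an approximation rather than the actual preprocessing pipeline:

```python
import re

# PTB-style contraction clitics: "don't" -> "do n't", "we're" -> "we 're".
# This list of suffixes is an assumption based on Penn Treebank conventions,
# not a confirmed reproduction of the GloVe preprocessing.
CONTRACTION_RE = re.compile(r"(?i)\b(\w+)(n't|'re|'ve|'ll|'d|'s|'m)\b")

def ptb_tokenize(text):
    """Split contraction clitics off their host word, PTB-style,
    while leaving word-internal apostrophes (e.g. "o'clock") alone."""
    text = CONTRACTION_RE.sub(r"\1 \2", text)
    # Lowercasing matches the uncased GloVe releases; the cased
    # Common Crawl vectors would skip this step.
    return text.lower().split()

print(ptb_tokenize("I'm sure we don't"))  # ['i', "'m", 'sure', 'we', 'do', "n't"]
print(ptb_tokenize("five o'clock"))       # ['five', "o'clock"]
```

With a tokenizer like this, the split-off pieces ("'m", "n't", "'re") are the units you'd look up in the vectors, which is why naive whitespace or apostrophe splitting produces so many spurious OOVs.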
Gordon Mohr
Jul 19, 2023, 2:41:50 PM
I also added some comments there: such tokenizers would be useful, but also never exactly right, and might encourage further over-reliance on such pretrained vectors even when they're not truly the best choice for many projects.