Pretrained Word2Vec models (and not just embeddings)

Michal Kosinski

Aug 3, 2020, 9:01:49 PM
to Gensim
Hi All,

I am looking for pre-trained Word2Vec models (not just word embeddings) that can be used with the model.predict_output_word() method. Are these available anywhere, or do I need to train a model from scratch to be able to use this method?

Thank you so much for your help!

Michal

(For clarity: I am not looking for word vectors that can be loaded using gensim.models.KeyedVectors.load() but full models that can be loaded using gensim.models.Word2Vec.load())
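
(E.g., what I would like to be able to do; `some_model` stands in for a hypothetical downloadable file:)

    from gensim.models import Word2Vec

    # A full model keeps the internal weights that predict_output_word()
    # needs; KeyedVectors alone do not.
    model = Word2Vec.load("some_model")  # hypothetical filename

    # Rank likely words given a list of context words.
    print(model.predict_output_word(["he", "is"], topn=10))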

Radim Řehůřek

Aug 4, 2020, 7:49:12 AM
to Gensim
Hi Michal,

that's a reasonable request but I'm not aware of any repositories with pretrained "full word2vec models" – as opposed to only the "final vectors".

What kind of models do you need? (trained on what data)
If it's "anything public", then training word2vec on e.g. Wikipedia or US patents is reasonably easy:
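
A minimal sketch, assuming a file of pre-tokenized sentences, one per line (`wiki_sentences.txt` is a placeholder name):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    corpus = LineSentence("wiki_sentences.txt")  # placeholder corpus file

    # The defaults use negative sampling, which predict_output_word() requires.
    # (In Gensim releases before 4.0, the `vector_size` parameter is called `size`.)
    model = Word2Vec(corpus, vector_size=300, window=5, min_count=5, workers=4)

    model.save("wiki.model")  # a full model, reloadable via Word2Vec.load()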

HTH,
Radim

Gordon Mohr

Aug 4, 2020, 2:20:42 PM
to Gensim
If your true goal is word-prediction, or perhaps more expansively text-generation, word2vec is unlikely to be a good way to do it.

The word2vec algorithm uses only a fragmentary, approximate, constrained word-prediction task as a training goal, in order to create word-vectors that happen to be interesting for non-word-prediction uses. (The model might still be pretty bad at word-prediction at the end of training, yet yield useful word-vectors.)

Also, the `predict_output_word()` method has quite a few caveats. It's very slow, it doesn't apply the same relative weighting of words by distance or frequency that's applied during training, and it only works for negative-sampling models (not hierarchical-softmax). And even if the model was trained using skip-gram, its prediction is more akin to CBOW mode, though perhaps that is the best that can be grafted on to get one set of ranked predictions from a list of context words.
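
For instance, on a hierarchical-softmax model the method isn't supported at all (a tiny sketch with toy data):

    from gensim.models import Word2Vec

    toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]] * 50

    # hs=1, negative=0 gives a hierarchical-softmax-only model...
    hs_model = Word2Vec(toy_corpus, hs=1, negative=0, min_count=1)
    try:
        hs_model.predict_output_word(["the", "sat"])
    except RuntimeError as err:
        print(err)  # ...on which prediction raises an error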

So, I would not rely on that method for anything more rigorous than an art project. 

- Gordon

Michal Kosinski

Aug 4, 2020, 2:31:50 PM
to Gensim

Dear Radim,

Many thanks for your response. I am interested in studying modern language models to further our understanding of human psychology. Modern language models such as word2vec encode the language patterns of thousands or millions of individuals. Thus, I think, instead of asking individuals about their opinions, thoughts, or feelings (in the context of an online or a lab study), one could interview/consult the language model.

In the context of studying bias, for example, instead of asking people about their perceptions of young vs. old, or men vs. women (while trying to find a way of circumventing people’s tendency to hide their biases), one could examine a model derived from language samples collected in the wild, as in this Science paper: https://science.sciencemag.org/content/356/6334/183. Or one could study gender biases encoded in language by examining the probability distributions over words predicted to finish a sentence such as “She is…” vs. “He is…”.

To answer your question about the kind of models needed, I think the most useful ones would be derived from language used spontaneously in the wild to describe people’s own thoughts, feelings, opinions, etc. (e.g., Twitter, blogs, websites). Interesting insights could also be derived from language used to describe others / society / politics (e.g., Google News or Wiki datasets).

I can definitely train the models myself, but I would rather examine a range of existing, state-of-the-art models. This would not only enable others to easily replicate and build on my research, but would also minimize the chances that my findings are biased by the particular approach I used.
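
(For concreteness, the kind of probe I have in mind, assuming a full pre-trained model were available; `wiki.model` is a hypothetical filename:)

    from gensim.models import Word2Vec

    model = Word2Vec.load("wiki.model")  # hypothetical pre-trained full model

    # Compare the ranked completions for two minimal-pair prompts.
    for prompt in (["she", "is"], ["he", "is"]):
        print(prompt, model.predict_output_word(prompt, topn=10))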

 

Warm wishes, Michal

Michal Kosinski

Aug 4, 2020, 2:33:54 PM
to Gensim
Dear Gordon,

Thank you so much for your comment. What model/approach would you suggest instead?

Best wishes,
Michal

Gordon Mohr

Aug 4, 2020, 5:46:12 PM
to Gensim
Having seen the details in your other response: simply checking word-vector-to-word-vector similarities may be as useful as, or more useful than, attempting a context-prediction that isn't even quite what the model is actually trained to do.
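
For example (a sketch, with `wiki.model` standing in for whatever full model or vectors you have; the probe words are arbitrary):

    from gensim.models import Word2Vec

    wv = Word2Vec.load("wiki.model").wv  # the word-vectors (KeyedVectors) part

    # Direct vector-to-vector similarity, with no prediction step involved.
    print(wv.similarity("she", "intelligent"))
    print(wv.similarity("he", "intelligent"))
    print(wv.most_similar("she", topn=10))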

A simple Markov model, or one of the deep text-generation models (GPT-* & similar), could also complete your prompts. Or you could simply search large corpora for an exact count of how many times different words finish certain exact prompts, rather than relying on a prediction filtered through a compressed model that has (by design) thrown away lots of the original corpus info, and whose final state is influenced by many uses of randomization during the training process.

Word2vec models begin with random initialization, & use other random sampling during training, so many of the subtle differences between models, or word-rank-orders within final models, will change via random jitter from run to run. So drawing too many conclusions from any one model, especially a "single draw" from someone else's training with not-fully-documented processing/data (as with the 2013 `GoogleNews` vectors), risks chasing ghosts.
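
(For the exact-count approach, a trivial sketch, assuming a plain-text corpus in a placeholder file `corpus.txt`:)

    import re
    from collections import Counter

    prompt = ("she", "is")
    counts = Counter()
    with open("corpus.txt", encoding="utf8") as f:  # placeholder corpus file
        for line in f:
            tokens = re.findall(r"[a-z']+", line.lower())
            for i in range(len(tokens) - len(prompt)):
                if tuple(tokens[i:i + len(prompt)]) == prompt:
                    counts[tokens[i + len(prompt)]] += 1  # the word that follows

    print(counts.most_common(10))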

I had some thoughts on a similar query on StackOverflow (https://stackoverflow.com/questions/61736874/how-to-compare-cosine-similarities-across-three-pretrained-models/61741677#61741677), including a warning there, in the last few paragraphs, about how some of the early observations about "bias in word-vectors" overclaimed the effect, because they didn't notice that the standard analogy-solving routines (such as in Google's original code release or Gensim) were automatically skipping potential answer-words that were part of the query.
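
You can see that skipping directly in Gensim (a sketch, using the widely-distributed `GoogleNews-vectors-negative300.bin` file):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # most_similar() silently excludes the query words themselves, so e.g.
    # "she" can never appear in these results, even if it were the nearest:
    print(wv.most_similar(positive=["she", "doctor"], negative=["he"], topn=5))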

- Gordon