Cannot Reproduce Word2Vec Pearson R Value


E

Oct 28, 2021, 6:29:05 PM
to Gensim
Very much a newbie to Gensim and stats in general. I'm running into an issue with reproducing a Pearson R value for Word2Vec word similarity scores compared to human scores (details attached).

Gensim's API reports a Pearson R of roughly 0.54035, but calculating it independently on the same data gives 0.7391. Oddly, I have no trouble independently reproducing Gensim's R value for GloVe.

Would anyone have any thoughts on why I'm seeing this discrepancy? Also, any tips for better presenting my question/work are welcome.


PearsonQuestion.docx

Andrey Kutuzov

Oct 28, 2021, 8:07:56 PM
to gen...@googlegroups.com
Hi E,

The word2vec-google-news-300 model contains words in lower-case,
UPPER-CASE, Title-case, etc.
By default, Gensim's evaluate_word_pairs() method is case-insensitive
(that is, it converts all the input test words to upper-case before
looking them up). With your list of words, this introduces unwanted
distortions.

Just call `evaluate_word_pairs()` with `case_insensitive=False`. If some
words in your word list are extremely rare (but still present in the
model), you might also want to increase the `restrict_vocab` parameter,
which limits the evaluation to the most frequent words and is set to
300000 by default.
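
To make the case-folding effect concrete, here is a toy sketch in plain Python. It is not Gensim's code: the vocabulary, vectors, human scores, and the helper names (`build_lookup`, `evaluate`) are all made up for illustration, and the lookup only roughly mimics the idea that, after upper-casing, the most frequent spelling of a word wins.

```python
import math

# Hypothetical toy vocabulary, most frequent first. Vectors are invented.
VOCAB = [
    ("Apple", [0.9, 0.1, 0.0]),   # e.g. the company
    ("apple", [0.1, 0.9, 0.2]),   # e.g. the fruit
    ("pear", [0.0, 1.0, 0.3]),
    ("Jaguar", [0.8, 0.0, 0.5]),  # e.g. the car brand
    ("jaguar", [0.0, 0.3, 0.9]),  # e.g. the animal
    ("tiger", [0.1, 0.2, 1.0]),
    ("car", [0.7, 0.1, 0.4]),
]

# Hypothetical (word1, word2, human similarity score) test pairs.
PAIRS = [
    ("apple", "pear", 8.0),
    ("jaguar", "tiger", 8.5),
    ("apple", "tiger", 2.0),
    ("pear", "car", 1.0),
    ("jaguar", "car", 2.5),
]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def build_lookup(vocab, case_insensitive):
    lookup = {}
    # Walk from least to most frequent so the most frequent spelling
    # overwrites rarer ones that fold to the same upper-case key.
    for word, vec in reversed(vocab):
        lookup[word.upper() if case_insensitive else word] = vec
    return lookup


def evaluate(pairs, vocab, case_insensitive):
    lookup = build_lookup(vocab, case_insensitive)
    human, model = [], []
    for a, b, score in pairs:
        if case_insensitive:
            a, b = a.upper(), b.upper()
        if a not in lookup or b not in lookup:
            continue  # skip pairs with out-of-vocabulary words
        human.append(score)
        model.append(cosine(lookup[a], lookup[b]))
    return pearson(human, model)


r_sensitive = evaluate(PAIRS, VOCAB, case_insensitive=False)
r_folded = evaluate(PAIRS, VOCAB, case_insensitive=True)
print(f"case-sensitive r: {r_sensitive:.4f}")
print(f"case-folded r:    {r_folded:.4f}")
```

In this toy setup, "apple" and "jaguar" resolve to the vectors for "Apple" and "Jaguar" once everything is upper-cased, so the model similarities no longer track the human scores and the correlation drops. A discrepancy like the one in the question could arise the same way if the independent calculation looks words up exactly as written.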


--
Solve et coagula!
Andrey

E

Oct 29, 2021, 11:28:11 AM
to Gensim
Many kind thanks for this insight, Andrey! I was able to resolve the discrepancy.