Hi E,
The word2vec-google-news-300 model contains words in lower-case,
UPPER-CASE, Title-case, etc.
By default, Gensim's evaluate_word_pairs() method is case-insensitive
(that is, it converts all words to upper-case before looking them up,
so case variants of a word collapse onto a single vector). With your
list of words, this introduces unwanted distortions.
Just pass `case_insensitive=False` to `evaluate_word_pairs()`. If some
words in your word list are extremely rare (but still present in the
model), you might also want to increase the `restrict_vocab` parameter,
which defaults to 300000: only the first 300000 words in the model's
vocabulary are considered during evaluation.
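To make the two effects concrete, here is a rough pure-Python sketch. It
is not Gensim's actual code: the toy vocabulary, the 2-d vectors and the
helper names `build_lookup` and `find` are all made up for illustration.
It only assumes what is described above, namely that words are
upper-cased before lookup and that only the first `restrict_vocab`
entries are considered.

```python
# Toy vocabulary ordered from most to least frequent.
# Words and 2-d vectors are invented for illustration only.
TOY_MODEL = [
    ("Apple", (0.9, 0.1)),
    ("apple", (0.1, 0.9)),
    ("pear", (0.2, 0.8)),
    ("quince", (0.3, 0.7)),  # rare word at the end of the vocabulary
]


def build_lookup(model, restrict_vocab, case_insensitive):
    """Return a word -> vector mapping for the first `restrict_vocab` entries."""
    lookup = {}
    for word, vector in model[:restrict_vocab]:
        key = word.upper() if case_insensitive else word
        # With case folding, variants collide; the first (most frequent) one wins.
        lookup.setdefault(key, vector)
    return lookup


def find(lookup, word, case_insensitive):
    """Look up a test word; None means it would be treated as unknown."""
    key = word.upper() if case_insensitive else word
    return lookup.get(key)


folded = build_lookup(TOY_MODEL, restrict_vocab=3, case_insensitive=True)
exact = build_lookup(TOY_MODEL, restrict_vocab=3, case_insensitive=False)
larger = build_lookup(TOY_MODEL, restrict_vocab=4, case_insensitive=False)

print(find(folded, "apple", True))    # (0.9, 0.1): the vector stored for "Apple"
print(find(exact, "apple", False))    # (0.1, 0.9): the vector stored for "apple"
print(find(exact, "quince", False))   # None: outside the first 3 entries
print(find(larger, "quince", False))  # (0.3, 0.7): found once the limit is raised
```

With folding on, "apple" silently gets the vector of "Apple", so any
similarity computed from it differs from what you get when you look up
the exact word yourself.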
On 29.10.2021 00:29, E wrote:
> Very much a newbie to Gensim and stats in general. I'm running into an
> issue with reproducing a Pearson R value for Word2Vec word similarity
> scores compared to human scores (details attached).
>
> Gensim's API provides a Pearson R of roughly 0.54035, but independently
> calculating for the same provides 0.7391. Oddly, I have no issue
> independently reproducing the same R value for GloVe.
>
> Would anyone have any thoughts on why I'm seeing this discrepancy? Also,
> any tips for better presenting my question/work are welcome.
--
Solve et coagula!
Andrey