Get number of tokens a word2vec model was trained with

Phil

Nov 3, 2021, 10:37:43 AM
to Gensim
I have some gensim.models.Word2Vec models saved into files.

After loading the models from file, how can I get the number of tokens these models were trained with?

Is this information saved in the file?

---------

It looks like the `train` function of Word2Vec returns the number of tokens (https://stackoverflow.com/a/59928416/1719931), although the return value seems undocumented (https://radimrehurek.com/gensim/models/word2vec.html).

I used the constructor directly to train it.

Is the number of tokens retrievable after training by calling the constructor?

Gordon Mohr

Nov 3, 2021, 1:32:22 PM
to Gensim
Unless you call `.train()` yourself, those return values – the total count of raw words considered by all training epochs, & the count of words (in-vocabulary & not down-sampled) actually trained on – aren't saved anywhere.
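(If you do want those return values, the usual pattern is the explicit two-step construction, roughly like the sketch below. Here `sentences` is just a made-up toy corpus, and, if I recall the current return order correctly, the 1st value is the effective trained-word count and the 2nd the raw word count.)

```
from gensim.models import Word2Vec

# Made-up toy corpus: a list of token lists.
sentences = [["hello", "world"], ["hello", "gensim"]]

# Explicit two-step construction, so .train()'s return value can be captured.
model = Word2Vec(vector_size=50, min_count=1)
model.build_vocab(sentences)
trained_word_count, raw_word_count = model.train(
    sentences,
    total_examples=model.corpus_count,
    epochs=model.epochs,
)
print(trained_word_count, raw_word_count)
```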

The model will cache the total number of words seen during the last `build_vocab()` in `model.total_words`. You could also sum the effective word-counts for all known words:

     tally_all_known_words = sum(model.wv.get_vecattr(word, 'count') for word in model.wv.index_to_key)

- Gordon

Phil

Nov 4, 2021, 11:25:47 AM
to Gensim
Thanks for your answer.
1. I have tried `model.total_words` but the attribute does not seem to exist:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("mymodel.model")
>>> model.total_words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Word2Vec' object has no attribute 'total_words'

2.
```
tally_all_known_words = sum(model.wv.get_vecattr(word, 'count') for word in model.wv.index_to_key)
```

Gives a result slightly lower than the count of tokens in all the documents (manual count after the parser). About 0.7% lower.

Just out of curiosity, is there a reason for this slightly lower count?

Best regards,
Phil

Gordon Mohr

Nov 5, 2021, 12:17:57 AM
to Gensim
On Thursday, November 4, 2021 at 8:25:47 AM UTC-7 Phil wrote:
Thanks for your answer.
1. I have tried `model.total_words` but the attribute does not seem to exist:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("mymodel.model")
>>> model.total_words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Word2Vec' object has no attribute 'total_words'


Oops, I misspoke. The attribute is `model.corpus_total_words`. (It's only called `total_words` as a local variable before being assigned into `.corpus_total_words`.)
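For your saved model above, something along these lines should work:

```
from gensim.models import Word2Vec

# Reload the previously-saved model.
model = Word2Vec.load("mymodel.model")

# Raw count of corpus words seen by the last build_vocab().
print(model.corpus_total_words)
```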

2.
```
tally_all_known_words = sum(model.wv.get_vecattr(word, 'count') for word in model.wv.index_to_key)
```

Gives a result slightly lower than the count of tokens in all the documents (manual count after the parser). About 0.7% lower.

Just out of curiosity, is there a reason for this slightly lower count?

My 1st guess would be that the tally of counts will only be for words that survived the `min_count` cutoff, and your separate count might not apply that. If that's not it, perhaps you're not counting tokens in the *exact* same form of the corpus as was passed to the model (as part of the constructor or `.build_vocab()`).

If you take the *exact* same corpus you're counting manually, and create a *new* `Word2Vec` instance, with the same parameters, then pass *that* corpus to the new model's `.build_vocab()`, is there still a tally mismatch? If so, you could do a separate by-word survey – say, using the Python `Counter` class to tally all the words in your reference corpus – then check if the sum of all word counts in that new survey is more like the earlier model or your manual count.

If this new count disagrees with the older model, you can drill into individual word counts to see which differ, which might hint at the source of the discrepancy. (Are the words with different counts rare words? Exceptional in some other way? etc.)
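As a rough sketch (the `iter_corpus()` reader and the `corpus.txt` filename are just placeholders for however you actually read & tokenize your corpus), the comparison could look something like:

```
from collections import Counter
from gensim.models import Word2Vec

# The loaded model, or the freshly-built model from the step above.
model = Word2Vec.load("mymodel.model")

# Placeholder reader: replace with however the corpus is actually read/tokenized.
def iter_corpus():
    with open("corpus.txt") as fin:
        for line in fin:
            yield line.split()

# Independent per-word tally of the reference corpus.
survey = Counter()
for tokens in iter_corpus():
    survey.update(tokens)

# Per-word counts retained by the model (only words surviving min_count).
model_counts = {
    word: model.wv.get_vecattr(word, 'count')
    for word in model.wv.index_to_key
}

# Words whose counts disagree between the survey & the model.
mismatches = {
    word: (survey[word], count)
    for word, count in model_counts.items()
    if survey[word] != count
}

print(sum(survey.values()), sum(model_counts.values()), len(mismatches))
```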

- Gordon