Number of effective words does not match model.wv.vectors? (Doc2Vec)


Felix Forge

Nov 18, 2019, 12:53:28 PM
to Gensim

I'm training a doc2vec model:


Now I'm wondering why model.wv.vectors has a shape of (44488, 300), while the log reports 101435 effective words/s.

If the total number of unique words were 101435, shouldn't there be 101435 word-vectors, i.e. shouldn't model.wv.vectors have shape (101435, 300)?


Or is the number of effective words/s not the number of unique words? (For model.docvecs.vectors_docs, the size is exactly my (number of documents, features).)



from multiprocessing import cpu_count

import gensim

num_workers = cpu_count()
epochs = 5  # `epochs` (the old `iter` param in word2vec) already defaults to 5

model = gensim.models.Doc2Vec(vector_size=300,
                              dm=0, min_count=1,
                              dbow_words=1,
                              alpha=0.025,
                              workers=num_workers,
                              window=10,
                              # hs=hs_param
                              )

# corp: iterable of TaggedDocument (built elsewhere)
model.train(corp, total_examples=model.corpus_count, epochs=model.epochs)



2019-11-18 18:37:19,674 : WARNING : Effective 'alpha' higher than previous training cycles

2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10

2019-11-18 18:37:20,719 : INFO : EPOCH 1 - PROGRESS: at 0.17%

...

2019-11-18 17:46:37,074 : INFO : EPOCH 5 - PROGRESS: at 99.86% examples, 98515 words/s, in_qsize 0, out_qsize 0

2019-11-18 17:46:37,815 : INFO : worker thread finished; awaiting finish of 7 more threads

2019-11-18 17:46:37,827 : INFO : worker thread finished; awaiting finish of 6 more threads

2019-11-18 17:46:37,832 : INFO : worker thread finished; awaiting finish of 5 more threads

2019-11-18 17:46:37,843 : INFO : worker thread finished; awaiting finish of 4 more threads

2019-11-18 17:46:37,976 : INFO : worker thread finished; awaiting finish of 3 more threads

2019-11-18 17:46:38,038 : INFO : worker thread finished; awaiting finish of 2 more threads

2019-11-18 17:46:38,162 : INFO : EPOCH 5 - PROGRESS: at 99.99% examples, 98488 words/s, in_qsize 1, out_qsize 1

2019-11-18 17:46:38,163 : INFO : worker thread finished; awaiting finish of 1 more threads

2019-11-18 17:46:38,181 : INFO : worker thread finished; awaiting finish of 0 more threads

2019-11-18 17:46:38,183 : INFO : EPOCH - 5 : training on 86187639 raw words (65876679 effective words) took 668.8s, 98496 effective words/s

2019-11-18 17:46:38,185 : INFO : training on a 430938195 raw words (329373877 effective words) took 3247.1s, 101435 effective words/s

Gordon Mohr

Nov 18, 2019, 1:53:27 PM
to Gensim
The `effective words/s` number reported at the end should be understood as "effective words per second" – a measure of the rate of training progress over the original corpus, counting all word occurrences that survived both `min_count` enforcement and `sample`-controlled downsampling. (This rate-of-progress value is just the "329373877 effective words" reported earlier on the same log line, divided by the "3247.1s" also reported there.)
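That division can be checked directly with the two numbers from the summary log line (the last digit can differ slightly because the log rounds the elapsed time to one decimal):

```python
# Numbers copied from the final summary log line
effective_words = 329_373_877   # "329373877 effective words"
elapsed_s = 3247.1              # "took 3247.1s"

rate = effective_words / elapsed_s
print(round(rate))  # ~101435; matches the "101435 effective words/s" in the log
```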

The actual number of unique words is still just 44,488, as per the earlier log line you've shown:

> 2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10


(There should be other even earlier log-lines that will also have described the effect of `min_count` on the surviving vocabulary of 44,488 words.)
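To make the distinction concrete, here is a minimal pure-Python sketch with a toy corpus and made-up numbers (not the actual run): the rows of `model.wv.vectors` correspond to surviving *unique* words, while "effective words" tallies every surviving word *occurrence*:

```python
from collections import Counter

# Hypothetical toy corpus, just to illustrate the two different counts
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "cat"]]

raw_words = sum(len(doc) for doc in corpus)              # like "raw words" in the log
counts = Counter(w for doc in corpus for w in doc)

min_count = 2
vocab = {w for w, c in counts.items() if c >= min_count}  # surviving unique words
# rows in model.wv.vectors == len(vocab), NOT the effective-word tally

effective_words = sum(1 for doc in corpus for w in doc if w in vocab)
# (real training also randomly drops some occurrences via `sample` downsampling)

elapsed_s = 2.0  # made-up timing
rate = effective_words / elapsed_s  # this is the "effective words/s" in the log
print(raw_words, len(vocab), effective_words, rate)  # prints: 9 3 7 3.5
```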


- Gordon 
