Number of effective words does not match model.wv.vectors? (Doc2Vec)


Felix Forge

Nov 18, 2019, 12:53:28 PM
to Gensim

I'm training a doc2vec model:


Now I'm wondering why model.wv.vectors has a shape of (44488, 300), while the log reports 101435 effective words/s.

If the total number of unique words were 101435, shouldn't there be 101435 word-vectors, i.e. shouldn't model.wv.vectors have shape (101435, 300)?


Or is the number of effective words/s not the number of unique words? (For model.docvecs.vectors_docs, the size is exactly my (number of documents, features).)



from multiprocessing import cpu_count

import gensim

num_workers = cpu_count()
epochs = 5  # `epochs` (the old `iter` param in word2vec) already defaults to 5

model = gensim.models.Doc2Vec(vector_size=300,
                              dm=0, min_count=1,
                              dbow_words=1,
                              alpha=0.025,
                              workers=num_workers,
                              window=10,
                              # hs=hs_param
                              )

# corp: iterable of TaggedDocument (built elsewhere)
model.train(corp, total_examples=model.corpus_count, epochs=model.epochs)



2019-11-18 18:37:19,674 : WARNING : Effective 'alpha' higher than previous training cycles

2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10

2019-11-18 18:37:20,719 : INFO : EPOCH 1 - PROGRESS: at 0.17%

...

2019-11-18 17:46:37,074 : INFO : EPOCH 5 - PROGRESS: at 99.86% examples, 98515 words/s, in_qsize 0, out_qsize 0

2019-11-18 17:46:37,815 : INFO : worker thread finished; awaiting finish of 7 more threads

2019-11-18 17:46:37,827 : INFO : worker thread finished; awaiting finish of 6 more threads

2019-11-18 17:46:37,832 : INFO : worker thread finished; awaiting finish of 5 more threads

2019-11-18 17:46:37,843 : INFO : worker thread finished; awaiting finish of 4 more threads

2019-11-18 17:46:37,976 : INFO : worker thread finished; awaiting finish of 3 more threads

2019-11-18 17:46:38,038 : INFO : worker thread finished; awaiting finish of 2 more threads

2019-11-18 17:46:38,162 : INFO : EPOCH 5 - PROGRESS: at 99.99% examples, 98488 words/s, in_qsize 1, out_qsize 1

2019-11-18 17:46:38,163 : INFO : worker thread finished; awaiting finish of 1 more threads

2019-11-18 17:46:38,181 : INFO : worker thread finished; awaiting finish of 0 more threads

2019-11-18 17:46:38,183 : INFO : EPOCH - 5 : training on 86187639 raw words (65876679 effective words) took 668.8s, 98496 effective words/s

2019-11-18 17:46:38,185 : INFO : training on a 430938195 raw words (329373877 effective words) took 3247.1s, 101435 effective words/s

Gordon Mohr

Nov 18, 2019, 1:53:27 PM
to Gensim
The `effective words/s` number reported at the end should be understood as "effective words per second" – a measure of the rate of training progress over the original corpus, counting all word occurrences that survived both `min_count` enforcement and `sample`-controlled downsampling. (This rate-of-progress value is just the "329373877 effective words" reported earlier on the same log line, divided by the "3247.1s" also reported there.)
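That division can be checked directly with the two numbers from the summary log line (the last digit can differ slightly because the log rounds the elapsed time to one decimal):

```python
# Numbers copied from the final summary log line
effective_words = 329_373_877   # "329373877 effective words"
elapsed_s = 3247.1              # "took 3247.1s"

rate = effective_words / elapsed_s
print(round(rate))  # ~101435; matches the "101435 effective words/s" in the log
```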

The actual number of unique words is still just 44,488, as per the earlier log line you've shown:

> 2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10


(There should be other even earlier log-lines that will also have described the effect of `min_count` on the surviving vocabulary of 44,488 words.)
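To make the distinction concrete, here is a minimal pure-Python sketch with a toy corpus and made-up numbers (not the actual run): the rows of `model.wv.vectors` correspond to surviving *unique* words, while "effective words" tallies every surviving word *occurrence*:

```python
from collections import Counter

# Hypothetical toy corpus, just to illustrate the two different counts
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "cat"]]

raw_words = sum(len(doc) for doc in corpus)              # like "raw words" in the log
counts = Counter(w for doc in corpus for w in doc)

min_count = 2
vocab = {w for w, c in counts.items() if c >= min_count}  # surviving unique words
# rows in model.wv.vectors == len(vocab), NOT the effective-word tally

effective_words = sum(1 for doc in corpus for w in doc if w in vocab)
# (real training also randomly drops some occurrences via `sample` downsampling)

elapsed_s = 2.0  # made-up timing
rate = effective_words / elapsed_s  # this is the "effective words/s" in the log
print(raw_words, len(vocab), effective_words, rate)  # prints: 9 3 7 3.5
```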


- Gordon 
