I'm training a doc2vec model:
Now I'm wondering why `model.wv.vectors` has a shape of (44488, 300) while the number of effective words is reported as 101435.
If the total number of unique words is 101435, shouldn't there be 101435 word vectors, i.e. shouldn't `model.wv.vectors` have shape (101435, 300)?
Or is the number of effective words not the same as the number of unique words? (For `model.docvecs.vectors_docs`, the size is exactly my (number of documents, features).)
from multiprocessing import cpu_count
import gensim

num_workers = cpu_count()
# `epochs` defaults to 5 (it was the `iter` parameter, iter=5, in older word2vec)
model = gensim.models.Doc2Vec(vector_size=300,
                              dm=0,            # PV-DBOW mode
                              dbow_words=1,    # also train skip-gram word vectors
                              min_count=1,
                              alpha=0.025,
                              workers=num_workers,
                              window=10
                              #hs=hs_param
                              )
model.build_vocab(corp)  # required before train()
model.train(corp, total_examples=model.corpus_count, epochs=model.epochs)
2019-11-18 18:37:19,674 : WARNING : Effective 'alpha' higher than previous training cycles
2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10
2019-11-18 18:37:20,719 : INFO : EPOCH 1 - PROGRESS: at 0.17%
...
2019-11-18 17:46:37,074 : INFO : EPOCH 5 - PROGRESS: at 99.86% examples, 98515 words/s, in_qsize 0, out_qsize 0
2019-11-18 17:46:37,815 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-11-18 17:46:37,827 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-11-18 17:46:37,832 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-11-18 17:46:37,843 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-11-18 17:46:37,976 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-18 17:46:38,038 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-18 17:46:38,162 : INFO : EPOCH 5 - PROGRESS: at 99.99% examples, 98488 words/s, in_qsize 1, out_qsize 1
2019-11-18 17:46:38,163 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-18 17:46:38,181 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-18 17:46:38,183 : INFO : EPOCH - 5 : training on 86187639 raw words (65876679 effective words) took 668.8s, 98496 effective words/s
2019-11-18 17:46:38,185 : INFO : training on a 430938195 raw words (329373877 effective words) took 3247.1s, 101435 effective words/s
> 2019-11-18 18:37:19,675 : INFO : training model with 8 workers on 44488 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=10
(There should be other, even earlier log lines that describe the effect of `min_count` on the surviving vocabulary of 44,488 words. Note also that the 101435 in the final summary line is a rate, `101435 effective words/s`, i.e. training throughput, not a count of unique words.)
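The idea of a `min_count` trim producing a smaller "surviving" vocabulary can be sketched in plain Python (an illustration of the concept, not gensim's internal code; the tiny `corpus` and the `min_count = 2` threshold are made up here, and the question's own run used `min_count=1`):

```python
from collections import Counter

# a toy corpus of pre-tokenized documents (made-up data)
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat and a dog".split(),
]

# count every raw token across all documents
raw_counts = Counter(word for doc in corpus for word in doc)

min_count = 2  # hypothetical threshold; gensim's Doc2Vec default is 5
surviving_vocab = {w for w, c in raw_counts.items() if c >= min_count}

print(len(raw_counts))       # 9 unique raw words
print(len(surviving_vocab))  # 6 words survive the threshold
```

Only the surviving words get rows in the word-vector array, which is why the first dimension of `model.wv.vectors` equals the trimmed vocabulary size, not the raw unique-word count.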
- Gordon