The reasons that these algorithms don't give the exact same results in subsequent training or inference runs on the same data are described in the project FAQ answers:
"Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)"
Generally, if your data & parameters are sufficient, each training run will result in a model that's about as capable on downstream tasks as any other run's model – even though the coordinate spaces differ, so individual words/texts wind up at different positions.
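For example, here's a minimal sketch (assuming Gensim 4.x's Word2Vec API, with a made-up toy corpus) of how two back-to-back trainings on the same data yield different raw coordinates, while the learned relationships remain roughly comparable:

    from gensim.models import Word2Vec
    import numpy as np

    # Tiny illustrative corpus, repeated so training has something to work with
    corpus = [
        ["human", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
        ["eps", "user", "interface", "system"],
        ["system", "human", "system", "eps"],
        ["user", "response", "time"],
    ] * 50

    def train():
        return Word2Vec(corpus, vector_size=50, min_count=1, epochs=20, workers=3)

    m1, m2 = train(), train()

    # The raw coordinates differ between runs...
    print(np.allclose(m1.wv["user"], m2.wv["user"]))  # almost certainly False

    # ...but the neighbor relationships should be broadly similar
    print(m1.wv.most_similar("user", topn=3))
    print(m2.wv.most_similar("user", topn=3))

(With a corpus this tiny the neighbors will still be noisy; on a realistically-sized corpus the neighbor lists from the two runs should largely agree.)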
With regard to inference:
"Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)"
Within a single model, each inference of a sufficient text (with adequate parameters & epochs) should result in *similar* vectors, not identical vectors – and so substantive evaluations of the usefulness of these inferred vectors should be stable, even as the exact coordinates jitter a bit.
If the coordinates from repeated inferences of the same text are very different from each other, that can be a useful hint that something else is wrong. An extremely overfit or undertrained model – too little training data or very inappropriate parameters – might not give stable re-inferences. Too few epoch passes during inference can also cause instability that more epochs would improve. And a text with very few (or no) words known to the model may get fairly arbitrary/meaningless vectors.
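As a quick sanity check, something like this minimal sketch works (assuming an already-trained Doc2Vec model in a variable `model` and a tokenized text in `tokens` – both hypothetical names here): re-infer the same text several times and compare the results pairwise.

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Re-infer the same tokens several times with the same settings
    vectors = [model.infer_vector(tokens, epochs=50) for _ in range(5)]

    # Pairwise self-similarity: unusually low or wildly varying values
    # (there's no hard threshold) can hint at an undertrained model, too
    # few inference epochs, or a text with few in-vocabulary words
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            print(i, j, round(float(cosine(vectors[i], vectors[j])), 3))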
So: evaluate the vectors by whether they prove useful in downstream tests, and whether they're self-similar (and about equally useful) across runs, rather than checking for identical results.
- Gordon