Noob question - Unstable result of Doc2vec.infer_vector()

Edward Tang

May 31, 2024, 7:35:44 PM
to Gensim
Hi,
I plan to use the gensim library to build a Doc2vec model that converts the 30,000 texts in my corpus into 256-dimensional vectors for a next training step. After training the model, I use `infer_vector()` to convert unseen texts into vectors. However, I found that the vector for the same text changes significantly each time I run the code. Here is my code:

```python
train_corpus = []
for i, docu in enumerate(process_df['document']):
    train_corpus.append(gensim.models.doc2vec.TaggedDocument(docu, [i]))
model = gensim.models.doc2vec.Doc2Vec(vector_size=256, min_count=2, epochs=40, seed=109, workers=1)
model.build_vocab(train_corpus)  # the vocabulary must be built before train()
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
vector = model.infer_vector(test_df['document'][0])
```

In the code, each element of `process_df['document']` is a text, and `test_df['document'][0]` is a text that was not in the training set. By "unstable", I mean that every time I run `vector = model.infer_vector(test_df['document'][0])` in Jupyter Notebook, the vector still has 256 dimensions, but the value in each dimension changes, even though the input text (`test_df['document'][0]`) is unchanged. After consulting the API documentation, I set `seed=109, workers=1` in the model parameters, but it did not help.

I would like to know how to ensure that the vectors generated by `infer_vector()` for the same corpus remain consistent after the model is trained. Additionally, my corpus is in Chinese, and I wonder if this could be the reason for the unstable vector results.

Gordon Mohr

May 31, 2024, 7:50:13 PM
to Gensim
The reasons that these algorithms don't give the exact same results in subsequent training or inference runs on the same data are described in the project FAQ answers:

"Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)"

Generally, if your data & parameters are sufficient, each training run will result in a model that's about as capable on downstream tasks – even though the coordinate spaces differ, so words/texts end up in different positions.

With regard to inference:

"Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)"

Within a single model, each inference with sufficient text/parameters/epochs should result in *similar* vectors, not identical vectors – and so substantive evaluations of the usefulness of these inferred vectors should be stable, even as the exact coordinates jitter a bit.

If the coordinates from subsequent inferences of the same text are very different from each other, that can be a useful hint that something else is wrong. An extremely overfit or undertrained model – too little training data or very inappropriate parameters – might not give stable re-inferences. Doing too few epoch passes during inference might cause problems that more epochs could improve. A text with very few (or no) words known to the model may get fairly arbitrary/meaningless vectors.

So: evaluate the vectors by whether they prove useful in downstream tests, and are self-similar to each other (and equally useful for downstream tasks) between runs, rather than checking for identical results. 

- Gordon